내일배움단 - Development Log, Week 2 (1)

sogummi 2023. 3. 8. 23:47

Day 7

Today's study: took the Week 3 web development lecture

<Web scraping (crawling)>

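* Before installing or using a new library, check that the interpreter shown at the bottom right is ('venv': venv)
Install in the terminal: pip install bs4
<Crawling needs two things>
(1) A library for accessing the web: requests
(2) A library for sifting through the data: bs4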
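
The venv check and the install check above can also be done from Python itself. This is a minimal sketch using only the standard library; the helper names in_virtualenv and is_installed are made up for this note, and the sys.prefix comparison assumes a venv created with Python's built-in venv module.

```python
import importlib.util
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the venv directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def is_installed(module_name: str) -> bool:
    """Return True when the module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("venv active:", in_virtualenv())
    for name in ("requests", "bs4"):
        print(name, "installed:", is_installed(name))
```

If bs4 shows as not installed while the venv is active, running pip install bs4 in that same terminal should fix it.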

01. Basic crawling setup

import requests
from bs4 import BeautifulSoup

# Read the target URL and fetch its HTML
headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
data = requests.get('https://movie.naver.com/movie/sdb/rank/rmovie.naver?sel=pnt&date=20210829', headers=headers)

# Parse the HTML with the BeautifulSoup library so that it is easy to search.
# The soup variable now holds the "easy-to-parse HTML".
# From here, extract the parts you need with code.
soup = BeautifulSoup(data.text, 'html.parser')

Load the tr elements with select

a = soup.select_one('#old_content > table > tbody > tr:nth-child(3) > td.title > div > a')

trs = soup.select('#old_content > table > tbody > tr')

Loop over movies (the trs)

for tr in trs:
    # If the movie row contains an a tag,
    a = tr.select_one('td.title > div > a')
    if a is not None:
        # print the text of a.
        print(a.text)

02. select / select_one

1-1) What BeautifulSoup's select provides out of the box

To print the text inside a tag → tag.text
To print an attribute of a tag → tag['attribute']
# How to use selectors (copy selector)
soup.select('tag')
soup.select('.class')
soup.select('#id')

soup.select('parent > child > child')
soup.select('parent.class > child.class')

Finding by tag and attribute value

soup.select('tag[attribute="value"]')

When you want only one result

soup.select_one('same as above')
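
To try the selector patterns above without hitting a real site, here is a small sketch that runs them against an inline HTML string. The HTML is made up for this note (it only imitates the shape of the ranking table), and it assumes bs4 is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this example (not the real page).
HTML = """
<div id="old_content">
  <table>
    <tbody>
      <tr><td class="title"><div><a href="/movie/1">Movie A</a></div></td></tr>
      <tr><td class="title"><div><a href="/movie/2">Movie B</a></div></td></tr>
    </tbody>
  </table>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# select() returns a list of every matching tag.
links = soup.select("#old_content > table > tbody > tr > td.title > div > a")
titles = [a.text for a in links]

# select_one() returns only the first match.
first = soup.select_one("td.title a")
first_href = first["href"]  # tag['attribute'] reads an attribute value

# Tag name combined with an attribute value.
second = soup.select_one('a[href="/movie/2"]')

# When nothing matches, select_one() returns None instead of raising.
missing = soup.select_one(".no-such-class")

print(titles)
print(first_href)
print(second.text)
print(missing)
```

The None result on a miss is why a check like "if a is not None" is needed before reading a.text.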

<Crawling>
1-1) Use the Chrome developer tools (note: the copied selector is not always accurate!)
1. Right-click the part you want → Inspect
2. Right-click the tag you want
3. Copy → Copy selector copies the selector

1-2)
a = soup.select_one('#old_content > table > tbody > tr:nth-child(3) > td.title > div > a')

trs = soup.select('#old_content > table > tbody > tr')
# Loop over movies (the trs)
for tr in trs:
    a = tr.select_one('td.title > div > a')
    if a is not None:
        # If the movie row contains an a tag, print its text.
        print(a.text)

Crawling practice
trs = soup.select('#old_content > table > tbody > tr')
for tr in trs:
    a = tr.select_one('td.title > div > a')
    if a is not None:
        title = a.text
        star = tr.select_one('td.point').text
        rank = tr.select_one('td:nth-child(1) > img')['alt']
        print(rank, title, star)

<Crawling summary>
Start from the basic BeautifulSoup setup,
find the tr elements in soup first,
loop over those trs one at a time,
and in each one look for the a tag (the title), printing it only when a is not None
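
The flow summarized above can be sketched as one function that collects results instead of printing them. The function name parse_movies and the sample HTML are made up for this note; the sample only imitates the table layout (including a spacer row with no a tag) so the code runs without a network request.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML shaped like the ranking table (not the real page).
SAMPLE_HTML = """
<div id="old_content"><table><tbody>
  <tr><td><img alt="01"></td><td class="title"><div><a>Movie A</a></div></td><td class="point">9.5</td></tr>
  <tr><td colspan="3"></td></tr>
  <tr><td><img alt="02"></td><td class="title"><div><a>Movie B</a></div></td><td class="point">9.1</td></tr>
</tbody></table></div>
"""


def parse_movies(html):
    """Return a list of {rank, title, star} dicts, one per movie row."""
    soup = BeautifulSoup(html, "html.parser")
    movies = []
    # 1) Find every tr first.
    for tr in soup.select("#old_content > table > tbody > tr"):
        # 2) Look for the title link inside this row.
        a = tr.select_one("td.title > div > a")
        # 3) Spacer rows have no a tag, so skip them.
        if a is None:
            continue
        movies.append({
            "rank": tr.select_one("td:nth-child(1) > img")["alt"],
            "title": a.text,
            "star": tr.select_one("td.point").text,
        })
    return movies


if __name__ == "__main__":
    for movie in parse_movies(SAMPLE_HTML):
        print(movie["rank"], movie["title"], movie["star"])
```

With a real page, the same function could be called with data.text from requests.get, assuming the page still has this structure.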

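Day 8

SQL review QUIZ

01) Count the check-ins per week for the web development and app development comprehensive courses, including only customers who purchased on or after August 1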

SELECT c1.title, c2.week, count(*) FROM courses c1
inner join checkins c2
on c1.course_id = c2.course_id
inner join orders o 
on c2.user_id = o.user_id
where o.created_at >= '2020-08-01'
group by c1.title, c2.week
order by c1.title, c2.week

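Left Join: given tables A and B, keeps every row of A and attaches only the B rows that match A
So the order matters: which table you attach to → which table gets attached!!
Use it when you want statistics on data that is missing from one side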

Example)
select u.name, count(*) from users u
left join point_users pu on u.user_id = pu.user_id
where pu.point_user_id is not NULL
group by u.name
=> Use is not NULL or is NULL when you want statistics on the users who do or do not have the data
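
To see what is NULL and is not NULL select after a left join, here is a small sketch using Python's built-in sqlite3 module. The table contents are made up for this note, and SQLite stands in for the course database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table users (user_id text primary key, name text);
    create table point_users (point_user_id integer primary key,
                              user_id text, point integer);
    -- Made-up rows: u2 (Lee) has no row in point_users.
    insert into users values ('u1', 'Kim'), ('u2', 'Lee'), ('u3', 'Kim');
    insert into point_users values (1, 'u1', 100), (2, 'u3', 50);
""")

# Same query as the example above, with the NULL condition swapped in.
base = """
    select u.name, count(*) from users u
    left join point_users pu on u.user_id = pu.user_id
    where pu.point_user_id {cond}
    group by u.name
    order by u.name
"""

# Users that have a matching row in point_users.
with_points = conn.execute(base.format(cond="is not NULL")).fetchall()
# Users kept by the left join whose point_users side is empty.
without_points = conn.execute(base.format(cond="is NULL")).fetchall()

print(with_points)
print(without_points)
```

With an inner join instead of a left join, the is NULL query would return nothing, because unmatched users would already have been dropped.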

**Must-review problem**
Among the customers who signed up between July 10 and July 19,
find the number of customers with points, the total number, and the ratio

SELECT count(pu.point_user_id) as pnt_user_cnt,
	   count(u.user_id) as tot_user_cnt,  
	   round(count(pu.point_user_id)/count(u.user_id),2) as ratio
	from users u 
	left join point_users pu on u.user_id = pu.user_id
WHERE u.created_at BETWEEN '2020-07-10' and '2020-07-20'

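=> ~~Note: when computing ratio in select, the counts have to be wrapped together before dividing~~ xx Probably not right..
++) When computing the ratio in each of the two queries below:
In the first, the columns behind the ratio are turned into numbers by count() in the select itself, so the ratio also has to apply count() to both sides before dividing.
In the second, the columns were already turned into counts inside the subqueries of the inner join, so counting them again counts rows instead of using those values. The numerator and the denominator both become the number of joined rows, which is why the ratio comes out as 1.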

SELECT count(pu.point_user_id) as pnt_user_cnt,
	   count(u.user_id) as tot_user_cnt,  
	   round(count(pu.point_user_id)/count(u.user_id),2) as ratio
	from users u 
	left join point_users pu on u.user_id = pu.user_id
WHERE u.created_at BETWEEN '2020-07-10' and '2020-07-20'


select a.course_id, b.cnt_checkins, a.cnt_total, (b.cnt_checkins/a.cnt_total) as ratio from
(
	select course_id, count(*) as cnt_total from orders
	group by course_id
) a
inner join (
	select course_id, count(distinct(user_id)) as cnt_checkins from checkins
	group by course_id
) b
on a.course_id = b.course_id

Union

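: Use when you want to see the results together instead of running Select twice
Using the example above)))
Write the July and August queries separately, adding '7월' as month and '8월' as month ('7월' = July, '8월' = August),
wrap each one in parentheses, and put union all between them to combine the results!
+) Note that with union, the order by inside each query (inner sorting) is not applied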
(
select '7월' as month, c1.title, c2.week, count(*) as cnt from courses c1
inner join checkins c2 on c1.course_id = c2.course_id
inner join orders o on c2.user_id = o.user_id
where o.created_at < '2020-08-01'
group by c1.title, c2.week
order by c1.title, c2.week
)
union all
(
select '8월' as month, c1.title, c2.week, count(*) as cnt from courses c1
inner join checkins c2 on c1.course_id = c2.course_id
inner join orders o on c2.user_id = o.user_id
where o.created_at >= '2020-08-01'
group by c1.title, c2.week
order by c1.title, c2.week
)
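
Here is a reduced sketch of the same union all idea, runnable with sqlite3. It is simplified to a single made-up orders table with a title column (an assumption for this note, not the course schema). SQLite does not accept parentheses around the individual selects, so they are left out, and the one order by at the very end sorts the combined result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table orders (order_id integer primary key,
                         title text, created_at text);
    -- Made-up rows: three orders in July, three in August.
    insert into orders (title, created_at) values
        ('web', '2020-07-15'), ('app', '2020-07-20'), ('web', '2020-07-21'),
        ('web', '2020-08-02'), ('app', '2020-08-05'), ('app', '2020-08-09');
""")

rows = conn.execute("""
    select '7월' as month, title, count(*) as cnt from orders
    where created_at < '2020-08-01'
    group by title
    union all
    select '8월' as month, title, count(*) as cnt from orders
    where created_at >= '2020-08-01'
    group by title
    order by month, title  -- applies to the combined result
""").fetchall()

for row in rows:
    print(row)
```

Sorting once after the union is the usual way to get an ordered combined result, since sorting inside each half is not kept.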

**Count the completed lectures (done=1) per enrolled_id and sort so that the highest count comes first. user_id must be output as well.

SELECT e.enrolled_id, e.user_id, count(*) as cnt from enrolleds e
inner join enrolleds_detail ed
on e.enrolled_id = ed.enrolled_id
where ed.done = 1
group by e.enrolled_id, e.user_id
order by cnt desc

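Reflections:
I extracted data from the web, and pulled data stored in a DB to aggregate, sort, and analyze just the parts I needed. I also learned union, which combines query results. I still have to think hard to solve these problems, but even when it is hard it is fun, and I keep at it until it is solved. I plan to take the Week 4 lecture on subqueries soon.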
