Crawling and Scraping - Downloading Everything Behind a Link

˙ᵕ˙ 2020. 9. 17. 00:59

Resolving relative paths

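from urllib.parse import urljoin

base = 'http://example.com/html/a.html'

# Each relative reference is resolved against base; the result (Python 3.5+) follows as a comment
print(urljoin(base, 'b.html'))                # http://example.com/html/b.html
print(urljoin(base, 'sub/c.html'))            # http://example.com/html/sub/c.html
print(urljoin(base, '../index.html'))         # http://example.com/index.html
print(urljoin(base, '../img/hoge.png'))       # http://example.com/img/hoge.png
print(urljoin(base, '../css/hoge.css'))       # http://example.com/css/hoge.css
print(urljoin(base, './doc/car.html'))        # http://example.com/html/doc/car.html
print(urljoin(base, '../../../index.html'))   # http://example.com/index.html (extra ../ segments are dropped)
print(urljoin(base, '/hoge.html'))            # http://example.com/hoge.html
print(urljoin(base, '//hello.workd.org/test/shop.jsp'))  # http://hello.workd.org/test/shop.jsp (scheme taken from base)
print(urljoin(base, 'http://otherExample.com/wiki'))     # http://otherExample.com/wiki (absolute URLs pass through)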
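
The calls above all use a base that names a file (a.html). When the base is a directory, the trailing slash matters: without it, urljoin treats the last path segment as a file name and replaces it. The sketch below illustrates this; the two directory-style base URLs are illustrative values that do not appear in the original post.

```python
from urllib.parse import urljoin

# Base that names a file: relative links resolve against its directory.
file_base = "http://example.com/html/a.html"
# Two illustrative bases that differ only in the trailing slash.
dir_base = "http://example.com/html/"
no_slash_base = "http://example.com/html"

resolved_sibling = urljoin(file_base, "b.html")
resolved_in_dir = urljoin(dir_base, "b.html")
resolved_no_slash = urljoin(no_slash_base, "b.html")

print(resolved_sibling)    # http://example.com/html/b.html
print(resolved_in_dir)     # http://example.com/html/b.html
print(resolved_no_slash)   # http://example.com/b.html ("html" was treated as a file name)
```

This is one reason to pass a crawler a root URL that ends in a slash when it refers to a directory.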

Downloading all pages at once

# Program that recursively downloads the Python manual
# Load the required modules
from bs4 import BeautifulSoup
from urllib.request import urlretrieve
from urllib.parse import urljoin, urlparse
from os import makedirs
import os.path
import time
import re

# Tracks which files have already been processed
proc_files = {}
# Extract the links contained in an HTML document
def enum_links(html, base):
    soup = BeautifulSoup(html, "html.parser")
    links = soup.select("link[rel='stylesheet']") # CSS
    links += soup.select("a[href]") # hyperlinks
    result = []
    # Read each href attribute and convert the link to an absolute URL
    for a in links:
        href = a.attrs.get('href')
        if href is None: continue # a <link> tag may have no href
        url = urljoin(base, href)
        result.append(url)
    return result
# Download a file and save it locally
def download_file(url):
    o = urlparse(url)
    savepath = "./" + o.netloc + o.path
    if re.search(r"/$", savepath): # a directory URL is saved as index.html
        savepath += "index.html"
    savedir = os.path.dirname(savepath)
    # Skip the download if the file already exists
    if os.path.exists(savepath): return savepath
    # Create the destination directory
    if not os.path.exists(savedir):
        print("mkdir=", savedir)
        makedirs(savedir)
    # Download the file
    try:
        print("download=", url)
        urlretrieve(url, savepath)
        time.sleep(1) # pause for 1 second between requests
        return savepath
    except Exception: # a bare except would also swallow Ctrl+C
        print("download failed: ", url)
        return None
# Parse an HTML page and download everything it links to
def analyze_html(url, root_url):
    savepath = download_file(url)
    if savepath is None: return
    if savepath in proc_files: return # already processed, so skip it
    proc_files[savepath] = True
    print("analyze_html=", url)
    # Extract the links
    with open(savepath, "r", encoding="utf-8") as f:
        html = f.read()
    links = enum_links(html, url)
    for link_url in links:
        # Ignore links outside the root URL, except for CSS files
        if not link_url.startswith(root_url):
            if not re.search(r"\.css$", link_url): continue
        # If it is an HTML page
        if re.search(r"\.(html|htm)$", link_url):
            # Recursively analyze the HTML file
            analyze_html(link_url, root_url)
            continue
        # Any other file
        download_file(link_url)
if __name__ == "__main__":
    # Download everything under the URL
    url = "https://docs.python.org/3.8/library/"
    analyze_html(url, url)
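
One thing the program above does not handle is links that carry a fragment, such as functions.html#abs. They do not end in .html, so they are downloaded but never analyzed as HTML, and the same page can appear many times in the link list. A minimal sketch of one way to tidy the list before crawling is shown below. The helper name normalize_links and the sample URLs are assumptions made for illustration and are not part of the original program.

```python
from urllib.parse import urldefrag, urlparse

def normalize_links(links, root_url):
    """Hypothetical helper: strip fragments, drop duplicates,
    and keep only URLs under root_url plus CSS files."""
    seen = set()
    result = []
    for link in links:
        # urldefrag splits "page.html#section" into ("page.html", "section")
        url, _fragment = urldefrag(link)
        if url in seen:
            continue
        seen.add(url)
        # Check the extension on the path only, so query strings do not interfere
        path = urlparse(url).path
        if url.startswith(root_url) or path.endswith(".css"):
            result.append(url)
    return result

# Illustrative input; these URLs are sample values only
root = "https://docs.python.org/3.8/library/"
links = [
    root + "functions.html#abs",
    root + "functions.html#all",
    "https://docs.python.org/3.8/_static/pydoctheme.css",
    "https://www.python.org/",
]
print(normalize_links(links, root))
```

With the sample input, the two functions.html links collapse into one, the stylesheet is kept even though it lives outside the root, and the unrelated site is dropped. The output of enum_links could be passed through such a helper before the loop in analyze_html.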
