알쓸개잡 Part 3 (Scrapy)

What is Scrapy?

Scrapy is an open-source web crawling framework written in Python.
It is designed for web scraping,
and you crawl by writing Spiders.

Installing Scrapy

Install Scrapy with the following command.

pip install scrapy

Creating a Scrapy project

Create a Scrapy project with the following command.

scrapy startproject {project-name}

Once the project is created, Scrapy automatically generates the project directory,
and its basic structure is as follows.

{project-name}/
├── scrapy.cfg
└── {project-name}/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py

Spider

A Spider is a class that defines how to crawl and how to extract structured data from pages.

scrapy genspider {spider-name} {crawl-url}

Generating a Spider with the command above creates a class that inherits from scrapy.Spider.

For example, generating a naver spider to scrape naver produces the following class.

import scrapy


class NaverSpider(scrapy.Spider):
    name = 'naver'
    allowed_domains = ['www.naver.com']
    start_urls = ['http://www.naver.com/']

    def parse(self, response):
        pass

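name : The spider's name. Scrapy uses it to locate the spider, so it must be unique within the project.
allowed_domains : The domains the spider is allowed to crawl. Requests to URLs outside these domains and their subdomains are filtered out and not crawled, rather than raising an error.
start_urls : The list of URLs where the crawl starts.
parse : The default callback. Scrapy sends a request for each URL in start_urls and then calls this method with each response; this is where the response data is handled.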

Items

Items turn the data extracted by a Spider into structured data.
(I find Items very similar to Entities.)

Through the itemadapter library, Scrapy supports the following item types:
dictionaries, Item objects, dataclass objects, attrs objects

※ Declaring an item as a dataclass object

from dataclasses import dataclass

@dataclass
class CustomItem:
    one_field: str
    another_field: int

In the Spider, you parse the response data and put it into an Item object.

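The resulting Item objects can be saved to a file in the format you want. Scrapy infers the format from the file extension, so the -t option (deprecated in recent Scrapy versions) is not needed.
ex)
scrapy crawl {spider-name} -o example.csv
scrapy crawl {spider-name} -o example.json
scrapy crawl {spider-name} -o example.xml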
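
The same export can also be configured in the project settings instead of on the command line. This is a sketch that assumes Scrapy 2.1 or later, where the FEEDS setting is available; the file names simply mirror the commands above.

```python
# settings.py (fragment)
# Each key is an output path; each value describes how to write that feed.
FEEDS = {
    "example.json": {"format": "json", "encoding": "utf8"},
    "example.csv": {"format": "csv"},
}
```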

Pipeline

Once a Spider has parsed the response data into an Item object, that data is passed to the Pipeline first.
Before the data is output, the Pipeline handles tasks such as validating it or saving it to a DB.

※ Saving to MongoDB

import pymongo
from itemadapter import ItemAdapter

class MongoPipeline:

    # Name of the MongoDB collection that items are written to.
    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Build the pipeline from the project settings.
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        # Open the connection when the spider starts.
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        # Close the connection when the spider finishes.
        self.client.close()

    def process_item(self, item, spider):
        # Insert the item as a document, then pass it on to the next pipeline.
        self.db[self.collection_name].insert_one(ItemAdapter(item).asdict())
        return item
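
A pipeline only runs once it is enabled in the project settings. The fragment below is a sketch: the module path myproject.pipelines and the local MongoDB URI are placeholders, so replace them with your own project name and connection string.

```python
# settings.py (fragment)
# The number sets the run order: lower values run first (typical range 0-1000).
ITEM_PIPELINES = {
    "myproject.pipelines.MongoPipeline": 300,  # hypothetical module path
}

# Read by MongoPipeline.from_crawler(); both values are assumptions.
MONGO_URI = "mongodb://localhost:27017"
MONGO_DATABASE = "items"
```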

Selector

Scrapy provides Scrapy Selectors, a data extraction mechanism based on XPath or CSS expressions.

ex)
xpath selector : response.xpath('//span/text()').get()
css selector : response.css('span::text').get()

References

Official Scrapy documentation : https://scrapy.org/
