Notes and thoughts from Tony

Scrapy Anti-Ban Guide

Most websites employ anti-scraping mechanisms of varying strictness; to scrape more of their data, you need to adopt the corresponding countermeasures.

Disable ROBOTSTXT_OBEY

In settings.py, turn off robots.txt compliance, since most websites are not exactly welcoming to crawlers:

ROBOTSTXT_OBEY = False

Set the user agent dynamically

After running $ scrapy shell https://tonyh2021.github.io/articles/2017/12/15/Scrapy-Tutorial.html, inspect the request:

>>> request.headers
{b'Accept': [b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], b'Accept-Language': [b'en'], b'User-Agent': [b'Scrapy/1.4.0 (+http://scrapy.org)'], b'Accept-Encoding': [b'gzip,deflate']}

The User-Agent is Scrapy/1.4.0 (+http://scrapy.org), which plainly gives the crawler away.

Scrapy provides a middleware for setting the user agent. For example:

import logging

import fake_useragent
from fake_useragent import FakeUserAgentError
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
# fake_useragent is a third-party library that returns random user agents

class RandomUserAgentMiddleware(UserAgentMiddleware):
    # Use a pool of user agents to avoid getting banned.
    # Note: the corresponding settings in settings.py are required.

    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        try:
            ua = fake_useragent.UserAgent().random
        except FakeUserAgentError:
            logging.log(logging.ERROR, 'Get UserAgent Error')
            # Fall back to the configured default so ua is always defined
            ua = self.user_agent

        # Log the user agent currently in use
        logging.log(logging.INFO, 'Current UserAgent: %s' % ua)
        request.headers['User-Agent'] = ua

You also need the corresponding settings in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapydemo.middlewares.RandomUserAgentMiddleware': 543,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}

Running the update always failed for me, probably because of my proxy or the Great Firewall, and when crawling Zhihu I frequently hit "browser version too low" errors, so in the end I did not use fake_useragent:

>>> from fake_useragent import UserAgent
>>> ua = UserAgent()
>>> ua.update()
fake_useragent.errors.FakeUserAgentError: Maximum amount of retries reached

The final code:

import logging
import random
from scrapy import signals
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

class RandomUserAgentMiddleware(UserAgentMiddleware):
    # Pick a random user agent per request from a list loaded from a file

    def __init__(self, settings, user_agent='Scrapy'):
        super(RandomUserAgentMiddleware, self).__init__()
        self.user_agent = user_agent
        user_agent_list_file = settings.get('USER_AGENT_LIST')
        if not user_agent_list_file:
            # If the USER_AGENT_LIST setting is not set,
            # fall back to the default USER_AGENT or whatever
            # was passed to the middleware.
            ua = settings.get('USER_AGENT', user_agent)
            self.user_agent_list = [ua]
        else:
            with open(user_agent_list_file, 'r') as f:
                self.user_agent_list = [line.strip() for line in f.readlines()]

    @classmethod
    def from_crawler(cls, crawler):
        obj = cls(crawler.settings)
        crawler.signals.connect(obj.spider_opened,
                                signal=signals.spider_opened)
        return obj

    def process_request(self, request, spider):
        user_agent = random.choice(self.user_agent_list)
        logging.log(logging.INFO, "Current User-Agent: %s" % (user_agent))
        if user_agent:
            request.headers.setdefault('User-Agent', user_agent)

In settings.py:

USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36')
USER_AGENT_LIST = 'useragents.txt'

...

DOWNLOADER_MIDDLEWARES = {
    'scrapydemo.middlewares.RandomUserAgentMiddleware': 543,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
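
The middleware reads one user agent string per line, so useragents.txt is just a plain text file along these lines (the entries here are illustrative):

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36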

Disable cookies

Some websites may use cookies to identify crawlers, so disable them.

In settings.py:

COOKIES_ENABLED = False

Set DOWNLOAD_DELAY

Set a download delay (2 seconds or more). See the documentation for DOWNLOAD_DELAY.

In settings.py:

DOWNLOAD_DELAY = 2
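
Note that Scrapy already randomizes the actual delay around this value (RANDOMIZE_DOWNLOAD_DELAY is on by default), and its AutoThrottle extension can adapt the delay to server load. A sketch of enabling it in settings.py (the values are illustrative):

# Enable Scrapy's AutoThrottle extension
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2   # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 10    # upper bound when the server is slow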

Use the Google cache

If possible, prefer scraping pages from the Google cache rather than the site itself, for example:
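
A minimal sketch: prefix the target URL with the standard Google cache endpoint (the spider name and target URL here are placeholders):

import scrapy

GOOGLE_CACHE = 'http://webcache.googleusercontent.com/search?q=cache:'

class CachedSpider(scrapy.Spider):
    name = 'cached_demo'
    # Fetch Google's cached copy instead of hitting the site directly
    start_urls = [GOOGLE_CACHE + 'tonyh2021.github.io/articles/2017/12/15/Scrapy-Tutorial.html']

    def parse(self, response):
        # Parse the cached page just like the live one
        self.logger.info('Fetched %s', response.url)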

Use rotating proxies

When the crawler's IP gets banned, switch to a new IP dynamically and keep crawling.

Within China you can use 西刺代理 (Xici proxies); elsewhere, the Tor project; the open-source project scrapoxy is another option.

There is too much to cover here, so it deserves a post of its own, but a minimal sketch follows.
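
A rotating-proxy downloader middleware could look like this (the PROXY_LIST setting and the proxy addresses are hypothetical placeholders); register it in DOWNLOADER_MIDDLEWARES just like the user agent middleware above:

import random

class RandomProxyMiddleware(object):
    # Rotate through a list of proxies, one chosen at random per request

    def __init__(self, proxies):
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        # Hypothetical setting, e.g. in settings.py:
        # PROXY_LIST = ['http://1.2.3.4:8080', 'http://5.6.7.8:3128']
        return cls(crawler.settings.getlist('PROXY_LIST'))

    def process_request(self, request, spider):
        if self.proxies:
            # Scrapy's built-in HttpProxyMiddleware honours request.meta['proxy']
            request.meta['proxy'] = random.choice(self.proxies)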

Distributed crawling

Use a highly distributed crawling service, such as Crawlera.
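
With the scrapy-crawlera plugin, this boils down to a few settings (a sketch based on the plugin's documented usage; the API key is a placeholder):

DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawlera.CrawleraMiddleware': 610,
}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = '<your api key>'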