# Scrapy Anti-Ban Guide

Most websites employ some degree of anti-scraping protection, so to collect data at scale you need the corresponding countermeasures. Below are the ones I use with Scrapy.

## Disable ROBOTSTXT_OBEY

In `settings.py`, set `ROBOTSTXT_OBEY = False`. Otherwise Scrapy honors each site's robots.txt, and most websites won't welcome your spider.

## Dynamically Set the User-Agent

After running the spider, inspect the outgoing request headers: the default User-Agent is `Scrapy/<version> (+https://scrapy.org)`, which is an obvious giveaway. Scrapy lets you replace the User-Agent through a downloader middleware, which you then register in `settings.py`. I kept running into errors whenever the user-agent library tried to update its data — possibly a proxy issue or the Great Firewall — and when scraping Zhihu I often hit "browser version too old" errors, so I ultimately did not use that library. The final version is a custom middleware registered in `settings.py`.

## Disable Cookies

Some websites use cookies to identify scrapers, so it is a good idea to disable them: set `COOKIES_ENABLED = False` in `settings.py`.

## Set DOWNLOAD_DELAY

Set a download delay of 2 seconds or more via `DOWNLOAD_DELAY` in `settings.py`; see Scrapy's documentation on `DOWNLOAD_DELAY` for details.

## Use Google Cache

If possible, scrape Google's cached copy of a page rather than the live site.

## Use Dynamic Proxies

Once your scraper's IP is blocked, you can switch to a new IP and continue scraping. In China you can use Xici Proxy; internationally, the Tor project or the open-source scrapoxy project. This topic is involved enough that I've written a separate post about it.

## Distributed Crawlers

Use a highly distributed crawling service.
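Pulling the settings discussed above into one place, a minimal `settings.py` might look like the sketch below. The `myproject.middlewares` module path is an assumption; adjust it to your own project layout.

```python
# settings.py -- minimal anti-ban sketch; "myproject" is a placeholder name.

ROBOTSTXT_OBEY = False   # do not honor robots.txt
COOKIES_ENABLED = False  # some sites fingerprint scrapers via cookies
DOWNLOAD_DELAY = 2       # wait at least 2 seconds between requests

DOWNLOADER_MIDDLEWARES = {
    # Enable a custom user-agent middleware (hypothetical path) ...
    "myproject.middlewares.RandomUserAgentMiddleware": 400,
    # ... and disable Scrapy's built-in one so it doesn't overwrite our header.
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
}
```

Setting a built-in middleware's priority to `None` is Scrapy's convention for disabling it.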
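A minimal sketch of the user-agent middleware itself, under the assumption that rotating through a hard-coded list of browser strings is enough. The `_FakeRequest` class is only a stand-in for `scrapy.Request` so the sketch runs standalone; in a real project the class goes in `middlewares.py` and Scrapy passes the real request in.

```python
import random

# A small, hand-picked pool of desktop browser user-agent strings; extend as needed.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

class RandomUserAgentMiddleware:
    """Downloader middleware that assigns a random User-Agent to every request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        return None  # returning None lets Scrapy continue processing the request

class _FakeRequest:  # stand-in for scrapy.Request in this standalone sketch
    def __init__(self):
        self.headers = {}

request = _FakeRequest()
RandomUserAgentMiddleware().process_request(request, None)
```

After the middleware runs, every outgoing request carries one of the browser strings instead of the default Scrapy identifier.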
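For the Google-cache approach, the cached copy of a page lived at a predictable URL, so a tiny helper can rewrite target URLs before scheduling them. Note that Google has since retired its public page cache, so treat this as a historical sketch.

```python
# URL prefix Google used for serving cached page copies.
CACHE_PREFIX = "http://webcache.googleusercontent.com/search?q=cache:"

def google_cache_url(url: str) -> str:
    """Return the Google-cache URL for a page instead of the live URL."""
    return CACHE_PREFIX + url
```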
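The dynamic-proxy idea can be sketched as another downloader middleware: Scrapy's HTTP downloader routes a request through whatever address is in `request.meta["proxy"]`. The proxy addresses below are placeholders; in practice you would fill the pool from a provider such as Xici, a Tor gateway, or a scrapoxy instance, and evict proxies that start failing. As above, `_FakeRequest` only stands in for `scrapy.Request` so the sketch runs on its own.

```python
import random

# Hypothetical proxy pool -- replace with live proxy addresses.
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
]

class RandomProxyMiddleware:
    """Downloader middleware that routes each request through a random proxy."""

    def process_request(self, request, spider):
        # Scrapy's downloader honors the 'proxy' key in request.meta.
        request.meta["proxy"] = random.choice(PROXIES)

class _FakeRequest:  # stand-in for scrapy.Request in this standalone sketch
    def __init__(self):
        self.meta = {}

request = _FakeRequest()
RandomProxyMiddleware().process_request(request, None)
```

Register the class in `DOWNLOADER_MIDDLEWARES` the same way as the user-agent middleware.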