Scrapy Proxy Guide
As mentioned before, using dynamic proxies is an effective way to keep your scraper from getting banned. But once you actually dig in, you will run into quite a few problems.

Using Dynamic Proxies

The main configuration lives in a downloader middleware, which assigns a proxy to each outgoing request before it is sent.

Scraping for Proxies

You can of course write a tool that periodically scrapes free proxies for your spider to use, or rely on a third-party tool such as IPProxyPool. Free proxy sources:

http://www.66ip.cn
http://cn-proxy.com
https://proxy.mimvp.com/free.php
http://www.kuaidaili.com
http://www.cz88.net/proxy
http://www.ip181.com
http://www.xicidaili.com
https://proxy-list.org/english/index.php
https://hidemy.name/en/proxy-list
http://www.cnproxy.com/proxy1.html
https://free-proxy-list.net/anonymous-proxy.html

Integrating SOCKS

Sometimes you need to scrape resources that sit behind a firewall, which calls for a tool such as SOCKS. However, Scrapy cannot speak SOCKS directly, so shadowsocks is not natively supported. A preliminary solution is to place an HTTP proxy between Scrapy and SOCKS. The shadowsocks client ships with this functionality built in, so you can simply update your proxy setting to point at the local address the shadowsocks client provides. But what if the IP address of your shadowsocks server itself gets blocked? That's where the Tor project comes in; introducing Tor deserves a separate post.
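The middleware mentioned under "Using Dynamic Proxies" can be sketched roughly as follows. This is a minimal illustration, not a full implementation: the `PROXY_LIST` contents, the `RandomProxyMiddleware` name, and the module path in `DOWNLOADER_MIDDLEWARES` are all placeholders you would replace with your own values (for example, a list fed by a proxy scraper or IPProxyPool).

```python
import random

# Illustrative list of proxy endpoints; in practice this would be
# populated from a proxy scraper or a pool such as IPProxyPool.
PROXY_LIST = [
    "http://127.0.0.1:8001",
    "http://127.0.0.1:8002",
]

class RandomProxyMiddleware:
    """Downloader middleware that picks a random proxy per request."""

    def process_request(self, request, spider):
        # Scrapy's HttpProxyMiddleware honors request.meta["proxy"],
        # so setting it here routes the request through that proxy.
        request.meta["proxy"] = random.choice(PROXY_LIST)
```

To enable it, register the class in your project's `settings.py` (the module path below is hypothetical):

```python
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.RandomProxyMiddleware": 543,
}
```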