Scrapy and Tor
Hard to categorize, so it ended up under "Programming Concepts"; you know how it is. Integrating Scrapy with Tor is yet another rabbit hole.

Installation

Tor (The Onion Router) is free software for enabling anonymous communication. For a more accessible introduction, see Common Questions About TOR. If the link doesn't open, you're not quite ready to step through that door yet.

Installing Tor is straightforward:

Configuration

Currently, Tor cannot be used directly from mainland China, so you need to configure a "front proxy" for it. This guide uses shadowsocks as the front proxy.

Create a configuration file named under . You can refer to the file there to see all available options and descriptions. Key parameters:

Parameter | Description
---- | ---
ControlPort | Port for control programs; required if you use tools like nyx
Socks5Proxy | Address (host:port) of the upstream SOCKS5 proxy
HTTPProxy | Address (host:port) of the upstream HTTP proxy
HTTPSProxy | Address (host:port) of the upstream HTTPS proxy
SocksPort | Port through which external programs access Tor
MaxCircuitDirtiness | How long (in seconds) a circuit may be reused for new connections; in effect, the interval for automatic IP rotation

Final configuration:

Now in Chrome, use to create a new profile named with manual proxy settings: SOCKS proxy at 127.0.0.1, port 9000 (matching the configured ). Select the newly created profile in and test browsing. It works.

I actually spent an entire afternoon on this step, and got it right only after carefully reading the documentation to find the correct configuration. (ಥ _ ಥ)...

At this point, Scrapy still cannot use Tor directly: Scrapy expects an HTTP proxy, while Tor only provides a SOCKS proxy. An intermediate layer is needed to convert one into the other.

Privoxy

Privoxy is an HTTP filtering proxy that is commonly used alongside Tor.

Find the file under (create it if it doesn't exist), and add the following:

The first line sets Privoxy to listen on port 8118 on any IP address. The second line forwards traffic to the local SOCKS5 port provided by Tor; don't forget the trailing space and period at the end.
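As a rough illustration of the two lines described above, the sketch below builds them in Python. `listen-address` and `forward-socks5` are Privoxy directives; the helper name `privoxy_config_lines`, the bind address `0.0.0.0`, and the assumption that Tor's SOCKS port is 9000 on 127.0.0.1 (as in the Chrome test earlier) are illustrative choices, not necessarily the exact contents of the original file.

```python
def privoxy_config_lines(listen_port=8118, socks_host="127.0.0.1", socks_port=9000):
    """Build the two Privoxy directives described in the text (hypothetical helper)."""
    return [
        # Listen on every interface (assumed here to mean 0.0.0.0) on the given port.
        f"listen-address 0.0.0.0:{listen_port}",
        # Forward all URLs ("/") to the local SOCKS5 port. The trailing " ."
        # means "no further HTTP parent proxy" and must not be dropped.
        f"forward-socks5 / {socks_host}:{socks_port} .",
    ]


if __name__ == "__main__":
    print("\n".join(privoxy_config_lines()))
```

If Tor's `SocksPort` is set to something other than 9000, pass it as `socks_port` so the forward line matches.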
Then run:

This starts Privoxy using the configuration file. Run the following command to check the running status:

If you see output like this, it's working:

You can also use to create a proxy profile and verify it works. For more details, see How to Use Privoxy.

Scrapy Integration

In :
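A minimal sketch of the Scrapy side, assuming Privoxy listens on 127.0.0.1:8118 as configured above. Scrapy's built-in `HttpProxyMiddleware` honors `request.meta["proxy"]`, so a small downloader middleware can set that key on every request. The class name `TorProxyMiddleware`, the module path `myproject.middlewares`, and the priority 543 are hypothetical choices for illustration, not the author's original code.

```python
# middlewares.py (hypothetical module in a project assumed to be named "myproject")

PRIVOXY_URL = "http://127.0.0.1:8118"  # assumed: Privoxy's listen port from above


class TorProxyMiddleware:
    """Route every outgoing request through Privoxy, which forwards to Tor."""

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware reads this meta key.
        request.meta["proxy"] = PRIVOXY_URL
        return None  # let the request continue through the remaining middlewares


# settings.py: enable the middleware. The priority is an arbitrary value
# below 750, where the built-in HttpProxyMiddleware runs by default.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.TorProxyMiddleware": 543,
}
```

Setting the proxy in a middleware rather than in each spider keeps the spiders unchanged, so switching Tor off later only requires removing the entry from `DOWNLOADER_MIDDLEWARES`.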