Scrapy Architecture Overview

Introduction

When it comes to Python practice projects, web scraping is probably the most popular choice. Libraries like Requests and BeautifulSoup feel a bit like toys when used on their own, so I wanted to try Scrapy. This post introduces Scrapy's architecture, drawing mainly from the official documentation Architecture overview, which shows Scrapy's architecture and how its components interact.

Data Flow

The diagram below illustrates Scrapy's architecture and the data flow (red arrows) between components. A brief description of each component and the data flow follows.

In Scrapy, data flow is controlled by the execution engine. The main steps are:

1. The Engine gets the initial Requests to crawl from the Spider.
2. The Engine passes the Requests to the Scheduler for unified scheduling, and also retrieves the next Requests to crawl.
3. The Scheduler returns the next Requests to crawl to the Engine.
4. The Engine sends these Requests to the Downloader via the Downloader Middlewares.
5. Once a page has been downloaded, the Downloader generates a Response (containing the page content) and sends it to the Engine via the Downloader Middlewares.
6. The Engine receives the Response from the Downloader and sends it to the Spider for parsing via the Spider Middlewares.
7. The Spider parses the Response and returns the parsed content and subsequent Requests to the Engine via the Spider Middlewares.
8. The Engine sends the processed content to the Item Pipelines, then sends the subsequent Requests to the Scheduler for scheduling and retrieves the next Requests to crawl.
9. Repeat from step 1 until all crawl requests in the Scheduler have been processed.

Regarding steps 2 and 8, I think the interaction between the Engine and the Scheduler is best understood this way: the Engine hands off all pending Requests to the Scheduler, which manages them in a queue-like structure, and continuously feeds the next Requests back to the Engine.
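
To make the loop in the steps above concrete, here is a toy simulation in plain Python. It is only a sketch and does not use Scrapy: `FAKE_SITE`, `ToyScheduler`, `download`, `parse` and `run_engine` are all made-up names, the middlewares are left out, and the FIFO queue with a "seen" set is an assumption for illustration (Scrapy's real scheduler is more involved).

```python
from collections import deque

# Hypothetical stand-in for a website: URL -> (page text, links on the page).
FAKE_SITE = {
    "/": ("home", ["/a", "/b"]),
    "/a": ("page a", ["/b"]),
    "/b": ("page b", []),
}


class ToyScheduler:
    """Keeps pending requests in a FIFO queue and skips URLs already seen."""

    def __init__(self):
        self.queue = deque()
        self.seen = set()

    def enqueue(self, url):
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def next_request(self):
        return self.queue.popleft() if self.queue else None


def download(url):
    """Toy Downloader: turns a request (a URL) into a response (a dict)."""
    text, links = FAKE_SITE[url]
    return {"url": url, "text": text, "links": links}


def parse(response):
    """Toy Spider callback: yields one item, then the follow-up requests."""
    yield {"item": response["text"]}
    for link in response["links"]:
        yield link


def run_engine(start_urls):
    """Toy Engine: moves data between the components until the queue is empty."""
    scheduler = ToyScheduler()
    items = []  # stands in for the Item Pipeline
    for url in start_urls:                  # steps 1-2
        scheduler.enqueue(url)
    while True:
        url = scheduler.next_request()      # step 3
        if url is None:                     # step 9: nothing left to crawl
            break
        response = download(url)            # steps 4-5
        for result in parse(response):      # steps 6-7
            if isinstance(result, dict):
                items.append(result)        # step 8: items go to the pipeline
            else:
                scheduler.enqueue(result)   # step 8: requests go to the scheduler
    return items


crawled = run_engine(["/"])
print(crawled)
```

In this sketch the crawl order comes entirely from how the toy scheduler orders its queue; the engine only asks for whatever request is next.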
In other words, the Scheduler is responsible for maintaining the list of Requests and determining the order of crawling.

Component Descriptions

Scrapy Engine

The Engine is responsible for controlling data flow between all components of the system, and for triggering the appropriate events when certain actions occur. See the Data Flow section for details.

Scheduler

The Scheduler receives crawl requests from the Engine, enqueues them, and provides them back to the Engine when requested.

Downloader

The Downloader is responsible for fetching web pages and returning them to the Engine, which then sends them to the Spiders.

Spiders

Spiders are custom classes we write to parse responses and extract the required content or the next requests to follow. For more details, see Spiders.

Item Pipeline

The Item Pipeline is responsible for processing the content extracted by Spiders, for example data cleaning, validation, and persistence (storing to a database). For more details, see Item Pipeline.

Downloader middlewares

Downloader middlewares sit between the Engine and the Downloader. They process requests as they pass from the Engine to the Downloader, and responses as they pass from the Downloader back to the Engine.

Use Downloader middlewares when you need to:

- Process a request before it is sent to the Downloader (i.e., before Scrapy sends the request to the website);
- Process a response before it is sent to the Spider;
- Send a new Request instead of passing a received response to a Spider;
- Pass a response to a Spider without fetching a web page;
- Silently drop certain requests.

For more details, see Downloader middlewares.

Spider middlewares

Spider middlewares sit between the Engine and the Spiders. They handle Spider input (responses) and output (parsed content and requests).
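
As a rough illustration of handling Spider input and output, the sketch below is a plain class with the two hook methods that Scrapy spider middlewares define, `process_spider_input` and `process_spider_output`. The class name, the `min_length` rule and the dict-shaped items are hypothetical. Nothing is imported from Scrapy, and the hook is called by hand with plain values to imitate what the framework would do.

```python
class DropShortTitlesMiddleware:
    """Hypothetical spider middleware that filters the Spider's output."""

    min_length = 3  # assumed rule, for illustration only

    def process_spider_input(self, response, spider):
        # Called for each response on its way to the Spider.
        # Returning None lets processing continue.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with everything the Spider callback returned.
        for entry in result:
            is_item = isinstance(entry, dict)
            if is_item and len(entry.get("title", "")) < self.min_length:
                continue  # drop items whose title is too short
            yield entry  # pass requests and other items through unchanged


# Calling the hook by hand; plain values stand in for Scrapy objects.
middleware = DropShortTitlesMiddleware()
spider_output = [{"title": "ok"}, {"title": "Scrapy"}, "/next-page"]
filtered = list(middleware.process_spider_output(None, spider_output, None))
print(filtered)
```

In a real project, a class like this is enabled through the `SPIDER_MIDDLEWARES` setting.
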
Use Spider middlewares when you need to:

- Post-process Spider output, for example changing, adding, or removing requests or items;
- Post-process the Spider's initial requests;
- Handle Spider exceptions;
- Call the errback instead of the callback for some requests, based on response content.

For more details, see Spider middlewares.

Event-driven networking

Scrapy is written with Twisted, a popular event-driven networking framework for Python. As a result, it uses non-blocking (asynchronous) code for concurrency.

For more information on asynchronous programming and Twisted, see these links:

- Introduction to Deferreds in Twisted
- Twisted - hello, asynchronous programming
- Twisted Introduction - Krondo
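
To get a feel for what non-blocking concurrency means, here is a minimal sketch. Note that it does not use Twisted: it uses the standard library's asyncio as a stand-in to show the same general idea, and `fake_download` and `crawl` are made-up names with `asyncio.sleep` pretending to be network latency.

```python
import asyncio


async def fake_download(name, delay, finished):
    # asyncio.sleep stands in for waiting on the network without blocking.
    await asyncio.sleep(delay)
    finished.append(name)
    return f"response for {name}"


async def crawl():
    finished = []  # records the order in which downloads complete
    responses = await asyncio.gather(
        fake_download("slow-page", 0.2, finished),
        fake_download("fast-page", 0.01, finished),
    )
    return responses, finished


responses, finished = asyncio.run(crawl())
print(responses)
print(finished)
```

Both downloads wait at the same time, so the fast page finishes first even though the slow one was started first. `asyncio.gather` still returns the responses in the order the downloads were passed in.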