# Scrapy Tutorial
## Introduction

This post uses Scrapy to build a sample web scraper. Installation of Scrapy is omitted here; installing it via pip is straightforward.

## Creating a Project

Run the `scrapy startproject` command to generate a new project. The generated project contains a `scrapy.cfg` file and a project module with `items.py`, `pipelines.py`, `settings.py`, and a `spiders/` directory for spider code.

## Creating the First Spider

A spider is what we commonly call a "crawler." In Scrapy, we write custom spider classes to parse data from websites. A custom spider class inherits from `scrapy.Spider` and defines the initial URLs, how to follow links on pages, and how to parse page content to extract and generate items.

Create a new file in the project's `spiders/` directory with a class that inherits from `scrapy.Spider`. Its main attributes and methods are:

- `name`: the spider's identifier. It must be unique within the project.
- `start_requests()`: must return an iterable of requests (either a list of requests or a generator). The spider crawls these requests first; subsequent URLs are extracted from the content retrieved by these initial requests.
- `parse()`: the response returned by a request is passed to this method for parsing. The `response` parameter is an instance of the `TextResponse` class, which contains the page content and provides several useful methods. `parse()` is typically used to parse responses, extract scraped data as dictionaries, and find new URLs from which to create new `Request` objects.

## Running the Spider

Run `scrapy crawl` with the spider's name from the command line. This runs the spider we just created and sends requests to the stackoverflow website. Two new files, `stackoverflow-1.html` and `stackoverflow-2.html`, are created in the current directory, each containing the content of the corresponding URL.

## How It Works

Scrapy takes the `scrapy.Request` objects returned by the spider's `start_requests()` method and queues them for unified scheduling. For each response received, a `Response` object is instantiated and passed to the callback registered for the request (in this case `parse()`).

## Simplifying start_requests

Instead of implementing `start_requests()`, we can simply define a `start_urls` list in the spider.
This list is used by the default `start_requests()` implementation to create the spider's initial requests, and Scrapy calls `parse()` by default with the response data for each request.

## Parsing Data

The Scrapy shell is a good way to learn how to parse data with Scrapy. Launch it with `scrapy shell` followed by a URL, and then run CSS queries against the response, for example against the page title.

Running `response.css()` returns a `SelectorList` object, which holds `Selector` objects wrapping XML/HTML elements for further parsing. To extract the text from the title, append `::text` to the query and call `getall()`. Two things are worth noting here:

First, using `::text` means only text nodes within the element are selected. Without `::text`, you get the full title element including its tags.

Second, the query returns a list, a `SelectorList` instance. To get only the first result, call `get()` instead of `getall()`, or index into the list. Using `get()` avoids an `IndexError` when no match is found; it returns `None` instead. Error handling is important in most scraping projects: even when errors occur during scraping, you can still retrieve partial data.

In addition to `get()` and `getall()`, you can also use the `re()` method with regular expressions. And `view(response)` opens the response page in a browser, which helps when working out the appropriate CSS selectors.

## XPath

In addition to CSS, you can also use XPath via `response.xpath()`. Scrapy selectors are actually built on top of a powerful XPath implementation; in fact, CSS selectors are converted to XPath under the hood. See the following resources for more details:

- using XPath with Scrapy Selectors
- a tutorial that teaches XPath through examples
- a tutorial on "how to think in XPath"

## Parsing Questions

In the shell, first select the element for the first question, then extract its content, and finally get the answer excerpt.

## Parsing Data in the Spider

Let's integrate the shell commands above into our code. A Scrapy spider typically needs to generate dictionaries from page-extracted content, so we use the `yield` keyword in `parse()`. Run the spider and the scraped dictionaries appear in the output.

## Storing Data

The simplest approach is to use Feed exports: pass the `-o` option to `scrapy crawl`, and the scraped items are written to a JSON file in the current directory.
For historical reasons, running the command again appends to the file rather than overwriting it, so if you don't remove the file before re-running, the JSON output will be corrupted. You can also use other formats, such as JSON Lines. The JSON Lines format handles multiple runs and appended records without any formatting issues. Additionally, since each record sits on its own line, large files can be processed without loading all the content into memory (command-line tools such as jq work well here).

The above is sufficient for small projects. For more complex processing, you will need to use an Item Pipeline; a `pipelines.py` file is created by default at project initialization.

## Following Links

Sometimes we need to crawl content from subsequently linked pages, such as the "next page" link element that appears on the listing page. In the shell, select that element and then extract its `href` attribute. Then update the spider so it can parse the next page: after parsing the data, the `parse()` method extracts the next-page URL, uses `response.urljoin()` to build the full URL, yields a request for the next page, and registers `parse()` as the callback for further parsing. This carries the crawl across all pages.

Scrapy's internal mechanism: when a request is generated in a callback method, Scrapy places it in a queue for unified scheduling and executes the registered callback when the request completes. Using this mechanism, you can build complex scrapers with custom rules and parse different types of data depending on the crawled page. Due to the large volume of data, I didn't wait for the spider to finish crawling.

## Quickly Creating Requests

You can use `response.follow()` to create Request objects quickly: it supports relative paths and returns a Request object directly, so you can just yield it. `response.follow()` can even accept a selector directly instead of a string, and for `<a>` tags there is an even simpler approach: it uses their `href` attribute automatically. Note: since `response.css()` returns a list, you need to iterate over it with a for loop, or take only the first element, when passing selectors to `response.follow()`.
## Other Examples

The spider above starts from the main page and crawls the profile page of each user who asked a question. By default, Scrapy filters out duplicate URLs; this behavior can be configured via the `DUPEFILTER_CLASS` setting. CrawlSpider is a generic spider that you can use as a base class for writing scrapers. Additionally, a common pattern is to build an item from data spread across multiple pages, passing extra data along to the callbacks.

## Spider Arguments

When running a spider, you can pass command-line arguments using the `-a` option. These arguments are forwarded to the spider's `__init__()` method and become instance attributes by default, so they can be accessed inside the spider class. See the Scrapy documentation for more on argument handling.

## Going Further

Of course, the examples above are relatively simple; Scrapy has many other features, and its documentation covers them in depth.

## Code

All code from this article can be found on my GitHub: ScrapyDemo.