- Navigation - Moving through websites to reach target data
- Parsing - Extracting and structuring the data
Browser Worker vs. Code Worker
Browser Workers and Code Workers are two technical approaches to scraping. Choose between them based on your needs and the technical challenges of the website you're scraping. A Browser Worker simulates user interactions with websites through a headless browser, which makes it capable of handling complex scraping tasks such as user input and dynamic content loading. A Code Worker, on the other hand, runs on the server side and performs scraping tasks through HTTP requests: a script or program sends requests to the target website, extracts data from the individual responses, and saves it to a file or database. Running your code with a Code Worker can deliver faster results.

Switching Between Workers
You can switch between worker types per scraper at any time, so you're never committed to a specific worker. However, be aware that some functions (like wait) are designed for and limited to Browser Workers only. Learn more about worker types.
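As a rough illustration of the Code Worker approach (plain HTTP requests rather than a headless browser), here is a minimal Node.js sketch; the URL, the regex-based extraction, and the output file are placeholder assumptions, not part of any platform API:

```js
// Minimal sketch of the HTTP-request approach used by Code Workers.
// Assumes Node 18+ (global fetch); the URL and extraction are placeholders.
const fs = require('fs');

async function scrapePage(url) {
  const response = await fetch(url);           // plain HTTP GET, no browser
  const html = await response.text();

  // Hypothetical extraction: pull the <title> tag out of the raw HTML.
  const match = html.match(/<title>(.*?)<\/title>/i);
  return { url, title: match ? match[1] : null };
}

scrapePage('https://example.com')
  .then((record) => {
    // Save the extracted record to a file (or a database in real use).
    fs.writeFileSync('output.json', JSON.stringify(record, null, 2));
  })
  .catch(console.error);
```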
Interaction and Parsing
Interaction and parsing are two key steps in web scraping. Together, they let you navigate a website, load the relevant content, and extract structured data from it.

Interaction is the process of navigating through a website to reach the pages or sections that contain the data you want to collect (a code sketch follows the list below). This can include:
- Sending GET or POST requests
- Following links and handling pagination
- Submitting forms
- Performing browser actions such as click, type, and wait
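As a minimal sketch of such an interaction step, assuming a Browser Worker environment that exposes the click, type, and wait commands listed above plus a navigate() helper (the URL and CSS selectors are placeholder assumptions):

```js
// Minimal interaction sketch for a Browser Worker.
// navigate(), the URL, and the selectors are illustrative assumptions.
navigate('https://example.com/search'); // open the search page
wait('#search-box');                    // wait for the input field to render
type('#search-box', 'running shoes');   // fill in the search form
click('#search-button');                // submit the form
wait('.search-results');                // wait for dynamic results to load
```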
Parsing is the process of extracting structured data from the content you have loaded. Once the relevant content is on the page, call parse() to extract it. This triggers the Parser code, which reads the page content (HTML/JSON) and returns the extracted fields.
After parsing, call collect() to add the extracted record to your final dataset.
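Assuming the same environment, a rough sketch of how parse() and collect() fit together after an interaction step (the navigate() call, URL, and field names are illustrative assumptions):

```js
// Sketch of the parse-then-collect flow (field names are assumptions).
navigate('https://example.com/product/123'); // interaction: load a product page

// parse() triggers the Parser code against the loaded HTML/JSON and
// returns the extracted fields as an object.
const product = parse();

// collect() adds one structured record to the final dataset.
collect({
  title: product.title,
  price: product.price,
});
```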
For example, a typical multi-stage flow for an ecommerce site (sketched in code after this list) looks like this:
- Navigate to the search page of the ecommerce website
- Locate the HTML elements that contain the number of pages
- Parse the HTML to extract the number of search results pages
- Call next_stage({ search_page }) to queue the next stage with that page's URL as input.
- Navigate to each result page
- Locate the HTML elements that contain each search result data
- Parse the HTML content of each search results page to collect all listing URLs found on that page
- Navigate to each product/listing page
- Locate the HTML elements that contain the desired product data
- Parse the HTML to extract the desired product data
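Under the same assumptions (navigate() and the input/field names are placeholders; parse(), collect(), and next_stage() are the calls described above), the three stages might be sketched roughly like this, with each stage running as its own script:

```js
// Stage 1: discover how many search-result pages exist.
navigate(input.search_url);
const { total_pages } = parse();   // Parser code extracts the page count
for (let page = 1; page <= total_pages; page++) {
  // Queue stage 2 once per result page, passing its URL as input.
  next_stage({ search_page: `${input.search_url}?page=${page}` });
}

// Stage 2: collect every listing URL found on one result page.
navigate(input.search_page);
const { listing_urls } = parse();  // Parser code returns the listing links
for (const url of listing_urls) {
  // Queue stage 3 once per product/listing page.
  next_stage({ listing_url: url });
}

// Stage 3: extract the product data and add it to the dataset.
navigate(input.listing_url);
const product = parse();           // Parser code returns the product fields
collect({
  title: product.title,
  price: product.price,
  url: input.listing_url,
});
```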