Browser Worker vs. Code Worker
Browser workers and code workers are two technical approaches to scraping. Choose between them based on your needs and budget, and on the technical challenges posed by the website you're scraping. Browser Workers simulate user interactions with websites through a headless browser and can handle complex scraping tasks such as user input and dynamic content loading. Running your code with a browser worker incurs a higher CPM, but in some use cases it is the only feasible way to get the data. Code Workers, on the other hand, run server-side and perform scraping tasks through HTTP requests: a script or program sends requests to the target website, extracts data from the individual responses, and saves it to a file or database. Running your code with a code worker is cheaper and delivers faster results. You can switch between worker types per scraper at any time, so you never commit to a specific worker, but be aware that a few functions (such as 'wait') are designed for, and limited to, Browser Workers. Learn more about worker types.

Interaction and Parsing
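To make the code-worker approach concrete, here is a minimal, generic sketch of scraping over plain HTTP: the raw response body is parsed directly, with no browser and no JavaScript execution. This is not the platform's own API; the page markup, field names, and `ProductParser` class are all hypothetical, and the HTML string stands in for a live HTTP response.

```python
from html.parser import HTMLParser

# Stand-in for an HTTP response body (hypothetical page). A code
# worker sees only this server-rendered markup; content injected
# by client-side JavaScript would be missing.
RESPONSE_BODY = """
<html><body>
  <h1 class="title">Acme Widget</h1>
  <span class="price">$19.99</span>
</body></html>
"""

class ProductParser(HTMLParser):
    """Collects the text of elements whose class attribute we target."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None  # class of the element we are inside

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in ("title", "price"):
            self._current = cls

    def handle_data(self, data):
        if self._current and data.strip():
            self.fields[self._current] = data.strip()
            self._current = None

parser = ProductParser()
parser.feed(RESPONSE_BODY)
print(parser.fields)  # {'title': 'Acme Widget', 'price': '$19.99'}
```

A browser worker would instead load the page in a headless browser, run its scripts, and expose interaction commands (click, type, wait) before any parsing happens, which is why it costs more per request.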
Interaction and parsing are two key steps in web scraping that involve accessing and manipulating the HTML content of a website. Interaction refers to the process of moving between pages or sections of a website to locate the data you want to scrape. This typically means sending GET or POST requests to the website's URLs, and following links or submitting forms to reach different pages or sections; it also includes actions/commands such as click, type, and wait. Once the browser page has the data you want, call parse() to get the data - this triggers the Parser code - then call collect() to add a record to your final dataset. For example:
- Navigate to the search page of the ecommerce website using a GET or POST request
- Locate the HTML elements that contain the number of pages
- Parse the HTML to extract the number of search results pages
- Navigate to each result page and do the following:
  - Locate the HTML elements that contain each search result's data
  - Parse the HTML content of each search result to collect the URL of each product page
- Navigate to each product page and do the following:
  - Locate the HTML elements that contain the desired product data
  - Parse the HTML to extract the desired product data
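The steps above can be sketched as a short crawl loop. This is a generic illustration, not the platform's navigate/parse/collect API: `fetch()` stands in for the navigation step, the `dataset` list plays the role of collect()'s output, and all URLs and markup are hypothetical fixtures. Regexes are used here only for brevity; real code should use a proper HTML parser.

```python
import re

# Hypothetical site fixtures: two search result pages and three
# product pages, keyed by URL so the example needs no network.
PAGES = {
    "/search?q=widget&page=1": """
        <div class="pager">Page 1 of 2</div>
        <a class="product" href="/p/1">Widget A</a>
        <a class="product" href="/p/2">Widget B</a>""",
    "/search?q=widget&page=2": """
        <div class="pager">Page 2 of 2</div>
        <a class="product" href="/p/3">Widget C</a>""",
    "/p/1": '<h1>Widget A</h1><span class="price">$5</span>',
    "/p/2": '<h1>Widget B</h1><span class="price">$7</span>',
    "/p/3": '<h1>Widget C</h1><span class="price">$9</span>',
}

def fetch(url):
    """Stand-in for the navigation step (a GET request or navigate)."""
    return PAGES[url]

dataset = []  # plays the role of collect()'s output

# Step 1: navigate to the search page and parse the number of pages.
first = fetch("/search?q=widget&page=1")
num_pages = int(re.search(r"Page \d+ of (\d+)", first).group(1))

# Step 2: navigate to each result page and collect product page URLs.
product_urls = []
for page in range(1, num_pages + 1):
    html = fetch(f"/search?q=widget&page={page}")
    product_urls += re.findall(r'href="(/p/\d+)"', html)

# Step 3: navigate to each product page and parse the desired data.
for url in product_urls:
    html = fetch(url)
    title = re.search(r"<h1>(.*?)</h1>", html).group(1)
    price = re.search(r'class="price">(.*?)<', html).group(1)
    dataset.append({"url": url, "title": title, "price": price})

print(dataset)
```

In the actual scraper the parsing steps would live in the Parser code triggered by parse(), and each record would be handed to collect() instead of appended to a local list.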