Best Practices for Web Scraper IDE
Learn the best practices for using Web Scraper IDE, including optimizing performance, handling errors, managing retries, and writing efficient scraper code.
Missing dead_page condition
When using the navigate command, add a dead_page condition to check whether the page was not found; this prevents pointless automatic retries. While we handle this automatically when the response status is 404, in some cases the website may respond with other status codes.
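A minimal sketch of the pattern in interaction code; the `.not-found` selector is a placeholder for whatever "page not found" marker the target site exposes, and the surrounding collect/parse calls are illustrative:

```js
// Interaction code (illustrative sketch)
navigate(input.url);

// Bad: no dead_page check — a permanently missing page keeps being retried.

// Good: mark the page as dead so the system stops retrying it.
if (el_exists('.not-found')) {
  dead_page('Page not found: ' + input.url);
}
collect(parse());
```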
Minimize the number of requests to the browser
Interaction commands such as click, type, el_exists, el_is_visible, wait, and wait_visible make real requests to the browser, which adds latency and reduces performance. Where possible, combine selectors and make a single call instead of several separate calls.
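For example, several existence checks can usually be collapsed into one combined CSS selector; the selectors below are placeholders:

```js
// Bad: three separate round-trips to the browser
if (el_exists('.price') || el_exists('.sale-price') || el_exists('.discount')) {
  click('#buy');
}

// Good: one combined selector, one round-trip
if (el_exists('.price, .sale-price, .discount')) {
  click('#buy');
}
```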
Incorrect usage of rerun_stage()
When a website paginates its content and data from all pages is required, rerun_stage() should be called for every page from the root page, rather than having each page queue only the next one. This lets the system parallelize the requests and makes the scraper faster.
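A sketch of the difference, assuming the site addresses pages with a page query parameter and that the parser exposes a total page count (the next_page_url, total_pages, and page fields are hypothetical):

```js
// Bad: each run queues only the next page, so pages are fetched serially
navigate(input.url);
let next = parse().next_page_url; // hypothetical parser field
if (next) rerun_stage({url: next});
collect(parse());

// Good: the root page queues every page at once, so they run in parallel
navigate(input.url);
if (!input.page) {
  let total = parse().total_pages; // hypothetical parser field
  for (let page = 2; page <= total; page++) {
    rerun_stage({url: input.url + '?page=' + page, page: page});
  }
}
collect(parse());
```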
Use close_popup() to close popups
Do not spend time waiting for popups to appear. Use close_popup('popup_selector', 'close_button_selector') to close them. A popup can appear at any time, and in most cases adding a check before each interaction command is not desirable.
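A sketch with placeholder selectors, assuming close_popup registers the handler once for the rest of the run, as described above:

```js
// Bad: checking for the popup before every interaction
if (el_exists('.newsletter-modal')) click('.newsletter-modal .close');
click('.add-to-cart');
if (el_exists('.newsletter-modal')) click('.newsletter-modal .close');
click('.checkout');

// Good: register the popup once; it is dismissed whenever it appears
close_popup('.newsletter-modal', '.newsletter-modal .close');
click('.add-to-cart');
click('.checkout');
```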
Use wait_for_parser_value() with tag_response()
When using the tag_response command and you need to ensure the tagged request has finished before collecting data from the page, use wait_for_parser_value().
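A sketch assuming tag_response(name, pattern) tags a matching network response and wait_for_parser_value(name) blocks until the parser has that value; the URL pattern and field name are placeholders:

```js
// Tag the XHR that carries the product data, then navigate
tag_response('product_api', /\/api\/products\//);
navigate(input.url);

// Bad: collecting immediately — the tagged request may not have finished yet
// collect(parse());

// Good: wait until the tagged response has been parsed before collecting
wait_for_parser_value('product_api');
collect(parse());
```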
Custom error messages
Avoid using custom error messages when possible. Our system does its best to provide you with the most accurate error messages.
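A sketch of the idea; the selector is a placeholder:

```js
// Bad: wrapping a command just to replace its error with a custom message
try {
  wait('.price');
} catch (e) {
  throw new Error('could not load price block');
}

// Good: let the command fail on its own — the system's own error
// already describes which selector failed and why
wait('.price');
```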
Slow website response, increasing timeouts
If the website is not loading properly, it may be due to a poor peer connection. Rather than increasing timeouts, it is better to let the error surface; the system will then attempt to load the page using a more stable peer session.
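A sketch, assuming wait() accepts a timeout option (the option name is an assumption) and that a surfaced error triggers a retry on a better peer, as described above:

```js
// Bad: masking a bad peer connection by inflating timeouts
navigate(input.url);
wait('.content', {timeout: 120000});

// Good: keep default timeouts and let the error surface —
// the system will retry the page with a more stable peer session
navigate(input.url);
wait('.content');
```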
Retry mechanism
The scraper code should be clear and focus solely on the tasks needed to scrape the data; there is no need to reinvent the wheel with custom retry logic. Issues unrelated to the code are better surfaced as errors and reported to the system, which handles retries itself.
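A sketch of keeping retries out of the scraper code; the selectors are placeholders:

```js
// Bad: hand-rolled retry loop inside the scraper
for (let attempt = 0; attempt < 3; attempt++) {
  try {
    navigate(input.url);
    wait('.content');
    break;
  } catch (e) {
    // swallow the error and retry manually
  }
}
collect(parse());

// Good: write only the scraping steps and let the platform retry failed runs
navigate(input.url);
wait('.content');
collect(parse());
```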
Avoid using a try-catch block
Handle potential ‘null’ or ‘undefined’ values directly instead of wrapping code in a try-catch block; this keeps the code concise and readable.
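A parser-code sketch using optional chaining and a default value as one way to do this; the selector and field name are placeholders:

```js
// Parser code (cheerio-style $)

// Bad: try-catch just to survive a missing element/attribute
let seller_url;
try {
  seller_url = $('a.seller-link').attr('href').trim();
} catch (e) {
  seller_url = null;
}

// Good: optional chaining handles the missing attribute without try-catch
let seller_link = $('a.seller-link').attr('href')?.trim() || null;
```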
Parser code: get values from set of elements
Prefer the more concise, functional toArray() and map() methods over a traditional each() loop; this improves readability and keeps the parser code declarative.
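A parser-code sketch of the two styles; the selector and field names are placeholders:

```js
// Bad: imperative each() loop with manual accumulation
let names = [];
$('.product .name').each(function () {
  names.push($(this).text());
});

// Good: declarative toArray() + map()
let product_names = $('.product .name').toArray().map(el => $(el).text());
```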
Normalizing text
We added a custom function to the cheerio prototype, $(selector).text_sane(), which removes all unnecessary whitespace characters and replaces them with a single space.
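A parser-code sketch; the selector is a placeholder:

```js
// Without text_sane(): whitespace has to be normalized by hand
let description = $('.description').text().replace(/\s+/g, ' ').trim();

// With text_sane(): unnecessary whitespace is collapsed into single spaces
let description_sane = $('.description').text_sane();
```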