Learn the best practices for using Web Scraper IDE, including optimizing performance, handling errors, managing retries, and writing efficient scraper code.
dead_page condition
When using the navigate command, a ‘dead_page’ condition should be added to check whether the page was not found. This prevents automatic retries. While we handle this automatically when the response status is 404, in some cases the website may respond with various other status codes:
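For example, a minimal sketch of an interaction that marks such a page as dead (the “not found” selector and message are hypothetical):

```js
navigate(input.url);
// Some sites return a 200 status with a "not found" page instead of a 404;
// marking it as a dead page prevents pointless automatic retries
if (el_exists('.not-found-message')) {
  dead_page('Page not found: ' + input.url);
}
```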
Below are examples of good and bad practices.
Some interaction commands, such as click, type, el_exists, el_is_visible, wait, and wait_visible, make real requests to the browser and may increase latency and reduce performance. It is recommended to combine selectors and make a single call instead of multiple calls.
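For instance, a sketch with hypothetical selectors:

```js
// Bad: each el_exists() call is a separate round-trip to the browser
let has_results = el_exists('#search-results');
let has_grid = el_exists('#product-grid');

// Good: combine the selectors and make a single call
let has_content = el_exists('#search-results, #product-grid');
```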
rerun_stage()
When a website has pagination and data from all pages is required, rerun_stage() should be called for each page from the root page, rather than from each subsequent page. This allows the system to parallelize the requests and makes the scraper faster.
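A sketch of queueing every page from the root page (the `total_pages` parser field and the `?page=` URL pattern are hypothetical and site-specific):

```js
navigate(input.url);
// Read the total number of pages from the root page via the parser output
let total_pages = parse().total_pages || 1;
// Queue all remaining pages from the root page so they can run in parallel,
// instead of each page discovering and queueing only the next one
for (let page = 2; page <= total_pages; page++) {
  rerun_stage({ ...input, url: `${input.url}?page=${page}` });
}
```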
close_popup() to close popups
Do not spend time waiting for popups to appear. Use close_popup('popup_selector', 'close_button_selector') to close popups. A popup can appear at any time, and in most cases adding a check before each interaction command is not desirable.
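A sketch with hypothetical selectors:

```js
// Register the popup handler once, up front; if a matching popup shows up
// during later interactions it is closed without extra waiting
close_popup('.newsletter-modal', '.newsletter-modal .close-button');
navigate(input.url);
click('#show-reviews');
```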
wait_for_parser_value() with tag_response()
When using the tag_response command and you need to ensure that the tagged request has finished before collecting data from the page, wait_for_parser_value() should be used:
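A sketch (the tag name and URL pattern are hypothetical):

```js
// Tag the background request that carries the data we need
tag_response('products', /\/api\/products/);
navigate(input.url);
// Block until the parser has produced a value for the tagged field,
// so data is not collected before the request has finished
wait_for_parser_value('products');
collect(parse());
```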
Avoid using custom error messages when possible. Our system does its best to provide you with the most accurate error messages:
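For example, a sketch with a hypothetical selector:

```js
// Bad: the wrapper replaces the system's detailed error with a vague one
try {
  click('#submit');
} catch (e) {
  throw new Error('could not click the button');
}

// Good: let the command fail on its own; the reported error already
// identifies the selector and the reason
click('#submit');
```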
If the website is not loading properly, it may be due to a poor peer connection. In that case it is advisable to throw an error, and the system will attempt to load the page using a more stable peer session.
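A sketch, assuming a hypothetical '#main-content' selector that should always be present after a successful load:

```js
navigate(input.url);
// If the key content never rendered, the load was likely incomplete;
// throwing an error lets the system retry with a more stable peer session
if (!el_exists('#main-content')) {
  throw new Error('Page did not load properly');
}
```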
Scraper code should be clear and focus solely on the tasks necessary for scraping data. There is no need to reinvent the wheel; issues that are unrelated to the code are better highlighted and reported through the system.
Handling potential ‘null’ or ‘undefined’ values directly, rather than relying on a try-catch block, keeps the code concise and readable.
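In modern JavaScript, the optional chaining (?.) and nullish coalescing (??) operators are one way to achieve this; a sketch with hypothetical field names:

```js
// Without optional chaining: a missing nested field forces a try-catch
let brand;
try {
  brand = data.product.brand.name;
} catch (e) {
  brand = null;
}

// With optional chaining: concise, and no try-catch is needed
let brand_name = data?.product?.brand?.name ?? null;
```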
Best-practice code uses the more concise, functional toArray() and map() methods instead of the traditional each() loop. This improves readability and keeps the style declarative.
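A sketch in parser code, with hypothetical selectors:

```js
// each(): imperative, pushes into an external array
let titles = [];
$('.item .title').each(function () {
  titles.push($(this).text());
});

// toArray() + map(): declarative and more concise
let titles2 = $('.item .title').toArray().map((el) => $(el).text());
```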
We added a custom function to the cheerio prototype, $(selector).text_sane(), which removes all unnecessary whitespace characters and replaces them with a single space.
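A sketch with a hypothetical selector and element content:

```js
// Assume the element contains "  Apple\n   iPhone  15 "
$('.product-title').text();      // "  Apple\n   iPhone  15 "
$('.product-title').text_sane(); // "Apple iPhone 15"
```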