Best Practices for Web Scraper IDE
Learn the best practices for using Web Scraper IDE, including optimizing performance, handling errors, managing retries, and writing efficient scraper code.
Missing dead_page condition
When using the navigate command, add a dead_page condition to check whether the page was not found; this prevents pointless automatic retries. While we handle this automatically when the response status is 404, in some cases the website may respond with other status codes.
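A minimal sketch of the pattern in interaction code; the `.not-found` selector is a placeholder for whatever "page not found" marker the target site exposes, and the surrounding collect/parse calls are illustrative:

```js
// Interaction code (illustrative sketch)
navigate(input.url);

// Bad: no dead_page check — a permanently missing page keeps being retried.

// Good: mark the page as dead so the system stops retrying it.
if (el_exists('.not-found')) {
  dead_page('Page not found: ' + input.url);
}
collect(parse());
```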
Minimize the number of requests to the browser
Interaction commands such as click, type, el_exists, el_is_visible, wait, and wait_visible make real requests to the browser, which adds latency and reduces performance. Where possible, combine selectors and make a single call instead of several separate calls.
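For example, several existence checks can usually be collapsed into one combined CSS selector; the selectors below are placeholders:

```js
// Bad: three separate round-trips to the browser
if (el_exists('.price') || el_exists('.sale-price') || el_exists('.discount')) {
  click('#buy');
}

// Good: one combined selector, one round-trip
if (el_exists('.price, .sale-price, .discount')) {
  click('#buy');
}
```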
Incorrect usage of rerun_stage()
When a website paginates its content and data from all pages is required, rerun_stage() should be called for every page from the root page, rather than having each page queue only the next one. This lets the system parallelize the requests and makes the scraper faster.
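A sketch of the difference, assuming the site addresses pages with a page query parameter and that the parser exposes a total page count (the next_page_url, total_pages, and page fields are hypothetical):

```js
// Bad: each run queues only the next page, so pages are fetched serially
navigate(input.url);
let next = parse().next_page_url; // hypothetical parser field
if (next) rerun_stage({url: next});
collect(parse());

// Good: the root page queues every page at once, so they run in parallel
navigate(input.url);
if (!input.page) {
  let total = parse().total_pages; // hypothetical parser field
  for (let page = 2; page <= total; page++) {
    rerun_stage({url: input.url + '?page=' + page, page: page});
  }
}
collect(parse());
```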
Use close_popup() to close popups
Do not spend time waiting for popups to appear. Use close_popup('popup_selector', 'close_button_selector') to close them. A popup can appear at any time, and in most cases adding a check before each interaction command is not desirable.
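A sketch with placeholder selectors, assuming close_popup registers the handler once for the rest of the run, as described above:

```js
// Bad: checking for the popup before every interaction
if (el_exists('.newsletter-modal')) click('.newsletter-modal .close');
click('.add-to-cart');
if (el_exists('.newsletter-modal')) click('.newsletter-modal .close');
click('.checkout');

// Good: register the popup once; it is dismissed whenever it appears
close_popup('.newsletter-modal', '.newsletter-modal .close');
click('.add-to-cart');
click('.checkout');
```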
Use wait_for_parser_value() with tag_response()
When using the tag_response command and you need to ensure the tagged request has finished before collecting data from the page, use wait_for_parser_value().
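A sketch assuming tag_response(name, pattern) tags a matching network response and wait_for_parser_value(name) blocks until the parser has that value; the URL pattern and field name are placeholders:

```js
// Tag the XHR that carries the product data, then navigate
tag_response('product_api', /\/api\/products\//);
navigate(input.url);

// Bad: collecting immediately — the tagged request may not have finished yet
// collect(parse());

// Good: wait until the tagged response has been parsed before collecting
wait_for_parser_value('product_api');
collect(parse());
```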
Custom error messages
Avoid using custom error messages when possible. Our system does its best to provide you with the most accurate error messages.
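A sketch of the idea; the selector is a placeholder:

```js
// Bad: wrapping a command just to replace its error with a custom message
try {
  wait('.price');
} catch (e) {
  throw new Error('could not load price block');
}

// Good: let the command fail on its own — the system's own error
// already describes which selector failed and why
wait('.price');
```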
Slow website response, increasing timeouts
If the website is not loading properly, it may be due to a poor peer connection. Rather than increasing timeouts, it is better to let the error surface; the system will then attempt to load the page using a more stable peer session.
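A sketch, assuming wait() accepts a timeout option (the option name is an assumption) and that a surfaced error triggers a retry on a better peer, as described above:

```js
// Bad: masking a bad peer connection by inflating timeouts
navigate(input.url);
wait('.content', {timeout: 120000});

// Good: keep default timeouts and let the error surface —
// the system will retry the page with a more stable peer session
navigate(input.url);
wait('.content');
```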
Retry mechanism
The scraper code should be clear and focus solely on the tasks needed to scrape the data; there is no need to reinvent the wheel with custom retry logic. Issues unrelated to the code are better surfaced as errors and reported to the system, which handles retries itself.
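A sketch of keeping retries out of the scraper code; the selectors are placeholders:

```js
// Bad: hand-rolled retry loop inside the scraper
for (let attempt = 0; attempt < 3; attempt++) {
  try {
    navigate(input.url);
    wait('.content');
    break;
  } catch (e) {
    // swallow the error and retry manually
  }
}
collect(parse());

// Good: write only the scraping steps and let the platform retry failed runs
navigate(input.url);
wait('.content');
collect(parse());
```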
Avoid using a try-catch block
Handle potential ‘null’ or ‘undefined’ values directly instead of wrapping code in a try-catch block; this keeps the code concise and readable.
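A parser-code sketch using optional chaining and a default value as one way to do this; the selector and field name are placeholders:

```js
// Parser code (cheerio-style $)

// Bad: try-catch just to survive a missing element/attribute
let seller_url;
try {
  seller_url = $('a.seller-link').attr('href').trim();
} catch (e) {
  seller_url = null;
}

// Good: optional chaining handles the missing attribute without try-catch
let seller_link = $('a.seller-link').attr('href')?.trim() || null;
```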
Parser code: get values from set of elements
Prefer the more concise, functional toArray() and map() methods over a traditional each() loop; this improves readability and keeps the parser code declarative.
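A parser-code sketch of the two styles; the selector and field names are placeholders:

```js
// Bad: imperative each() loop with manual accumulation
let names = [];
$('.product .name').each(function () {
  names.push($(this).text());
});

// Good: declarative toArray() + map()
let product_names = $('.product .name').toArray().map(el => $(el).text());
```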
Normalizing text
We added a custom function to the cheerio prototype, $(selector).text_sane(), which removes all unnecessary whitespace characters and replaces them with a single space.
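A parser-code sketch; the selector is a placeholder:

```js
// Without text_sane(): whitespace has to be normalized by hand
let description = $('.description').text().replace(/\s+/g, ' ').trim();

// With text_sane(): unnecessary whitespace is collapsed into single spaces
let description_sane = $('.description').text_sane();
```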