Missing dead_page condition

When using the navigate command, add a dead_page condition that checks whether the page was not found. This prevents automatic retries of a page that does not exist. While we handle this automatically when the response status is 404, some websites respond to a missing page with other status codes. The following example shows the pattern to avoid, because it waits out the full timeout before declaring the page dead:

try {
  // no need to wait 30 sec for 'ok-selector' when the page is dead
  wait('ok-selector');
} catch(e) {
  // in this case we can't be sure that the page is really dead
  dead_page('Page doesn\'t exist');
}
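
A possible alternative to the try-catch above, sketched under the assumption that the site renders a distinct element on its "not found" page: wait for either element in one call, then call dead_page only when the not-found marker is present. The selector names are placeholders, and the wait, el_exists, and dead_page definitions are hypothetical stand-ins so the sketch runs outside the scraper IDE; real scraper code would call the platform commands directly.

```javascript
// Hypothetical stand-ins for the platform commands, only so this runs locally.
const present = new Set(['404-selector']); // pretend the "not found" marker rendered
const matches = (selectors) => selectors.split(',').some((s) => present.has(s.trim()));
const wait = (selectors) => { if (!matches(selectors)) throw new Error('timeout'); };
const el_exists = (selectors) => matches(selectors);
const dead_page = (message) => { throw new Error('dead_page: ' + message); };

function checkPage() {
  // One wait covers both outcomes, so a dead page is detected as soon as its marker renders.
  wait('ok-selector, 404-selector');
  if (el_exists('404-selector')) {
    dead_page('Page doesn\'t exist');
  }
  return 'ok';
}
```

With this shape, dead_page is called only when there is positive evidence that the page is missing, not merely because a wait timed out.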

Minimize the number of requests to the browser

Some interaction commands, such as click, type, el_exists, el_is_visible, wait, and wait_visible, make real requests to the browser and may increase latency and decrease performance. Combine selectors into a single call instead of making multiple calls. The following example shows the pattern to avoid, with one request per selector:

// four separate requests to the browser
if (!(el_exists('#price1') || el_exists('#price2')
  || el_exists('#price3') || el_exists('#discount')))
{
  dead_page('No price found');
}
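
The same check can be sketched as a single call by joining the selectors into one comma-separated selector list, which matches if any of the listed selectors matches. The el_exists and dead_page definitions below are hypothetical stand-ins that count round trips so the sketch runs outside the scraper IDE; the real commands are provided by the platform.

```javascript
// Hypothetical stand-ins: el_exists counts browser round trips and reports "not found".
let browserRequests = 0;
const el_exists = (selector) => { browserRequests += 1; return false; };
const dead_page = (message) => { throw new Error('dead_page: ' + message); };

function requirePrice() {
  // One selector list replaces the four separate checks, so only one request is made.
  if (!el_exists('#price1, #price2, #price3, #discount')) {
    dead_page('No price found');
  }
}
```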

Incorrect usage of rerun_stage()

When a website paginates its results and data from all pages is required, call rerun_stage() for every page from the root page instead of chaining one call from each page to the next. This allows the system to parallelize the requests and makes the scraper faster. The following example shows the sequential pattern to avoid:

navigate(input.url);
let $ = html_load(html());
let next_page_url = $('.next_page').attr('href');
rerun_stage({url: next_page_url});
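
One way to fan out from the root page is to compute the input for every remaining page up front and call rerun_stage once per entry. This is a minimal sketch under assumptions: the helper name pageInputs, the "page" query parameter, and the '.total_pages' selector in the comment are hypothetical and depend on the target site.

```javascript
// Builds the stage input for pages 2..totalPages from the root page URL.
// The "page" query parameter is an assumption; adapt it to the target site.
function pageInputs(rootUrl, totalPages) {
  const inputs = [];
  for (let page = 2; page <= totalPages; page++) {
    const url = new URL(rootUrl);
    url.searchParams.set('page', String(page));
    inputs.push({ url: url.toString(), page });
  }
  return inputs;
}

// In scraper code (platform commands, not defined in this sketch):
//   if (!input.page) {
//     const total = Number($('.total_pages').text());
//     for (const next of pageInputs(input.url, total)) rerun_stage(next);
//   }
```

Passing the page number in the input lets the stage tell the root page apart from the pages it spawned, so only the root page schedules further runs.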

Use close_popup() to close popups

Do not spend time waiting for popups to appear. Use close_popup('popup_selector', 'close_button_selector') to close them. A popup can appear at any time, and in most cases adding a check before each interaction command is not desirable. The following example shows the approach to avoid:

navigate('https://example.com');
try {
  wait_visible('.cky-btn-accept', {timeout: 5000});
  click('.cky-btn-accept');
} catch(e) {
  console.log('Accept cookies button does not exist, continue');
}
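
A sketch of the close_popup approach for the same cookie banner, assuming the command is called once near the top of the code, before navigating. The '.cookie-popup' selector is a placeholder, and close_popup and navigate are hypothetical stand-ins that only record the calls so the sketch runs outside the scraper IDE.

```javascript
// Hypothetical stand-ins that only record the calls made.
const calls = [];
const close_popup = (popupSelector, closeSelector) =>
  calls.push(`close_popup ${popupSelector} ${closeSelector}`);
const navigate = (url) => calls.push(`navigate ${url}`);

function openPage() {
  // Declare the popup and its close button once; no wait or try-catch is needed.
  close_popup('.cookie-popup', '.cky-btn-accept');
  navigate('https://example.com');
}
```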

Use wait_for_parser_value() with tag_response()

When using the tag_response command, use wait_for_parser_value() to make sure the tagged request has finished before the parser collects data from it. The following example omits that call, so the parser may run before the response arrives:

tag_response('product', /api\/product/);
navigate('https://example.com');

// parser code
// in this case we can't be sure that the request is finished
let {product} = parser;
return product.data;
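
A sketch of the interaction code with the wait added, assuming wait_for_parser_value accepts the tagged field name. The three commands are hypothetical stand-ins that only record call order so the sketch runs outside the scraper IDE; the platform provides the real ones.

```javascript
// Hypothetical stand-ins that only record call order.
const calls = [];
const tag_response = (name, pattern) => calls.push(`tag_response:${name}`);
const navigate = (url) => calls.push(`navigate:${url}`);
const wait_for_parser_value = (field) => calls.push(`wait_for_parser_value:${field}`);

function collectProduct() {
  tag_response('product', /api\/product/);
  navigate('https://example.com');
  // In the IDE, this is where the scraper would wait until parser.product is available.
  wait_for_parser_value('product');
}
```

The parser code can then read parser.product knowing the interaction code waited for it.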

Custom error messages

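Avoid using custom error messages when possible. Our system does its best to provide you with the most accurate error messages, and a custom message replaces them with something less specific. The following example shows the pattern to avoid: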

try {
  wait('selector1');
  //some code
  wait('selector2');
  //some code
} catch(e) {
  throw "Page not loaded properly"
}
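
Without the try-catch, the original error reaches the system unchanged. The sketch below illustrates this with a hypothetical stand-in for wait that fails and names the selector; the real command and its error format are provided by the platform and may differ.

```javascript
// Hypothetical stand-in: fails the way a timed-out wait might, naming the selector.
const wait = (selector) => { throw new Error(`timeout waiting for ${selector}`); };

function loadPage() {
  // No try-catch: the original error, including which selector failed, propagates.
  wait('selector1');
  // some code
  wait('selector2');
  // some code
}
```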

Slow website response, increasing timeouts

If the website is not loading properly, it may be due to a poor peer connection. Instead of increasing the timeout, it is advisable to let the command fail with an error; the system will then attempt to load the page using a more stable peer session.

// 120 sec is too long to wait
wait('selector', {timeout: 120000});

Retry mechanism

The scraper code should be clear and focus solely on the tasks necessary for scraping data. There is no need to reinvent the wheel by building your own retry logic. Issues that are unrelated to the code are better reported to the system than worked around in the scraper. The following example shows the pattern to avoid:

let counter = input.counter || 5;
while (counter > 1) {
  try {
    wait('selector', {timeout: 500});
    click('selector');
    type('selector');
    //some code
    break;
  } catch(e) {
    // not acceptable: using rerun_stage to create a new session in case of error
    return rerun_stage({...input, counter: --counter});
  }
}


Avoid using a try-catch block

Use optional chaining (?.) when reading properties that may be ‘null’ or ‘undefined’. This keeps the code concise and readable and handles missing values without relying on a try-catch block. The following example shows the pattern to avoid:

try {
  const example = obj.prop;
} catch(e) {}
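
A minimal sketch of handling missing values without try-catch, using optional chaining (?.) together with the nullish coalescing operator (??). The object shapes and the readName helper are made up for illustration.

```javascript
const withProp = { prop: { name: 'widget' } };
const withoutProp = null;

// Optional chaining yields undefined instead of throwing when a link in the chain
// is null or undefined; ?? then supplies a fallback value.
const readName = (obj) => obj?.prop?.name ?? 'unknown';

const first = readName(withProp);     // 'widget'
const second = readName(withoutProp); // 'unknown'
```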

Parser code: get values from set of elements

Prefer the more concise and functional toArray() and map() methods over the traditional each() loop. This improves readability and keeps the code declarative. The following example shows the each() version to avoid:

const links = [];
$('.card.product-wrapper').each(function(i, el) {
  links.push({url: $(this).find('h4 a').attr('href')});
})
return links;
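
The same extraction can be sketched with toArray() and map(). In parser code, $ is the cheerio instance; the $ defined below is a hypothetical stand-in over plain objects, present only so the sketch runs without the library.

```javascript
// Hypothetical stand-in for a cheerio selection, only so this runs without the library.
const cards = [{ href: '/p/1' }, { href: '/p/2' }];
const $ = (target) => typeof target === 'string'
  ? { toArray: () => cards }
  : { find: () => ({ attr: (name) => target[name] }) };

// Declarative version: turn the selection into an array, then map each element to a record.
const links = $('.card.product-wrapper')
  .toArray()
  .map((el) => ({ url: $(el).find('h4 a').attr('href') }));
```

In parser code, the expression assigned to links can be returned directly, with no intermediate array or push calls.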

Normalizing text

We added a custom function to the cheerio prototype, $(selector).text_sane(), that removes unnecessary whitespace by replacing each run of whitespace characters with a single space. Use it instead of defining your own helper like the one below:

$.prototype.clearText = function () {
  return this.text().replace(/\s+/g, ' ').trim();
}