Complete Web Scraper IDE Examples

Introduction

The collect and parse commands have been removed. The data will be returned from parser code as an object or array, and it will be automatically saved to the output:

// Old code
navigate("https://example.com");
collect(parse());

// New code
navigate("https://example.com");

// New code alternative
navigate("https://example.com");
tag_html("html_key");

New commands have been added to provide access to the data from the interaction code: tag_html, tag_request, tag_graphql

In addition, some existing commands have been updated: tag_response, tag_sitemap, tag_all_responses. See IDE documentation for more details.

When using any tag commands, you can provide a custom name. You can then access the data using this name in the parser code under parser.YOUR_KEY.

For tag_html, current browser location URL will be saved under parser.YOUR_KEY_url.

For simple cases when only a single tag_html is needed, it can be skipped, and it will be automatically saved under parser.page.

navigate("https://example.com/1");
tag_html("page1");
navigate("https://example.com/2");
tag_html("page2");
navigate("https://example.com/3");
tag_html("page3");

Sometimes, it’s necessary to get parsed data within the interaction code and use it to make request. See examples how to do it:

navigate("https://example.com/1");
tag_html("page1");
let page_html = html();
let page_html2 = wait_for_parser_value("page1"); // the same
let $ = load_html(page_html);
let req_id = $('.product-id').text();
tag_request("product_json", {url: "https://example.com/product/"+req_id});

Multiple results

To collect multiple results, array can be returned from the parser code.

navigate("https://example.com/products");

Reparse

Reparse is a new feature that allows to reparse the data that was already collected. It can be useful when you want to change the parser code without rerunning the entire interaction code:

`next_stage` and `rerun_stage`

When scraper has more than one step, the parser code is only available in the last step. All other steps can only have next_stage calls. To parse something from the page, load_html should be used:

navigate(input.domain_url);
tag_html('html');
const $ = load_html(wait_for_parser_value('html'));
$('a.layout-categories-category__link').toArray()
  .map(v => new URL($(v).attr('href'), location.href))
  .filter(x => !x?.href.includes('home')).forEach(i => {
  next_stage({url: i})
});

Basic PDP scraper

let url = new URL(input.url.replace('https://www.slintel.com','https://6sense.com'));
url = new URL(url.pathname, 'https://6sense.com');
navigate(url);
if (location.href === 'https://6sense.com/company')
  dead_page(`Page not found`);
tag_html('html');

Multiple navigates example

const tabs = [
  'topactivity',
  'answers',
  'questions',
  'tags',
  'articles',
  'badges',
  'bookmarks',
  'bounties'
];
function loadTab(url, name) {
  return new Promise(async (resolve, reject)=>{
    try {
        const tabUrl = new URL(url);
        tabUrl.searchParams.set('tab', name);
        navigate(tabUrl, { allow_status: [404] });
        const html_ = html();
        tag_html(name);
        resolve(html_);
    } catch(e) { reject(e); }
  });
}
const userIdFromUrl = (input.url && input.url.includes('https://stackoverflow.com/users/'))
    ? input.url.replace('https://stackoverflow.com/users/', '').split('/').shift()
    : null;
const userId = input.user_id || userIdFromUrl || 1;
const userUrl = `https://stackoverflow.com/users/${userId}`;
navigate(userUrl, { allow_status: [404] });
tag_html('user');
Promise.all(tabs.map( async tabName => loadTab(userUrl, tabName)));

Multiple `tag_response`

close_popup('._1piuevz', '._1piuevz');
tag_response('stay', /\/StaysPdpSections/);
tag_response('calendar', /\/PdpAvailabilityCalendar/);
tag_response('reviews', /api\/v3\/StaysPdpReviews/);
navigate(input.url+'?enable_auto_translate=false#availability-calendar');
if (el_exists('img[src*="error_pages/404"]'))
  return dead_page('Page not found');
const room_id = /\/(\d+)(?:\?|$)/.exec(input.url)?.[1];
if (el_exists('._wgmchy ._1qx9l5ba'))
  click('._wgmchy ._1qx9l5ba');
wait('#data-state, #data-deferred-state',{timeout: 6e4});
wait_for_parser_value('stay');
wait_for_parser_value('calendar');
wait_for_parser_value('reviews');
tag_html('html');

​Introduction

​Multiple results

​Reparse

​next_stage and rerun_stage

​Basic PDP scraper

​Multiple navigates example

​Multiple tag_response