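Complete Web Scraper IDE Examples

Introduction

The collect and parse commands have been removed. Data is now returned from the parser code as an object or an array and is saved to the output automatically: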

// Old code
navigate("https://example.com");
collect(parse());

// New code
navigate("https://example.com");

// New code alternative
navigate("https://example.com");
tag_html("html_key");

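New commands have been added to give parser code access to data gathered in the interaction code: tag_html, tag_request, tag_graphql.

In addition, some existing commands have been updated: tag_response, tag_sitemap, tag_all_responses. For more details, see the IDE documentation: https://docs.brightdata.com/cn/api-reference/web-scraper-ide-api

When using any tag command, you can provide a custom name. You can then access the data in the parser code under parser.YOUR_KEY.

For tag_html, the current browser location URL is also saved, under parser.YOUR_KEY_url.

For simple cases where only one tag_html is needed, you can skip the call: the page HTML is saved automatically under parser.page.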

navigate("https://example.com/1");
tag_html("page1");
navigate("https://example.com/2");
tag_html("page2");
navigate("https://example.com/3");
tag_html("page3");
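
The parser side of this example is not shown here. As a rough sketch, assuming each tag_html call stores the page HTML as a string under parser.<key> and the page URL under parser.<key>_url (as described above), parser code could read the three pages as follows. The `parser` object is a hand-written stand-in so the snippet runs outside the IDE, and `summarizePages` is a hypothetical helper name.

```javascript
// Stand-in for the `parser` object available to parser code in the IDE.
// Assumption: tag_html("page1") stores the HTML string under parser.page1
// and the browser URL under parser.page1_url.
const parser = {
  page1: '<html><title>One</title></html>',
  page1_url: 'https://example.com/1',
  page2: '<html><title>Two</title></html>',
  page2_url: 'https://example.com/2',
  page3: '<html><title>Three</title></html>',
  page3_url: 'https://example.com/3',
};

// Hypothetical helper: build one record per tagged page.
function summarizePages(parser, keys) {
  return keys.map((key) => {
    const html = parser[key];
    // A regular expression keeps this sketch dependency-free; it is not a
    // general-purpose HTML parser.
    const match = /<title>([^<]*)<\/title>/i.exec(html);
    return {
      key,
      url: parser[`${key}_url`],
      title: match ? match[1].trim() : null,
    };
  });
}

const pages = summarizePages(parser, ['page1', 'page2', 'page3']);
```

Returning `pages` from the parser code would, under the same assumptions, produce one result per tagged page.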

Sometimes the interaction code needs to read parsed data and use it to make a request. Here is an example:

navigate("https://example.com/1");
tag_html("page1");
let page_html = html();
let page_html2 = wait_for_parser_value("page1"); // equivalent: returns the same HTML as html() above
let $ = load_html(page_html);
let req_id = $('.product-id').text();
tag_request("product_json", {url: "https://example.com/product/"+req_id});
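
The string concatenation above works only when the extracted ID is clean. A minimal sketch of a more defensive URL builder is shown below; `buildProductUrl` is a hypothetical helper and the base URL is the placeholder from the example, not a real endpoint.

```javascript
// Hypothetical helper: turn the raw text of an element into a request URL.
function buildProductUrl(rawId, base = 'https://example.com/product/') {
  // Text taken from the page often carries surrounding whitespace or newlines.
  const id = String(rawId).trim();
  if (!id) {
    throw new Error('Product id is empty; the selector may not have matched');
  }
  // Encode the id so unexpected characters cannot alter the URL structure.
  return new URL(encodeURIComponent(id), base).href;
}

const productUrl = buildProductUrl('  12345\n');
```

In the interaction code, the result could then be passed as the url option of tag_request.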

Multiple results

To collect multiple results, return an array from the parser code.

navigate("https://example.com/products");
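
The parser code for this example is not included here. The sketch below shows the general shape, assuming parser.page holds the HTML of the products page and that each element of the returned array is collected as a separate result. The markup, the `parseProducts` name, and the regular expression are illustrative only; inside the IDE a selector-based approach would likely be the more natural choice.

```javascript
// Hypothetical markup standing in for parser.page (assumption for illustration).
const sampleParser = {
  page: `
    <ul>
      <li class="product" data-id="1">Red mug</li>
      <li class="product" data-id="2">Blue mug</li>
    </ul>`,
};

// Sketch of parser code wrapped in a function so it can run outside the IDE.
function parseProducts(parser) {
  const pattern = /<li class="product" data-id="(\d+)">([^<]*)<\/li>/g;
  // One array element per matched product.
  return Array.from(parser.page.matchAll(pattern), ([, id, title]) => ({
    id: Number(id),
    title: title.trim(),
  }));
}

const products = parseProducts(sampleParser);
```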

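Reparse

Reparse is a new feature that lets you parse already-collected data again. It can be useful when you want to change the parser code without re-running the entire interaction code.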

`next_stage` and `rerun_stage`

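When a scraper has multiple steps, parser code is available only in the last step. All other steps can only make next_stage calls. To parse something from the page in those steps, use load_html: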


navigate(input.domain_url);
tag_html('html');
const $ = load_html(wait_for_parser_value('html'));
$('a.layout-categories-category__link').toArray()
  .map(v => new URL($(v).attr('href'), location.href))
  .filter(x => !x?.href.includes('home')).forEach(i => {
  next_stage({url: i})
});
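
The link handling in the snippet above can be isolated into a plain function, which makes it easier to test outside the IDE. This is a sketch under assumptions: `categoryUrls` is a hypothetical helper, the sample domain is invented, and the de-duplication step is an addition not present in the original code.

```javascript
// Hypothetical helper: resolve hrefs against the page URL, drop links that
// contain "home", and remove duplicates.
function categoryUrls(hrefs, baseUrl) {
  const seen = new Set();
  const result = [];
  for (const href of hrefs) {
    if (!href) continue; // anchors without an href attribute
    const url = new URL(href, baseUrl).href;
    if (url.includes('home') || seen.has(url)) continue;
    seen.add(url);
    result.push(url);
  }
  return result;
}

const urls = categoryUrls(
  ['/c/shoes', '/home', 'https://shop.example/c/bags', '/c/shoes', undefined],
  'https://shop.example/'
);
```

Each resulting string could then be passed to next_stage({url}) in the interaction code.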

Basic PDP scraper

let url = new URL(input.url.replace('https://www.slintel.com','https://6sense.com'));
url = new URL(url.pathname, 'https://6sense.com');
navigate(url);
if (location.href === 'https://6sense.com/company')
  dead_page(`Page not found`);
tag_html('html');
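
The first two lines of this scraper rewrite the input URL before navigating: the old domain is swapped for the new one, and only the path is kept. As a sketch of that behavior, the hypothetical helper `normalizeCompanyUrl` below performs the same two steps on a plain string; the sample path is invented.

```javascript
// Hypothetical helper mirroring the URL rewrite in the scraper above.
function normalizeCompanyUrl(inputUrl) {
  // Step 1: map the legacy domain onto the current one.
  const replaced = new URL(
    inputUrl.replace('https://www.slintel.com', 'https://6sense.com')
  );
  // Step 2: keep only the path, which drops any query string and fragment.
  return new URL(replaced.pathname, 'https://6sense.com').href;
}

const normalized = normalizeCompanyUrl(
  'https://www.slintel.com/company/123/acme?ref=x#top'
);
```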

Multiple navigation example

const tabs = [
  'topactivity',
  'answers',
  'questions',
  'tags',
  'articles',
  'badges',
  'bookmarks',
  'bounties'
];
function loadTab(url, name) {
  return new Promise(async (resolve, reject)=>{
    try {
        const tabUrl = new URL(url);
        tabUrl.searchParams.set('tab', name);
        navigate(tabUrl, { allow_status: [404] });
        const html_ = html();
        tag_html(name);
        resolve(html_);
    } catch(e) { reject(e); }
  });
}
const userIdFromUrl = (input.url && input.url.includes('https://stackoverflow.com/users/'))
    ? input.url.replace('https://stackoverflow.com/users/', '').split('/').shift()
    : null;
const userId = input.user_id || userIdFromUrl || 1;
const userUrl = `https://stackoverflow.com/users/${userId}`;
navigate(userUrl, { allow_status: [404] });
tag_html('user');
Promise.all(tabs.map( async tabName => loadTab(userUrl, tabName)));
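
Two pieces of pure logic in this example, resolving the user ID and building the per-tab URLs, can be sketched as standalone functions. `resolveUserId` and `tabUrls` are hypothetical names, and the sketch checks the URL prefix with startsWith rather than includes, which is a slightly stricter test than the original.

```javascript
const USERS_PREFIX = 'https://stackoverflow.com/users/';

// Hypothetical helper: prefer an explicit user_id, then the id found in the
// URL, and fall back to 1 as the original code does.
function resolveUserId(input) {
  const fromUrl = input.url && input.url.startsWith(USERS_PREFIX)
    ? input.url.slice(USERS_PREFIX.length).split('/')[0]
    : null;
  return input.user_id || fromUrl || 1;
}

// Hypothetical helper: one URL per tab, set through the "tab" query parameter.
function tabUrls(userUrl, tabs) {
  return tabs.map((name) => {
    const url = new URL(userUrl);
    url.searchParams.set('tab', name);
    return url.href;
  });
}

const exampleId = resolveUserId({ url: `${USERS_PREFIX}22656/some-user` });
const exampleTabs = tabUrls(`${USERS_PREFIX}${exampleId}`, ['answers', 'tags']);
```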

Multiple `tag_response`

close_popup('._1piuevz', '._1piuevz');
tag_response('stay', /\/StaysPdpSections/);
tag_response('calendar', /\/PdpAvailabilityCalendar/);
tag_response('reviews', /api\/v3\/StaysPdpReviews/);
navigate(input.url+'?enable_auto_translate=false#availability-calendar');
if (el_exists('img[src*="error_pages/404"]'))
  return dead_page('Page not found');
const room_id = /\/(\d+)(?:\?|$)/.exec(input.url)?.[1];
if (el_exists('._wgmchy ._1qx9l5ba'))
  click('._wgmchy ._1qx9l5ba');
wait('#data-state, #data-deferred-state',{timeout: 6e4});
wait_for_parser_value('stay');
wait_for_parser_value('calendar');
wait_for_parser_value('reviews');
tag_html('html');
