Web Scraper IDE Best Practices

Missing `dead_page` condition

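When you use the navigate command, add a dead_page condition that checks whether the page does not exist. This prevents automatic retries. The system handles this automatically when the response status is 404, but some websites return other status codes for missing pages. Each section below starts with a snippet showing the pattern to avoid.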

try {
  // on a dead page this waits the full 30 seconds for 'ok-selector' before failing
  wait('ok-selector');
} catch(e) {
  // in this case we can't be sure that the page is really dead
  dead_page('Page doesn\'t exist');
}
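
A possible alternative is to wait for either outcome in a single call and then branch on which one appeared. The sketch below assumes that `wait()` accepts a comma-separated selector list, and `'404-selector'` is a placeholder for whatever marks a missing page on the target site. The stub functions at the top are hypothetical stand-ins so the sketch runs outside the IDE; inside the IDE the built-in commands would be used instead.

```javascript
// Hypothetical stand-ins for the IDE built-ins, so the sketch runs on its own.
const present = new Set(['404-selector']); // selectors "found" on the fake page
const calls = [];
const matches = sel => sel.split(',').some(s => present.has(s.trim()));
function navigate(url) { calls.push(`navigate ${url}`); }
function wait(sel) {
  calls.push(`wait ${sel}`);
  if (!matches(sel)) throw new Error(`timeout waiting for ${sel}`);
}
function el_exists(sel) { calls.push(`el_exists ${sel}`); return matches(sel); }
function dead_page(msg) { calls.push(`dead_page ${msg}`); }

// Pattern: one wait that covers both the live page and the dead page.
function collect(url) {
  navigate(url);
  wait('ok-selector, 404-selector');
  if (el_exists('404-selector')) {
    dead_page('Page doesn\'t exist');
    return;
  }
  // ...continue scraping the live page
}

collect('https://example.com/item/1');
```

Because the wait returns as soon as either selector appears, a dead page is reported without waiting out the full timeout.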

Minimize requests to the browser

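Some interaction commands, such as click, type, el_exists, el_is_visible, wait, and wait_visible, send actual requests to the browser, which adds latency and reduces performance. Combine selectors into a single call instead of making several calls.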

// four separate browser requests for one check
if (!(el_exists('#price1') || el_exists('#price2')
  || el_exists('#price3') || el_exists('#discount')))
{
  dead_page('No price found');
}
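
The same check can be sketched as a single call by joining the selectors with commas, which CSS treats as "match any of these". The `el_exists` and `dead_page` functions below are hypothetical stubs that only count calls; they are not the IDE implementations.

```javascript
// Hypothetical stubs: count how many times the "browser" is queried.
let browserCalls = 0;
const present = new Set(['#discount']); // the fake page only has a discount element
function el_exists(sel) {
  browserCalls++;
  return sel.split(',').some(s => present.has(s.trim()));
}
const deadPages = [];
function dead_page(msg) { deadPages.push(msg); }

// One combined selector, one browser request.
if (!el_exists('#price1, #price2, #price3, #discount')) {
  dead_page('No price found');
}
```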

Incorrect usage of `rerun_stage()`

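When a website uses pagination and you need data from every page, call rerun_stage() from the root page rather than from each page. This lets the system parallelize the requests and makes the scraper faster.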

navigate(input.url);
let $ = html_load(html());
let next_page_url = $('.next_page').attr('href');
rerun_stage({url: next_page_url});
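
One way to queue every page from the root is sketched below. It rests on several assumptions that are not in the original: the total page count can be read from the root page, pages are addressed with a `page` query parameter, and a `page` field in the input marks runs that were already queued. `queuePages` is a hypothetical helper and `rerun_stage` is stubbed so the sketch runs outside the IDE.

```javascript
// Hypothetical stub: record what would be queued instead of scheduling it.
const queued = [];
function rerun_stage(input) { queued.push(input); }

// Queue pages 2..totalPages, but only when running on the root page.
function queuePages(input, totalPages) {
  if (input.page) return; // already a paginated run: do not fan out again
  for (let page = 2; page <= totalPages; page++) {
    const url = new URL(input.url);
    url.searchParams.set('page', String(page));
    rerun_stage({url: url.href, page});
  }
}

queuePages({url: 'https://example.com/list'}, 4);                 // root page: queues 3 runs
queuePages({url: 'https://example.com/list?page=2', page: 2}, 4); // child page: queues nothing
```

Since all page URLs are known up front, the queued runs do not depend on each other and can be processed in parallel.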

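Use `close_popup()` to close popups

Do not spend time waiting for a popup to appear. Use close_popup('popup_selector', 'close_button_selector') to close popups. A popup can appear at any time, and in most cases adding a check before every interaction command is not recommended.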

navigate('https://example.com');
try {
  wait_visible('.cky-btn-accept', {timeout: 5000});
  click('.cky-btn-accept');
} catch(e) {
  console.log('Accept cookies button does not exist, continue');
}
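
With `close_popup()`, the explicit wait disappears. The sketch below assumes that `close_popup()` registers a watcher that closes the popup whenever it shows up; the selectors `.cky-consent-container` and `.add-to-cart` are made up for illustration, and all three functions are hypothetical stubs that only record calls.

```javascript
// Hypothetical stubs that record calls instead of driving a browser.
const watchers = [];
const log = [];
function close_popup(popupSel, closeSel) { watchers.push({popupSel, closeSel}); }
function navigate(url) { log.push(`navigate ${url}`); }
function click(sel) { log.push(`click ${sel}`); }

// Register the popup handler once, then interact without waiting for the popup.
close_popup('.cky-consent-container', '.cky-btn-accept');
navigate('https://example.com');
click('.add-to-cart');
```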

Use `wait_for_parser_value()` with `tag_response()`

When you use tag_response and need to make sure the request has finished before collecting data from the page, use wait_for_parser_value():

tag_response('product', /api\/product/);
navigate('https://example.com');

// parser code
// in this case we can't be sure that the request is finished
let {product} = parser;
return product.data;
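
A sketch of the same flow with the wait added follows. It assumes `wait_for_parser_value()` takes the tagged field name and returns once that field is populated. Every function here is a hypothetical stub, and the `{id: 1}` payload is invented so the sketch has something to return.

```javascript
// Hypothetical stubs simulating a tagged response that arrives after navigation.
const parser = {};
const pending = {};
function tag_response(name, pattern) { pending[name] = pattern; }
function navigate(url) { /* the fake page fires its API request later */ }
function wait_for_parser_value(field) {
  // In the IDE this would block until the response arrives; here it resolves at once.
  if (field in pending) parser[field] = {data: {id: 1}};
}

tag_response('product', /api\/product/);
navigate('https://example.com');
wait_for_parser_value('product'); // the tagged request is finished after this line

// parser code
function parse() {
  const {product} = parser;
  return product.data;
}
const result = parse();
```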

Custom error messages

Avoid custom error messages whenever possible. The system does its best to give you the most accurate error message:

try {
  wait('selector1');
  //some code
  wait('selector2');
  //some code
} catch(e) {
  throw "Page not loaded properly"
}
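
Dropping the try-catch lets the original error through. In the sketch below, `wait` is a hypothetical stub and the error text is invented; the point is only that an unwrapped error can name the selector that failed, which the generic "Page not loaded properly" message hides. The outer handler stands in for the system that receives the error.

```javascript
// Hypothetical stub: fails with a message that names the missing selector.
const present = new Set(['selector1']);
function wait(sel) {
  if (!present.has(sel)) throw new Error(`waiting for selector "${sel}" timed out`);
}

// Scraper code with no custom error handling.
function collect() {
  wait('selector1');
  // some code
  wait('selector2');
  // some code
}

// Stand-in for the system: it sees the specific error, not a generic one.
let message = '';
try {
  collect();
} catch (e) {
  message = e.message;
}
console.log(message);
```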

Slow website response and increased timeouts

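If a website does not load properly, the cause may be a poor peer connection. In that case it is better to let the error surface, so that the system can try to reload the page through a more stable session.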

// 120 seconds is too long to wait
wait('selector', {timeout: 120000});
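
A minimal sketch of the alternative is to call `wait()` without a timeout override and let a slow load fail. The 30-second default used by the stub is an assumption taken from the "30 seconds" mentioned in the first example of this guide; `wait` itself is a hypothetical stand-in.

```javascript
// Hypothetical stub: record the timeout each wait would use.
const DEFAULT_TIMEOUT_MS = 30000; // assumed default
const waits = [];
function wait(sel, opt = {}) {
  waits.push({sel, timeout: opt.timeout ?? DEFAULT_TIMEOUT_MS});
}

// No override: a slow load fails at the default timeout and the system can retry.
wait('selector');
```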

Retry mechanism

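Scraper code should be clean and focused on the collection task itself. There is no need to reinvent the wheel. When a problem is unrelated to your code, let the system handle it and report the error.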

let counter = input.counter || 5;
while (counter > 1) {
  try {
    wait('selector', {timeout: 500});
    click('selector');
    type('selector');
    //some code
    break;
  } catch(e) {
    // not acceptable: do not use rerun_stage to create a new session on error
    return rerun_stage({...input, counter: --counter});
  }
}
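
Without the hand-written retry loop, the same steps reduce to a straight sequence. The functions below are hypothetical stubs that log calls, and the typed text is a placeholder; in the IDE, any failure in these commands would be left to the system to report and retry.

```javascript
// Hypothetical stubs that log calls instead of driving a browser.
const log = [];
function wait(sel) { log.push(`wait ${sel}`); }
function click(sel) { log.push(`click ${sel}`); }
function type(sel, text) { log.push(`type ${sel} ${text}`); }

// No counter, no loop, no rerun_stage on error.
wait('selector');
click('selector');
type('selector', 'some text');
// some code
```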

Avoid try-catch blocks

Avoiding try-catch keeps code simpler and more readable, and lets you handle potential null or undefined values more effectively.

try {
  const example = obj.prop;
} catch(e) {}
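
For the null and undefined case, standard JavaScript optional chaining (`?.`) and nullish coalescing (`??`) cover what the empty catch block was hiding. A minimal sketch, where the `'n/a'` fallback is an arbitrary choice:

```javascript
const obj = null;

// Optional chaining yields undefined instead of throwing on null or undefined.
const example = obj?.prop;

// Nullish coalescing supplies a fallback when the value is missing.
const withDefault = obj?.prop ?? 'n/a';
```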

Parser code: get values from multiple elements

Best practice is to use the more concise toArray() and map() methods instead of a traditional each() loop. This improves readability and keeps a declarative style.

const links = [];
$('.card.product-wrapper').each(function(i, el) {
  links.push({url: $(this).find('h4 a').attr('href')});
});
return links;
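
The same extraction written with `toArray()` and `map()` might look like the sketch below. To keep it runnable without installing cheerio, `$` here is a tiny hypothetical stand-in that supports only the three methods used; in parser code the real cheerio `$` would take its place and the result would be returned directly.

```javascript
// Tiny stand-in for the subset of cheerio used below (hypothetical, not cheerio itself).
const cards = [{href: '/p/1'}, {href: '/p/2'}];
const wrap = nodes => ({
  toArray: () => nodes,
  find: () => wrap(nodes),        // the fake node doubles as its own link element
  attr: name => nodes[0]?.[name],
});
const $ = arg => (typeof arg === 'string' ? wrap(cards) : wrap([arg]));

// Declarative version: no mutable array, no push inside a callback.
const links = $('.card.product-wrapper')
  .toArray()
  .map(el => ({url: $(el).find('h4 a').attr('href')}));
```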

Text normalization

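We have added a custom function, $(selector).text_sane(), to the cheerio prototype. It removes unnecessary whitespace and replaces each run with a single space. Prefer it over defining your own helper like the one below.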

$.prototype.clearText = function () {
  return this.text().replace(/\s+/g, ' ').trim();
}
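
In parser code, the built-in would be called as `$(selector).text_sane()`. To show the effect described above without the IDE, the sketch below applies an equivalent normalization to a plain string. `textSane` is a hypothetical helper that approximates the described behavior and is not the IDE's implementation.

```javascript
// Approximation of the described normalization: collapse whitespace runs, trim the ends.
function textSane(text) {
  return text.replace(/\s+/g, ' ').trim();
}

const cleaned = textSane('  Blue \n\t  Widget   2-pack ');
console.log(cleaned); // prints "Blue Widget 2-pack"
```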
