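This article applies only to IDE Normal V2, which is for internal use and is not available to customers.
The collect and parse commands have been removed. Data is now returned from the parser code as an object or an array and is saved to the output automatically: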
Interaction code
Parser code
// Old code
navigate("https://example.com");
collect(parse());
// New code
navigate("https://example.com");
// New code alternative
navigate("https://example.com");
tag_html("html_key");
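New commands have been added to provide access to data from the interaction code: tag_html, tag_request, and tag_graphql.
In addition, some existing commands have been updated: tag_response, tag_sitemap, and tag_all_responses. See the IDE documentation for details.
Any tag command accepts a custom name. The data is then available in the parser code as parser.YOUR_KEY.
For tag_html, the current browser location URL is saved under parser.YOUR_KEY_url.
For simple cases that need only a single tag_html, the call can be omitted and the page is saved automatically as parser.page.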
Interaction code
Parser code
navigate("https://example.com/1");
tag_html("page1");
navigate("https://example.com/2");
tag_html("page2");
navigate("https://example.com/3");
tag_html("page3");
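The key naming described above can be illustrated in plain JavaScript outside the IDE. This is only a sketch: `fakeParser` is a hand-built stand-in for the `parser` object, and it assumes that each tag_html key holds the page HTML as a string, which this article does not confirm. `extractTitle` and `collectPages` are hypothetical helper names.

```javascript
// Hypothetical stand-in for the `parser` object available to parser code.
// Assumption: each tag_html key holds raw HTML as a string, and the
// matching `_url` key holds the browser location at tagging time.
const fakeParser = {
  page1: '<html><head><title>First</title></head></html>',
  page1_url: 'https://example.com/1',
  page2: '<html><head><title>Second</title></head></html>',
  page2_url: 'https://example.com/2',
};

// Hypothetical helper: pull the <title> text out of an HTML string.
function extractTitle(html) {
  const match = /<title>([^<]*)<\/title>/i.exec(html);
  return match ? match[1].trim() : null;
}

// Build one record per tagged page, pairing each key with its `_url` companion.
function collectPages(parser, keys) {
  return keys.map(key => ({
    key,
    url: parser[`${key}_url`],
    title: extractTitle(parser[key]),
  }));
}

const pages = collectPages(fakeParser, ['page1', 'page2']);
console.log(pages);
```

In real parser code the `parser` object would come from the IDE rather than being built by hand, and the returned array would become the output.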
Sometimes the interaction code needs the parsed data in order to make a further request. See the following example:
Interaction code
Parser code
navigate("https://example.com/1");
tag_html("page1");
let page_html = html();
let page_html2 = wait_for_parser_value("page1"); // the same
let $ = load_html(page_html);
let req_id = $('.product-id').text();
tag_request("product_json", { url: "https://example.com/product/" + req_id });
Multiple results
To collect multiple results, return an array from the parser code.
Interaction code
Parser code
navigate("https://example.com/products");
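As a rough illustration of returning an array, the sketch below builds one object per product from an HTML string. The markup, the `parseProducts` name, and the use of a regular expression in place of the IDE's HTML helpers are all assumptions made so the example runs on its own. In the IDE, the array would be the return value of the parser code.

```javascript
// Hypothetical product listing markup; the real page structure is unknown.
const sampleHtml = `
  <li class="product" data-id="101">Kettle</li>
  <li class="product" data-id="102">Toaster</li>
`;

// Return one result object per product found in the HTML.
function parseProducts(html) {
  const pattern = /<li class="product" data-id="(\d+)">([^<]+)<\/li>/g;
  return Array.from(html.matchAll(pattern), ([, id, name]) => ({
    id,
    name: name.trim(),
  }));
}

const results = parseProducts(sampleHtml);
console.log(results);
```

An empty array is returned when nothing matches, so the caller does not need a separate "no results" branch.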
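Reparse
Reparse is a new feature that re-parses data that has already been collected. It is useful when you want to change the parser code without re-running the entire interaction code.
next_stage and rerun_stage
When a scraper has multiple steps, parser code can be used only in the last step. The other steps can only use next_stage. To parse content from a page in those steps, use load_html: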
Interaction code step 1
Interaction code step 2
Parser code
navigate(input.domain_url);
tag_html('html');
const $ = load_html(wait_for_parser_value('html'));
$('a.layout-categories-category__link').toArray()
  .map(v => new URL($(v).attr('href'), location.href))
  .filter(x => !x?.href.includes('home'))
  .forEach(i => {
    next_stage({ url: i });
  });
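The link handling in the step above relies on the standard `URL` constructor to resolve relative `href` values against the current page. The sketch below isolates that part in plain JavaScript. The `pageUrl` and `hrefs` values and the `resolveCategoryUrls` name are hypothetical, and the call to next_stage is left out because it exists only in the IDE.

```javascript
// Hypothetical inputs standing in for location.href and the collected href attributes.
const pageUrl = 'https://shop.example.com/catalog/';
const hrefs = [
  '/category/shoes',                        // root-relative
  'hats',                                   // relative to the current path
  '/home',                                  // filtered out below
  'https://shop.example.com/category/bags', // already absolute
];

// Resolve every href against the base URL and drop links containing "home".
function resolveCategoryUrls(links, base) {
  return links
    .map(href => new URL(href, base))
    .filter(url => !url.href.includes('home'));
}

const categoryUrls = resolveCategoryUrls(hrefs, pageUrl).map(url => url.href);
console.log(categoryUrls);
```

Note that `includes('home')` matches anywhere in the URL, so a category such as `/category/home-decor` would also be dropped. A stricter check on `url.pathname` may be preferable if that matters.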
Basic PDP scraper
Interaction code
Parser Code
let url = new URL(input.url.replace('https://www.slintel.com', 'https://6sense.com'));
url = new URL(url.pathname, 'https://6sense.com');
navigate(url);
if (location.href === 'https://6sense.com/company')
  dead_page(`Page not found`);
tag_html('html');
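The two `new URL` calls at the top of this scraper normalize the input before navigating. The sketch below reproduces that step on its own so its effect is easier to see. `normalizeCompanyUrl` is a hypothetical wrapper name and the sample inputs are invented.

```javascript
// Hypothetical wrapper around the two normalization lines of the scraper.
function normalizeCompanyUrl(rawUrl) {
  // Step 1: rewrite the legacy origin to the new one.
  const swapped = new URL(rawUrl.replace('https://www.slintel.com', 'https://6sense.com'));
  // Step 2: keep only the path and rebuild on the new origin,
  // which discards the query string, the fragment, and any other host.
  return new URL(swapped.pathname, 'https://6sense.com');
}

const fromLegacy = normalizeCompanyUrl('https://www.slintel.com/company/acme?utm=x#top').href;
const fromOtherHost = normalizeCompanyUrl('https://other.example.com/company/acme').href;
console.log(fromLegacy, fromOtherHost);
```

Both calls produce `https://6sense.com/company/acme`, which shows that the second step forces the origin even when the string replacement in the first step does not match.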
Multiple navigation example
Interaction code
Parser code
const tabs = [
  'topactivity',
  'answers',
  'questions',
  'tags',
  'articles',
  'badges',
  'bookmarks',
  'bounties'
];
function loadTab(url, name) {
  return new Promise(async (resolve, reject) => {
    try {
      const tabUrl = new URL(url);
      tabUrl.searchParams.set('tab', name);
      navigate(tabUrl, { allow_status: [404] });
      const html_ = html();
      tag_html(name);
      resolve(html_);
    } catch (e) { reject(e); }
  });
}
const userIdFromUrl = (input.url && input.url.includes('https://stackoverflow.com/users/'))
  ? input.url.replace('https://stackoverflow.com/users/', '').split('/').shift()
  : null;
const userId = input.user_id || userIdFromUrl || 1;
const userUrl = `https://stackoverflow.com/users/${userId}`;
navigate(userUrl, { allow_status: [404] });
tag_html('user');
Promise.all(tabs.map(async tabName => loadTab(userUrl, tabName)));
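The input handling in this example can be checked in isolation. The sketch below extracts the same fallback chain (explicit `user_id`, then an ID taken from the URL, then `1`) and the tab URL construction into two functions. The `resolveUserId` and `buildTabUrl` names and the sample inputs are hypothetical.

```javascript
const USERS_PREFIX = 'https://stackoverflow.com/users/';

// Prefer an explicit user_id, then an ID parsed from the URL, then 1.
function resolveUserId(input) {
  const fromUrl = (input.url && input.url.includes(USERS_PREFIX))
    ? input.url.replace(USERS_PREFIX, '').split('/').shift()
    : null;
  return input.user_id || fromUrl || 1;
}

// Build the profile URL for one tab by setting the `tab` query parameter.
function buildTabUrl(userId, tab) {
  const url = new URL(`${USERS_PREFIX}${userId}`);
  url.searchParams.set('tab', tab);
  return url.href;
}

const idFromUrl = resolveUserId({ url: 'https://stackoverflow.com/users/22656/some-name' });
const idExplicit = resolveUserId({ user_id: 42, url: 'https://stackoverflow.com/users/22656' });
const idDefault = resolveUserId({});
const answersUrl = buildTabUrl(idFromUrl, 'answers');
console.log(idFromUrl, idExplicit, idDefault, answersUrl);
```

The ID parsed from the URL is a string while the default is the number `1`. Both interpolate the same way into the template literal, so the difference only matters if the value is compared or stored elsewhere.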
Multiple tag_response calls
Interaction code
Parser code
close_popup('._1piuevz', '._1piuevz');
tag_response('stay', /\/StaysPdpSections/);
tag_response('calendar', /\/PdpAvailabilityCalendar/);
tag_response('reviews', /api\/v3\/StaysPdpReviews/);
navigate(input.url + '?enable_auto_translate=false#availability-calendar');
if (el_exists('img[src*="error_pages/404"]'))
  return dead_page('Page not found');
const room_id = /\/(\d+)(?:\?|$)/.exec(input.url)?.[1];
if (el_exists('._wgmchy ._1qx9l5ba'))
  click('._wgmchy ._1qx9l5ba');
wait('#data-state, #data-deferred-state', { timeout: 6e4 });
wait_for_parser_value('stay');
wait_for_parser_value('calendar');
wait_for_parser_value('reviews');
tag_html('html');
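The `room_id` line uses a regular expression that accepts a numeric path segment only when it is followed by a query string or the end of the URL. The sketch below wraps the same pattern in a function to show which inputs match. `extractRoomId` and the example URLs are hypothetical.

```javascript
// Same pattern as in the interaction code: digits after a slash,
// followed by "?" or the end of the string.
function extractRoomId(url) {
  return /\/(\d+)(?:\?|$)/.exec(url)?.[1];
}

const plain = extractRoomId('https://example.com/rooms/12345');
const withQuery = extractRoomId('https://example.com/rooms/12345?adults=2');
const withSubpath = extractRoomId('https://example.com/rooms/12345/photos');
const withFragment = extractRoomId('https://example.com/rooms/12345#reviews');
console.log(plain, withQuery, withSubpath, withFragment);
```

The first two calls return the ID as a string. The last two return `undefined`, because the digits are followed by `/` or `#` rather than `?` or the end of the string, so input URLs with a trailing path or fragment would need to be cleaned first.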