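The collect and parse commands have been removed. Data is now returned from the parser code as an object or an array, and it is saved to the output automatically: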
Interaction code
Parser code
// Old code
navigate("https://example.com");
collect(parse());
// New code
navigate("https://example.com");
// New code alternative
navigate("https://example.com");
tag_html("html_key");
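New commands have been added to give the parser code access to data from the interaction code: tag_html, tag_request, and tag_graphql.
Several existing commands have also been updated: tag_response, tag_sitemap, and tag_all_responses. For more details, see the IDE documentation: https://docs.brightdata.com/cn/api-reference/web-scraper-ide-api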
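When you use any tag command, you can provide a custom name. You can then access the data in the parser code under parser.YOUR_KEY.
For tag_html, the current browser location URL is also saved, under parser.YOUR_KEY_url.
For simple cases where only one tag_html is needed, the call can be skipped, and the HTML is saved automatically under parser.page.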
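The naming rules above can be pictured with a small stand-alone mock. This is only a sketch that runs in plain Node.js: createMockIde, its navigate, tag_html, and finish methods, and the fake page table are hypothetical stand-ins for the real IDE runtime, and the fallback to parser.page is modeled on the description above rather than on the IDE's actual implementation.

```javascript
// Hypothetical mock of the IDE runtime, for illustration only.
// It mimics how tag_html keys show up on the `parser` object.
function createMockIde(pages) {
  const parser = {};
  let currentUrl = null;
  let tagged = false;
  return {
    // Pretend to load a page: just remember which URL is "open".
    navigate(url) {
      currentUrl = url;
    },
    // Save the current HTML under `key` and its URL under `key_url`.
    tag_html(key) {
      parser[key] = pages[currentUrl];
      parser[`${key}_url`] = currentUrl;
      tagged = true;
    },
    // End of the interaction code: if tag_html was never called,
    // the last page is stored under the default key `page`.
    finish() {
      if (!tagged && currentUrl !== null) {
        parser.page = pages[currentUrl];
      }
      return parser;
    },
  };
}

// Invented page content for the demo.
const fakePages = { "https://example.com": "<h1>Example</h1>" };

// Named tag: data is reachable as parser.html_key and parser.html_key_url.
const named = createMockIde(fakePages);
named.navigate("https://example.com");
named.tag_html("html_key");
const namedParser = named.finish();

// No tag at all: the page falls back to parser.page.
const implicit = createMockIde(fakePages);
implicit.navigate("https://example.com");
const implicitParser = implicit.finish();

console.log(Object.keys(namedParser));
console.log(Object.keys(implicitParser));
```

The first object carries the keys html_key and html_key_url, while the second carries only page, which matches the two "New code" variants shown earlier.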
Interaction code
Parser code
navigate("https://example.com/1");
tag_html("page1");
navigate("https://example.com/2");
tag_html("page2");
navigate("https://example.com/3");
tag_html("page3");
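The Parser code for this example is not shown here. A minimal sketch of what it could look like follows, under stated assumptions: the parser object is mocked with invented HTML so the snippet runs in plain Node.js, parserCode stands in for the parser code body, and extractTitle is a hypothetical helper that uses a regular expression only to keep the example dependency-free.

```javascript
// Mocked `parser` object: in the IDE it would be filled by the three
// tag_html calls above. The HTML strings are invented for the demo.
const parser = {
  page1: "<html><title>First</title></html>",
  page2: "<html><title>Second</title></html>",
  page3: "<html><title>Third</title></html>",
};

// Hypothetical helper: pull the <title> text out of an HTML string.
function extractTitle(html) {
  const match = /<title>([^<]*)<\/title>/i.exec(html);
  return match ? match[1].trim() : null;
}

// Stand-in for the parser code body: it returns a single object,
// which the platform would then save to the output.
function parserCode(parser) {
  return {
    title1: extractTitle(parser.page1),
    title2: extractTitle(parser.page2),
    title3: extractTitle(parser.page3),
  };
}

const output = parserCode(parser);
console.log(output);
```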
Sometimes you need the parsed data inside the interaction code so that you can make a request with it. See the example:
Interaction code
Parser code
navigate("https://example.com/1");
tag_html("page1");
let page_html = html();
let page_html2 = wait_for_parser_value("page1"); // the same
let $ = load_html(page_html);
let req_id = $('.product-id').text();
tag_request("product_json", { url: "https://example.com/product/" + req_id });
Multiple results
To collect multiple results, return an array from the parser code.
Interaction code
Parser code
navigate("https://example.com/products");
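To illustrate returning an array, here is a hedged sketch that runs in plain Node.js. The parser object, the product markup, and the parserCode wrapper are all assumptions made for the demo; in the IDE the HTML would come from the tagged page, and the array would be returned directly from the parser code.

```javascript
// Mocked page HTML; in the IDE this would arrive as parser.page.
// The markup is invented for the demo.
const parser = {
  page: '<li class="product">Alpha</li><li class="product">Beta</li>',
};

// Stand-in for the parser code body.
function parserCode(parser) {
  // Collect the text of every product item.
  const names = [...parser.page.matchAll(/<li class="product">([^<]*)<\/li>/g)]
    .map((match) => match[1]);
  // Returning an array: one element per collected result.
  return names.map((name) => ({ name }));
}

const results = parserCode(parser);
console.log(results.length);
```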
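Reparse
Reparse is a new feature that lets you re-parse data that has already been collected. It is useful when you want to change the parser code without re-running the entire interaction code.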
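next_stage and rerun_stage
When a scraper has multiple steps, parser code is available only in the last step. All other steps can contain only next_stage calls. To parse something from the page in those steps, use load_html: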
Interaction code step 1
Interaction code step 2
Parser code
navigate(input.domain_url);
tag_html('html');
const $ = load_html(wait_for_parser_value('html'));
$('a.layout-categories-category__link').toArray()
  .map(v => new URL($(v).attr('href'), location.href))
  .filter(x => !x?.href.includes('home')).forEach(i => {
    next_stage({ url: i })
  });
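The link handling in step 1 relies on standard URL resolution, which can be tried outside the IDE. In this sketch, base and hrefs are hypothetical values standing in for location.href and the href attributes of the category links, and the result is collected into an array instead of being passed to next_stage. Unlike the step above, it stores the URL as a string rather than a URL object.

```javascript
// Hypothetical inputs: in the IDE, `base` would be location.href and
// `hrefs` would be read from the category link elements.
const base = "https://example.com/shop/";
const hrefs = [
  "/home",
  "/category/shoes",
  "category/bags",
  "https://example.com/home/deals",
];

const nextStageInputs = hrefs
  // Resolve relative links against the current page URL.
  .map((href) => new URL(href, base))
  // Drop every link whose absolute URL contains "home".
  .filter((url) => !url.href.includes("home"))
  // Shape each remaining link like a next_stage input.
  .map((url) => ({ url: url.href }));

console.log(nextStageInputs);
```

Only the two category links survive the filter. The root-relative one resolves against the site root and the plain relative one resolves against /shop/.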
Basic PDP scraper
Interaction code
Parser Code
let url = new URL(input.url.replace('https://www.slintel.com', 'https://6sense.com'));
url = new URL(url.pathname, 'https://6sense.com');
navigate(url);
if (location.href === 'https://6sense.com/company')
  dead_page(`Page not found`);
tag_html('html');
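The two URL lines at the top of this scraper normalize the input before navigating. The sketch below isolates that logic so it can be run in plain Node.js. rewriteUrl is a hypothetical wrapper name, and the company paths are invented sample inputs.

```javascript
// Hypothetical wrapper around the two URL lines from the scraper above.
function rewriteUrl(inputUrl) {
  // Swap the old domain for the new one.
  let url = new URL(inputUrl.replace("https://www.slintel.com", "https://6sense.com"));
  // Keep only the path: query string and fragment are dropped.
  url = new URL(url.pathname, "https://6sense.com");
  return url.href;
}

// Invented sample inputs.
const rewritten = rewriteUrl("https://www.slintel.com/company/123/acme?ref=x#top");
const listing = rewriteUrl("https://6sense.com/company");

console.log(rewritten);
console.log(listing);
```

The second call returns the bare /company URL, which is the value the scraper compares location.href against before calling dead_page.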
Multiple navigation example
Interaction code
Parser code
const tabs = [
  'topactivity',
  'answers',
  'questions',
  'tags',
  'articles',
  'badges',
  'bookmarks',
  'bounties'
];
function loadTab(url, name) {
  return new Promise(async (resolve, reject) => {
    try {
      const tabUrl = new URL(url);
      tabUrl.searchParams.set('tab', name);
      navigate(tabUrl, { allow_status: [404] });
      const html_ = html();
      tag_html(name);
      resolve(html_);
    } catch (e) { reject(e); }
  });
}
const userIdFromUrl = (input.url && input.url.includes('https://stackoverflow.com/users/'))
  ? input.url.replace('https://stackoverflow.com/users/', '').split('/').shift()
  : null;
const userId = input.user_id || userIdFromUrl || 1;
const userUrl = `https://stackoverflow.com/users/${userId}`;
navigate(userUrl, { allow_status: [404] });
tag_html('user');
Promise.all(tabs.map(async tabName => loadTab(userUrl, tabName)));
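The user ID fallback and the per-tab URLs in this example are plain JavaScript and can be checked outside the IDE. In the sketch below, resolveUserId and buildTabUrl are hypothetical helper names wrapping the same expressions, and the sample profile URL is made up.

```javascript
const USERS_PREFIX = "https://stackoverflow.com/users/";

// Hypothetical helper: same fallback chain as in the scraper above,
// explicit user_id first, then the ID parsed from the URL, then 1.
function resolveUserId(input) {
  const fromUrl = input.url && input.url.includes(USERS_PREFIX)
    ? input.url.replace(USERS_PREFIX, "").split("/").shift()
    : null;
  return input.user_id || fromUrl || 1;
}

// Hypothetical helper: build the URL for one profile tab.
function buildTabUrl(userUrl, tabName) {
  const tabUrl = new URL(userUrl);
  tabUrl.searchParams.set("tab", tabName);
  return tabUrl.href;
}

// Made-up sample input.
const userId = resolveUserId({ url: "https://stackoverflow.com/users/12345/some-name" });
const userUrl = `${USERS_PREFIX}${userId}`;
const answersUrl = buildTabUrl(userUrl, "answers");

console.log(userId);
console.log(answersUrl);
```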
Multiple tag_response
Interaction code
Parser code
close_popup('._1piuevz', '._1piuevz');
tag_response('stay', /\/StaysPdpSections/);
tag_response('calendar', /\/PdpAvailabilityCalendar/);
tag_response('reviews', /api\/v3\/StaysPdpReviews/);
navigate(input.url + '?enable_auto_translate=false#availability-calendar');
if (el_exists('img[src*="error_pages/404"]'))
  return dead_page('Page not found');
const room_id = /\/(\d+)(?:\?|$)/.exec(input.url)?.[1];
if (el_exists('._wgmchy ._1qx9l5ba'))
  click('._wgmchy ._1qx9l5ba');
wait('#data-state, #data-deferred-state', { timeout: 6e4 });
wait_for_parser_value('stay');
wait_for_parser_value('calendar');
wait_for_parser_value('reviews');
tag_html('html');
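The regular expressions in this example decide which responses get tagged and how the room ID is read from the input URL. The sketch below exercises them in plain Node.js. matchTag and extractRoomId are hypothetical helpers, the example.com URLs are invented, and the real tag_response matching inside the IDE may work differently from this simple first-match loop.

```javascript
// Same patterns as in the tag_response calls above.
const patterns = {
  stay: /\/StaysPdpSections/,
  calendar: /\/PdpAvailabilityCalendar/,
  reviews: /api\/v3\/StaysPdpReviews/,
};

// Hypothetical helper: name of the first pattern matching a request URL.
function matchTag(requestUrl) {
  for (const [name, pattern] of Object.entries(patterns)) {
    if (pattern.test(requestUrl)) return name;
  }
  return null;
}

// Hypothetical helper: digits that end the path, either before the
// query string or at the very end of the URL.
function extractRoomId(url) {
  return /\/(\d+)(?:\?|$)/.exec(url)?.[1];
}

// Invented sample URLs.
const tag = matchTag("https://example.com/api/v3/StaysPdpReviews?locale=en");
const roomId = extractRoomId("https://example.com/rooms/987654?adults=2");

console.log(tag);
console.log(roomId);
```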