Functions
Interaction Functions
This article lists and explains the available commands within the Interaction code for writing a scraper using the IDE.
Commands marked with a star ⭐ are proprietary functions developed by Bright Data.
bad_input
Mark the scraper input as bad. This will prevent any crawl retries (error_code=bad_input)
bad_input();
bad_input('Missing search term');
blocked
Mark the page as failed because the website refused access (error_code=blocked)
blocked();
blocked('Login page was shown');
⭐ bounding_box
Returns the box of coordinates that describes the area of an element (relative to the page, not the browser viewport). Only the first matched element will be measured
selector
: A valid CSS selector for the element
let box = bounding_box('.product-list');
// box == {
// top: 10,
// right: 800,
// bottom: 210,
// left: 200,
// x: 200,
// y: 10,
// width: 600,
// height: 200,
// }
⭐ browser_size
Returns the current browser window size
TBD
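A minimal illustrative sketch, assuming browser_size() returns an object with width and height fields (the return shape is not documented here):
// Assumption: the returned object exposes width and height
let size = browser_size();
console.log('window size', size.width, size.height);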
⭐ capture_graphql
Capture and replay GraphQL requests with changed variables
- options: Parameters that control which GraphQL request to capture
  - url
  - payload
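No example is listed here; a hedged sketch, assuming options takes a url pattern and a payload matcher to identify the GraphQL request (both option shapes are assumptions):
// Assumption: url/payload identify which GraphQL request to capture for later replay
capture_graphql({
    url: '/api/graphql',
    payload: {operationName: 'SearchQuery'},
});
navigate('https://example.com/search');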
⭐ click
Click on an element (will wait for the element to appear before clicking on it)
selector
: Element selector
click('#show-more');
$('#show-more').click()
// Click the closest match to the passed coordinates
// (relative to the page).
// For example, clicking the center pin in a map
let box = bounding_box('#map')
let center = {x: (box.left+box.right)/2, y: (box.top+box.bottom)/2};
click('.map-pin', {coordinates: center});
⭐ close_popup
Popups can appear at any time during a crawl, and it’s not always clear when you should wait for or close them. Add close_popup() at the top of your code to add a background watcher that will close the popup when it appears. If a popup appears multiple times, it will always be closed
- popup selector: A valid CSS selector
- close selector: A valid CSS selector
- options:
click_inside
: Selector of the parent iframe that contains the close button selector
close_popup('.popup', '.popup_close');
close_popup('iframe.with-popup', '.popup_close', {click_inside: 'iframe.with-popup'});
collect
Adds a line of data to the dataset created by the crawler
syntax: collect(<data_line>[, <validate_fn>]);
data_line
: An object with the fields you want to collect
validate_fn
: Optional function to check that the line data is valid
collect({price: data.price});
collect(product, p=>{
if (!p.title)
throw new Error('Product is missing a title');
})
console
Log messages from the interaction code
console.log(1, 'brightdata', [1, 2], {key: 'value'});
console.error(1, 'brightdata', [1, 2], {key: 'value'});
country
Configure your crawl to run from a specific country
syntax: country(<code>);
code
: 2-character ISO country code
country('us');
dead_page
Mark a page as a dead link so you can filter it from your future collections (error_code=dead_page)
dead_page();
dead_page('Product was removed');
⭐ detect_block
Detects a block on the page
resource
: An object specifying the resource required for the detection (e.g. selector)
condition
: An object specifying how the resource should be processed for detection (e.g. exists, has_text)
detect_block({selector: '.foo'}, {exists: true});
detect_block({selector: '.bar'}, {has_text: 'text'});
detect_block({selector: '.baz'}, {has_text: /regex_pattern/});
⭐ disable_event_listeners
Stop all event listeners on the page from running. track_event_listeners() must have been called first
event_types
: Specific event types that should be disabled
disable_event_listeners();
disable_event_listeners(['hover', 'click']);
el_exists
Check if an element exists on page, and return a boolean accordingly
selector
: Valid CSS selector
timeout
: Timeout duration to wait for the element to appear on the page
el_exists('#example'); // => true
el_exists('.does_not_exist'); // => false
el_exists('.does_not_exist', 5e3); // => false (after 5 seconds)
el_is_visible
Check if an element is visible on the page
- selector: Valid CSS selector
- timeout: Timeout duration to wait for the element to be visible on the page
el_is_visible('#example');
el_is_visible('.is_not_visible', 5e3); // false (after 5 seconds)
embed_html_comment
Add a comment in the page HTML. Can be used to embed metadata inside HTML snapshots.
comment
: Body of the comment
embed_html_comment('trace-id: asdf123');
⭐ font_exists
Assert that the browser is capable of rendering the given font family on the page
syntax: font_exists(<font-family>);
font_exists('Liberation Mono');
⭐ freeze_page
Force the page to stop making changes. This can be used to save the page in a particular state, so page snapshots that run after the crawl won’t see a different page state than you see now. This command is experimental. If you see problems, please report them to support
freeze_page();
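A short usage sketch with a hypothetical selector: freeze the page only after the content you care about has rendered, so later snapshots match what you collected:
wait('.product-details'); // hypothetical selector
freeze_page(); // lock the DOM in its current state for post-crawl snapshots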
⭐ hover
Hover over an element (will wait for the element to appear before hovering over it)
syntax: hover(<selector>);
selector
: Element selector
hover('#item');
⭐ html_capture_options
Influence how the page HTML is captured
options
: An object which accepts options defining how HTML capturing should be processed (e.g. coordinate_attributes)
html_capture_options({
coordinate_attributes: true,
});
Image
Collect image data
src
: Image URL or data:image URI string
let i = new Image('https://example.com/image.png');
collect({image: i});
input
Global object available to the interaction code. Provided by trigger input or next_stage() calls
navigate(input.url);
job
Global object available to the interaction code. Provided by trigger input or next_stage() calls
let {created} = job;
load_html
Load HTML and return a Cheerio instance
html
: Any HTML string
let $$ = load_html('<p id="p1">p1</p><p id="p2">p2</p>');
collect({data: $$('#p2').text()});
⭐ load_more
Scroll to the bottom of a list to trigger loading more items. Useful for lazy-loaded infinite-scroll sites
selector
: Selector for the element that contains the lazy-loaded items
load_more(<selector>);
load_more('.search-results');
load_more('.search-results', {children: '.result-item', trigger_selector: '.btn-load-more', timeout: 10000});
load_sitemap
Read a list of URLs from a sitemap XML (supports sitemap indexes and .gz compressed sitemaps; see examples)
let {pages} = load_sitemap({url: 'https://example.com/sitemap.xml.gz'});
let {children} = load_sitemap({url: 'https://example.com/sitemap-index.xml'});
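A usage sketch, assuming pages is an array of URL strings: queue every sitemap URL as its own page crawl.
// Assumption: pages is an array of URL strings
let {pages: sitemap_pages} = load_sitemap({url: 'https://example.com/sitemap.xml'});
for (let url of sitemap_pages)
    rerun_stage({url});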
location
Object with info about current location. Available fields: href
navigate('https://example.com');
location.href; // "https://example.com/"
Money
Collect price/money data
value
: Amount of money
currency
: Currency code
let p = new Money(10, 'USD');
collect({product_price: p});
⭐ mouse_to
Move the mouse to the specified (x,y) position
syntax: mouse_to(<x>, <y>);
x
: Target x position
y
: Target y position
mouse_to(0, 0);
navigate
Navigate the browser to a URL
syntax: navigate(<url>);
- A 404 status code will throw a dead_page error by default. Use opt.allow_status to override this
url
: A URL to navigate to
opt
: Navigate options (see examples)
navigate(input.url);
navigate('https://example.com');
// waits until DOM content loaded event is fired in the browser
navigate(`url`, {wait_until: 'domcontentloaded'});
// adds a referer to the navigation
navigate(`url`, {referer: `url`});
// the number of milliseconds to wait for. Default is 30000 ms
navigate(`url`, {timeout: 45000});
// Don't throw an error if this URL sends a 404 status code
navigate(`url`, {allow_status: [404]});
// Specify browser width/height
navigate(`url`, {
fingerprint: {screen: {width: 400, height: 400}},
});
next_stage
Run the next stage of the crawler with the specified input
input
: Input object to pass to the next browser session
next_stage({url: 'http://example.com', page: 1});
parse
Parse the page data
let page_data = parse();
collect({
title: page_data.title,
price: page_data.price,
});
preserve_proxy_session
Preserve proxy session across children of this page
preserve_proxy_session();
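A usage sketch, assuming pages queued from here (for example via next_stage()) count as children of this page:
preserve_proxy_session(); // assumed: the child pages queued below reuse this proxy session
next_stage({url: 'https://example.com/product/1'});
next_stage({url: 'https://example.com/product/2'});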
⭐ press_key
Type special characters like Enter or Backspace in the currently focused input (usually used after typing something in a search box)
press_key('Enter');
press_key('Backspace');
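A typical sequence with hypothetical selectors: type a query, submit it with Enter, then wait for the results:
type('#search-box', 'vacuum cleaner'); // hypothetical selector
press_key('Enter');
wait('.search-results'); // hypothetical selector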
⭐ proxy_location
Configure your crawl to run from a specific location. Unless you need high-resolution control over where your crawl is running from, you probably want to use country(code) instead
configuration
: Object with a desired proxy location, check examples for more info
proxy_location({country: 'us'});
// lat in range: [-85, 85], long in range: [-180, 180]
proxy_location({lat: 37.7749, long: -122.4194});
// radius in km
proxy_location({lat: 37.7749, long: -122.4194, country: 'US', radius: 100});
⭐ redirect_history
Returns the history of URL redirects since the last navigate
navigate('http://google.com');
let redirects = redirect_history();
// returns:
// [
// 'http://google.com',
// 'http://www.google.com',
// 'https://www.google.com/',
// ]
rerun_stage
Run this stage of the crawler again with new input
rerun_stage({url: 'http://example.com/other-page'});
resolve_url
Returns the final URL that the given url argument leads to
url
: URL string/instance
let {href} = parse().anchor_elem_data;
collect({final_url: resolve_url(href)});
response_headers
Returns the response headers of the last page load
let headers = response_headers();
console.log('content-type', headers['content-type']);
request
Make a direct HTTP request
url
|options
: The URL to make the request to, or request options (see examples)
let res = request('http://www.example.com');
let res = request({url: 'http://www.example.com', method: 'POST', headers: {'Content-type': 'application/json'}, body: {hello: 'world'}})
⭐ right_click
The same as click, but uses the right mouse button instead (will wait for the element to appear before clicking on it)
syntax: right_click(<selector>);
selector
: Element selector
right_click('#item');
run_stage
Run a specific stage of the crawler with a new browser session
- stage: Which stage to run (1 is first stage)
- input: Input object to pass to the next browser session
run_stage(2, {url: 'http://example.com', page: 1});
⭐ scroll_to
Scroll the page so that an element is visible. If you’re doing this to trigger loading more elements from a lazy-loaded list, use load_more() instead. Defaults to scrolling in a natural way, which may take several seconds. If you want to jump immediately, use {immediate: true}
syntax: scroll_to(<selector>);
selector
: Selector of the element you want to scroll to
scroll_to('.author-profile');
scroll_to('top'); // scroll to the top of the page
scroll_to('bottom'); // scroll to the bottom of the page
scroll_to('top', {immediate: true}); // jump to top of page immediately
⭐ scroll_to_all
Scroll through the page so that all the elements matching the selector will be visible on screen
syntax: scroll_to_all(<selector>);
selector
: Selector of the elements you want to scroll through
scroll_to_all('.author-profiles');
⭐ select
Pick a value from a select element
syntax: select(<selector>, <value>);
selector
: Element selector
value
: Value to select
select('#country', 'Canada');
set_lines
An array of lines to add to your dataset at the end of this page crawl. Each call to set_lines() will override previous ones, and only the last set of lines will be added into the dataset (tracked per page crawl). This is a good fit when the scraper is set to collect partial results on errors: you can keep calling set_lines() with the data you gathered so far, and the last call will be used if the page crawl throws an error
syntax: set_lines(<data_line>[, <validate_fn>]);
lines
: An array of data lines to add to your final dataset
validate_fn
: Optional function to check that the line data is valid (run once per line)
set_lines(products_so_far);
set_lines(products_so_far, i=>{
if (!i.price)
throw new Error('Missing price');
});
set_session_cookie
Sets a cookie with the given cookie data; may overwrite equivalent cookies if they exist
set_session_cookie(`domain`, `name`, `value`);
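With illustrative values (hypothetical domain, name, and value):
set_session_cookie('example.com', 'session_id', 'abc123'); // hypothetical cookie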
set_session_headers
Set extra headers for all the HTTP requests
headers
: Object with extra headers in key-value format
set_session_headers({'HEADER_NAME': 'HEADER_VALUE'});
⭐ solve_captcha
Solve any captchas shown on the page
solve_captcha();
solve_captcha({type: 'simple', selector: '#image', input: '#input'});
status_code
Returns the status code of the last page load
collect({status_code: status_code()});
⭐ tag_all_responses
Save the responses from all browser requests that match
field
: The name of the tagged field
pattern
: The URL pattern to match
options
: Set options.jsonp=true to parse response bodies that are in jsonp format. This will be automatically detected when possible
tag_all_responses(<field>, <pattern>, <options>);
tag_all_responses('resp', /url/, {jsonp: true});
tag_all_responses('resp', /url/, {allow_error: true});
tag_all_responses('profiles', /\/api\/profile/);
navigate('https://example.com/sports');
let profiles = parse().profiles;
for (let profile of profiles)
collect(profile);
⭐ tag_download
Allows getting files downloaded by the browser
url
: A pattern or a string to match requests against
let SEC = 1000;
let download = tag_download(/example.com\/foo\/bar/);
click('button#download');
let file1 = download.next_file({timeout: 10*SEC});
let file2 = download.next_file({timeout: 20*SEC});
collect({file1, file2});
⭐ tag_image
Save the image url from an element
- field: The name of the tagged field
- selector: A valid CSS selector
tag_image(field, selector);
tag_image('image', '#product-image');
⭐ tag_response
Save the response data from a browser request
syntax: tag_response(<field>, <pattern>, <options>);
name
: The name of the tagged field
pattern
: The URL pattern to match
options
: Set options.jsonp=true to parse response bodies that are in jsonp format. This will be automatically detected when possible
tag_response('resp', /url/, {jsonp: true});
tag_response('resp', /url/, {allow_error: true});
tag_response('resp', (req, res)=>{
if (req.url.includes('/api/'))
{
let request_body = req.body;
let request_headers = req.headers;
let response_body = res.body;
let response_headers = res.headers;
}
});
tag_response('teams', /\/api\/teams/);
navigate('https://example.com/sports');
let teams = parse().teams;
for (let team of teams)
collect(team);
⭐ tag_screenshot
Save a screenshot of the page HTML
syntax: tag_screenshot(<field>, <options>);
- field: The name of the tagged field
- options: Download options (see example)
tag_screenshot('html_screenshot', {filename: 'screen'});
tag_screenshot('view', {full_page: false}); // full_page defaults to true
⭐ tag_script
Extract some JSON data saved in a script on the page
syntax: tag_script(<field>, <selector>);
- name: The name of the tagged script
- selector: The selector of the script to tag
tag_script('teams', '#preload-data');
tag_script('ssr_state', '#__SSR_DATA__');
navigate('https://example.com/');
collect(parse().ssr_state);
⭐ tag_serp
Parse the current page as a search engine result page
field
: The name of the tagged field
type
: Parser type (e.g. bing, google)
tag_serp('serp_bing_results', 'bing')
tag_serp('serp_google_results', 'google')
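A usage sketch, assuming the tagged field is exposed through parse() after navigation, following the pattern shown for the other tag_ commands:
tag_serp('serp_google_results', 'google');
navigate('https://www.google.com/search?q=example'); // hypothetical search URL
collect({results: parse().serp_google_results}); // assumed field access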
⭐ tag_video
Save the video url from an element
field
: The name of the tagged field
selector
: A valid CSS selector
opt
: Download options (see example)
tag_video(field, selector);
tag_video('video', '#product-video', {download: true});
⭐ tag_window_field
Tag a JS value from the browser page
field
: The path to the relevant data
tag_window_field(field, key);
tag_window_field('initData', '__INIT_DATA__');
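A usage sketch, assuming the tagged value is exposed through parse() after navigation, as in the tag_script example:
tag_window_field('initData', '__INIT_DATA__');
navigate('https://example.com');
collect({init: parse().initData}); // assumed field access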
⭐ track_event_listeners
Start tracking the event listeners that the browser creates. It’s needed to run disable_event_listeners() later
track_event_listeners();
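A combined sketch showing the required order, with hypothetical event types:
track_event_listeners(); // must run before disable_event_listeners()
navigate('https://example.com');
disable_event_listeners(['scroll', 'click']); // hypothetical event types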
⭐ type
Enter text into an input (will wait for the input to appear before typing)
selector
: Element selector
text
: Text to enter
type(<selector>, <text>);
type('#location', 'New York');
// replacing text in input if it is not empty
type(<selector>, <text>, {replace: true});
// type text to an element with id ending "input-box" (e.g. <input id="c2E57-input-box">)
type('[id$=input-box]', <text>);
// dispatching 'Enter' key press
type(<selector>, ['Enter']);
// typing text and then dispatching 'Enter' key press
type(<selector>, ['Some text', 'Enter']);
// deleting 1 char from input
type(<selector>, ['Backspace']);
URL
URL class from NodeJS standard “url” module
url
: URL string
let u = new URL('https://example.com');
⭐ verify_requests
Monitor failed requests with a callback function
callback
: A function which will be called on each failed request with an object in the format: {url, error, type, response}
verify_requests(({url, error, type, response})=>{
if (response.status!=404 && type=='Font')
throw new Error('Font failed to load');
});
Video
Collect video data
src
: Video URL
let v = new Video('https://example.com/video.mp4');
collect({video: v});
⭐ wait
Wait for an element to appear on the page
selector
: Element selector
opt
: Wait options (see examples)
wait(<selector>);
wait('#welcome-splash');
wait('.search-results .product');
wait('[href^="/product"]');
// the number of milliseconds to wait for. Default is 30000 ms
wait(<selector>, {timeout: 5000});
// wait for element to be hidden
wait(<selector>, {hidden: true});
// wait for element inside an iframe
wait(<selector>, {inside: '#iframe_id'});
⭐ wait_any
Wait for any matching condition to succeed
wait_any(['#title', '#notfound']);
wait_for_parser_value
Wait for a parser field to contain a value. This can be useful after you click something to wait for some data to appear
field
: The parser value path to wait on
validate_fn
: An optional callback function to validate that the value is correct
opt
: Extra options (e.g. timeout)
wait_for_parser_value(<field>[, <validate_fn>][, opt]);
wait_for_parser_value('profile');
wait_for_parser_value('listings.0.price', v=>{
return parseInt(v)>0;
}, {timeout: 5000});
⭐ wait_for_text
Wait for an element on the page to include some text
selector
: Element selector
text
: The text to wait for
wait_for_text(<selector>, <text>);
wait_for_text('.location', 'New York');
⭐ wait_hidden
Wait for an element to not be visible on the page (removed or hidden)
selector
: Element selector
wait_hidden(<selector>);
wait_hidden('#welcome-splash');
wait_hidden(<selector>, {timeout: 5000});
⭐ wait_network_idle
Wait until the browser network has been idle for a given time
timeout
: Wait for browser network to be idle for X milliseconds
options
: ignore: an array of patterns to exclude requests from monitoring; timeout: how long the network needs to be idle in milliseconds (default 500)
wait_network_idle();
wait_network_idle({
timeout: 1e3,
ignore: [/long_request/, 'https://example.com'],
});
⭐ wait_page_idle
Wait until no changes are being made on the DOM tree for a given time
timeout
: Milliseconds to wait for no changes
options
: An object which can accept an ignore argument to exclude some elements from monitoring
wait_page_idle();
wait_page_idle({
ignore: [<selector1>, <selector2>],
idle_timeout: 1000,
});
⭐ wait_visible
Wait for an element to be visible on the page
selector
: Element selector
wait_visible(<selector>);
wait_visible('#welcome-splash');
wait_visible(<selector>, {timeout: 5000});
$
Helper for jQuery-like expressions
selector
: Element selector
$(<selector>);
wait($('.store-card'))
⭐ emulate_device
View pages as a mobile device. This command will change user agent and screen parameters (resolution and device pixel ratio)
device
: A string with the name of the device
emulate_device('iPhone X');
emulate_device('Pixel 2');
Parser Functions
This article lists and explains the available commands within the Parser code for writing a scraper using the IDE.
input
Global variable available to the parser code
let url = input.url;
$
An instance of cheerio
$('#example').text();
$('#example').attr('href');
$('#example').text_sane(); /* Like $().text(), but also trims the text and replaces all whitespace runs with a single space: "a b \t\n\n c" -> "a b c" */
location
A global variable available to the parser code. Object with info about current location
let current_url = location.href;
Image
Collect image data
let i = new Image('https://example.com/image.png');
collect({image: i});
Video
Collect video data
let v = new Video('https://example.com/video.mp4');
collect({video: v});
Money
Collect price/money data
let p = new Money(10, 'USD');
collect({product_price: p});