Overview
Scraper Studio IDE uses two types of code to build web scrapers:
- Interaction Code - Controls browser automation and navigation
- Parser Code - Extracts and structures data from HTML
IDE Interaction code
What It Does
Interaction code controls a real browser session to:
- Navigate to URLs
- Wait for elements to load
- Click buttons and links
- Fill out forms
- Scroll pages
- Handle dynamic content (JavaScript-heavy sites)
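A typical interaction script chains several of the commands below. The following is a minimal sketch only, assuming a hypothetical search page; the selectors and input fields are illustrative, not from the docs:

```js
// Minimal interaction-code sketch (hypothetical selectors and fields):
navigate(input.url);          // open the page from the trigger input
wait('#search');              // wait for the search box to appear
type('#search', input.query); // fill in the query
press_key('Enter');           // submit the search
wait('.results');             // wait for results to render
collect(parse());             // parse the page and collect the data
```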
Commands marked with a star ⭐ are proprietary functions developed by Bright Data.
bad_input
Mark the scraper input as bad. This will prevent any crawl retries (error_code=bad_input)
blocked
Mark the page as failed because of the website refusing access (error_code=blocked)
⭐ bounding_box
The box of coordinates that describes the area of an element (relative to the page, not the browser viewport). Only the first element matched will be measured
selector: A valid CSS selector for the element
⭐ browser_size
Returns current browser window size
⭐ capture_graphql
Capture and replay graphql requests with changed variables
- options: Params to control graphql request to capture
- url
- payload - matches key-value pairs in the target’s request payload. Replace id with the actual key name that differentiates endpoints, and provide the corresponding value to identify the specific response you want to capture
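As a hedged sketch (the exact option shape may differ; the endpoint path and key name here are hypothetical), capturing a GraphQL request might look like:

```js
// Capture the GraphQL request whose payload contains this key/value,
// so it can later be replayed with changed variables:
capture_graphql({
    url: '/graphql',
    payload: {operationName: 'ProductQuery'}, // key that identifies the endpoint
});
```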
⭐ click
Click on an element (will wait for the element to appear before clicking on it)
selector: Element selector
⭐ close_popup
Popups can appear at any time during a crawl and it’s not always clear when you should be waiting for or closing them. Add close_popup() at the top of your code to add a background watcher that will close the popup when it appears. If a popup appears multiple times, it will always be closed
- popup selector: A valid CSS selector
- close selector: A valid CSS selector
- options:
click_inside: selector of parent iframe which contains close button selector
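For example, a background watcher for a cookie-consent popup might be registered like this (selectors are hypothetical):

```js
// Register the watcher first, then navigate; the popup is closed in
// the background whenever it appears, however many times it reopens.
close_popup('.cookie-banner', '.cookie-banner .accept-btn');
navigate(input.url);
```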
collect
Adds a line of data to the dataset created by the crawler
syntax: collect(<data_line>[, <validate_fn>]);
- data_line: An object with the fields you want to collect
- validate_fn: Optional function to check that the line data is valid
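A sketch of collecting one line with a validation callback. The field names are illustrative, and the validate_fn contract is assumed here to be "throw to reject the line":

```js
// Collect a data line; the optional callback rejects invalid lines.
collect({title: 'Example product', price: 9.99}, line => {
    if (!line.title)
        throw new Error('missing title'); // assumed rejection mechanism
});
```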
console
Log messages from the interaction code
country
Configure your crawl to run from a specific country syntax: country(<code>);
code: 2-character ISO country code
dead_page
Mark a page as a dead link so you can filter it from your future collections (error_code=dead_page)
⭐ detect_block
Detects a block on the page
- resource: An object specifying the resource required for the detection (e.g. selector)
- condition: An object specifying how the resource should be processed for detection (e.g. exists, has_text)
⭐ disable_event_listeners
Stop all event listeners on the page from running. track_event_listeners() must have been called first
event_types: Specific event types that should be disabled
el_exists
Check if an element exists on page, and return a boolean accordingly
- selector: Valid CSS selector
- timeout: Timeout duration to wait for the element to appear on the page
el_is_visible
Check if element is visible on page
- selector: Valid CSS selector
- timeout: Timeout duration to wait for the element to be visible on the page
embed_html_comment
Add a comment in the page HTML. Can be used to embed metadata inside HTML snapshots.
comment: Body of the comment
⭐ font_exists
Assert the capability of the browser to render the given font family on the page syntax: font_exists(<font-family>);
⭐ freeze_page
Force the page to stop making changes. This can be used to save the page in a particular state so page snapshots that run after crawl won’t see a different page state than you see now. This command is experimental. If you see problems, please report them to support
⭐ hover
Hover over an element (will wait for the element to appear before hovering over it) syntax: hover(<selector>);
selector: Element selector
⭐ html_capture_options
Influence the process of the HTML capturing
options: An object which accepts options defining how HTML capturing should be processed (e.g. coordinate_attributes)
Image
Collect image data
src: Image URL or data:image URI string
input
Global object available to the interaction code. Provided by trigger input or next_stage() calls
job
Global object available to the interaction code. Provided by trigger input or next_stage() calls
load_html
Load html and return Cheerio instance
html: Any HTML string
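A sketch of parsing an HTML string, assuming the returned instance behaves like the result of Cheerio's load():

```js
// Load an HTML fragment and query it with Cheerio's jQuery-like API:
const $page = load_html('<ul><li>First</li><li>Second</li></ul>');
// iterate over the matched <li> elements and read their text
const items = $page('li').toArray().map(el => $page(el).text());
```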
⭐ load_more
Scroll to the bottom of a list to trigger loading more items. Useful for lazy-loaded infinite-scroll sites
selector: Selector for the element that contains the lazy-loaded items
load_sitemap
Read a list of URLs from a sitemap XML (supports sitemap indexes and .gz-compressed sitemaps; see examples)
location
Object with info about current location. Available fields: href
Money
Collect price/money data
- value: Amount of money
- currency: Currency code
⭐ mouse_to
Move the mouse to the specified (x,y) position syntax: mouse_to(<x>, <y>);
- x: Target x position
- y: Target y position
navigate
Navigate the browser to a URL syntax: navigate(<url>);
- A 404 status code will throw a dead_page error by default. Use opt.allow_status to override this
- url: A URL to navigate to
- opt: Navigate options (see examples)
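For instance, to tolerate a 404 instead of failing with dead_page (the exact allow_status shape is assumed, not confirmed here; see the examples in the IDE):

```js
// Navigate and treat a 404 response as a normal page load:
navigate(input.url, {allow_status: [404]});
```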
next_stage
Run the next stage of the crawler with the specified input
input: Input object to pass to the next browser session
parse
Parse the page data
preserve_proxy_session
Preserve proxy session across children of this page
⭐ press_key
Type special characters like Enter or Backspace in the currently focused input (usually used after typing something in a search box)
⭐ proxy_location
Configure your crawl to run from a specific location. Unless you need high resolution control over where your crawl is running from, you probably want to use country(code) instead
configuration: Object with a desired proxy location, check examples for more info
⭐ redirect_history
Returns history of URL redirects since last navigate
rerun_stage
Run this stage of the crawler again with new input
resolve_url
Returns the final URL that the given url argument leads to
url: URL string/instance
response_headers
Returns the response headers of the last page load
request
Make a direct HTTP request
url|options: the url to make the request to, or request options (see examples)
⭐ right_click
The same as click() but uses the right mouse button instead (will wait for the element to appear before clicking on it) syntax: right_click(<selector>);
selector: Element selector
run_stage
Run a specific stage of the crawler with a new browser session
- stage: Which stage to run (1 is first stage)
- input: Input object to pass to the next browser session
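A multi-stage flow typically discovers links in one stage and hands each one to the next. A sketch with hypothetical selectors and field names:

```js
// Stage 1: discover product links and queue each one for stage 2.
navigate(input.url);
wait('.product-card a');
for (const el of $('.product-card a').toArray())
{
    // resolve relative hrefs against the current page URL
    const url = new URL($(el).attr('href'), location.href).href;
    next_stage({url}); // stage 2 receives {url} as its input
}
```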
⭐ scroll_to
Scroll the page so that an element is visible. If you’re doing this to trigger loading more elements from a lazy-loaded list, use load_more(). Defaults to scrolling in a natural way, which may take several seconds. If you want to jump immediately, use {immediate: true} syntax: scroll_to(<selector>);
selector: Selector of the element you want to scroll to
⭐ scroll_to_all
Scroll through the page so that all the elements matching the selector will be visible on screen syntax: scroll_to_all(<selector>);
selector: Selector of the elements you want to scroll through
⭐ select
Pick a value from a select element syntax: select(<select>, <value>);
- selector: Element selector
- value: The value to pick
set_lines
An array of lines to add to your dataset at the end of this page crawl. Each call to set_lines() will override previous ones, and only the last set of lines will be added to the dataset (tracked per page crawl). This is a good fit when the scraper is set to collect partial data on errors: you can keep calling set_lines() with the data you have gathered so far, and the last call will be used if the page crawl throws an error syntax: set_lines(<lines>[, <validate_fn>]);
- lines: An array of data lines to add to your final dataset
- validate_fn: Optional function to check that the line data is valid (run once per line)
set_session_cookie
Sets a cookie with the given cookie data; may overwrite equivalent cookies if they exist
set_session_headers
Set extra headers for all the HTTP requests
headers: Object with extra headers in key-value format
⭐ solve_captcha
Solve any captchas shown on the page
status_code
Returns the status code of the last page load
⭐ tag_all_responses
Save the responses from all browser requests that match
- field: The name of the tagged field
- pattern: The URL pattern to match
- options: Set options.jsonp=true to parse response bodies that are in JSONP format. This will be automatically detected when possible
⭐ tag_download
Get files downloaded by the browser
url: A pattern or a string to match requests against
⭐ tag_image
Save the image url from an element
- field: The name of the tagged field
- selector: A valid CSS selector
⭐ tag_response
Save the response data from a browser request syntax: tag_response(<field>, <pattern>, <options>);
- field: The name of the tagged field
- pattern: The URL pattern to match
- options: Set options.jsonp=true to parse response bodies that are in JSONP format. This will be automatically detected when possible
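A sketch of tagging an API response made by the page itself (the URL pattern is hypothetical); the tag is registered before navigating so the matching request is captured:

```js
// Save the body of the matching XHR under the 'products' field:
tag_response('products', /\/api\/products/);
navigate(input.url);
```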
⭐ tag_screenshot
Save a screenshot of the page HTML syntax: tag_screenshot(<field>, <options>);
- field: The name of the tagged field
- options: Download options (see example)
⭐ tag_script
Extract some JSON data saved in a script on the page
- name: The name of the tagged script
- selector: The selector of the script to tag
⭐ tag_serp
Parse the current page as a search engine result page
- field: The name of the tagged field
- type: Parser type (e.g. bing, google)
⭐ tag_video
Save the video url from an element
- field: The name of the tagged field
- selector: A valid CSS selector
- opt: Download options (see example)
⭐ tag_window_field
Tag a js value from the browser page
field: The path to the relevant data
⭐ track_event_listeners
Start tracking the event listeners that the browser creates. It’s needed to run disable_event_listeners() later
⭐ type
Enter text into an input (will wait for the input to appear before typing)
- selector: Element selector
- text: Text to enter
URL
URL class from NodeJS standard “url” module
url: URL string
⭐ verify_requests
Monitor failed requests with a callback function
callback: A function which will be called on each failed request with an object in format:{url, error, type, response}
Video
Collect video data
src: Video URL
⭐ wait
Wait for an element to appear on the page
- selector: Element selector
- opt: Wait options (see examples)
⭐ wait_any
Wait for any matching condition to succeed
wait_for_parser_value
Wait for a parser field to contain a value. This can be useful after you click something to wait for some data to appear
- field: The parser value path to wait on
- validate_fn: An optional callback function to validate that the value is correct
- opt: Extra options (e.g. timeout)
⭐ wait_for_text
Wait for an element on the page to include some text
- selector: Element selector
- text: The text to wait for
⭐ wait_hidden
Wait for an element to not be visible on the page (removed or hidden)
selector: Element selector
⭐ wait_network_idle
Wait until the browser network has been idle for a given time
- timeout: Wait for the browser network to be idle for X milliseconds
- options:
  - ignore: An array of patterns to exclude requests from monitoring
  - timeout: How long the network needs to be idle, in milliseconds (default 500)
⭐ wait_page_idle
Wait until no changes are being made on the DOM tree for a given time
- timeout: Milliseconds to wait for no changes
- options: An object which can accept an ignore argument to exclude some elements from monitoring
⭐ wait_visible
Wait for an element to be visible on the page
selector: Element selector
$
Helper for jQuery-like expressions
selector: Element selector
Clicking on a selector based on text
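A sketch of targeting an element by its visible text, assuming the $ helper supports jQuery-style :contains() pseudo-selectors (the button text is hypothetical):

```js
// Click the button whose text includes "Load more":
click('button:contains("Load more")');
```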
⭐ emulate_device
View pages as a mobile device. This command will change user agent and screen parameters (resolution and device pixel ratio)
device: A string with the name of device
Here is the full list of device names
- Blackberry PlayBook
- Blackberry PlayBook landscape
- BlackBerry Z30
- BlackBerry Z30 landscape
- Galaxy Note 3
- Galaxy Note 3 landscape
- Galaxy Note II
- Galaxy Note II landscape
- Galaxy S III
- Galaxy S III landscape
- Galaxy S5
- Galaxy S5 landscape
- Galaxy S8
- Galaxy S8 landscape
- Galaxy S9+
- Galaxy S9+ landscape
- Galaxy Tab S4
- Galaxy Tab S4 landscape
- iPad
- iPad landscape
- iPad (gen 6)
- iPad (gen 6) landscape
- iPad (gen 7)
- iPad (gen 7) landscape
- iPad Mini
- iPad Mini landscape
- iPad Pro
- iPad Pro landscape
- iPad Pro 11
- iPad Pro 11 landscape
- iPhone 4
- iPhone 4 landscape
- iPhone 5
- iPhone 5 landscape
- iPhone 6
- iPhone 6 landscape
- iPhone 6 Plus
- iPhone 6 Plus landscape
- iPhone 7
- iPhone 7 landscape
- iPhone 7 Plus
- iPhone 7 Plus landscape
- iPhone 8
- iPhone 8 landscape
- iPhone 8 Plus
- iPhone 8 Plus landscape
- iPhone SE
- iPhone SE landscape
- iPhone X
- iPhone X landscape
- iPhone XR
- iPhone XR landscape
- iPhone 11
- iPhone 11 landscape
- iPhone 11 Pro
- iPhone 11 Pro landscape
- iPhone 11 Pro Max
- iPhone 11 Pro Max landscape
- iPhone 12
- iPhone 12 landscape
- iPhone 12 Pro
- iPhone 12 Pro landscape
- iPhone 12 Pro Max
- iPhone 12 Pro Max landscape
- iPhone 12 Mini
- iPhone 12 Mini landscape
- iPhone 13
- iPhone 13 landscape
- iPhone 13 Pro
- iPhone 13 Pro landscape
- iPhone 13 Pro Max
- iPhone 13 Pro Max landscape
- iPhone 13 Mini
- iPhone 13 Mini landscape
- JioPhone 2
- JioPhone 2 landscape
- Kindle Fire HDX
- Kindle Fire HDX landscape
- LG Optimus L70
- LG Optimus L70 landscape
- Microsoft Lumia 550
- Microsoft Lumia 950
- Microsoft Lumia 950 landscape
- Nexus 10
- Nexus 10 landscape
- Nexus 4
- Nexus 4 landscape
- Nexus 5
- Nexus 5 landscape
- Nexus 5X
- Nexus 5X landscape
- Nexus 6
- Nexus 6 landscape
- Nexus 6P
- Nexus 6P landscape
- Nexus 7
- Nexus 7 landscape
- Nokia Lumia 520
- Nokia Lumia 520 landscape
- Nokia N9
- Nokia N9 landscape
- Pixel 2
- Pixel 2 landscape
- Pixel 2 XL
- Pixel 2 XL landscape
- Pixel 3
- Pixel 3 landscape
- Pixel 4
- Pixel 4 landscape
- Pixel 4a (5G)
- Pixel 4a (5G) landscape
- Pixel 5
- Pixel 5 landscape
- Moto G4
- Moto G4 landscape
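A sketch of emulating one of the devices above before navigating, so the site serves its mobile layout:

```js
// Switch user agent and screen parameters to an iPhone X profile:
emulate_device('iPhone X');
navigate(input.url);
```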
Shadow DOM support
Interaction commands that accept a list of selectors support Shadow DOM interactions. When passing an array of selectors:
- One of the selectors must target the shadow host element
- All selectors that follow it in the array will be resolved inside that shadow root
- shadow-root-selector targets the element that contains the shadow root
- button-selector is resolved inside that shadow root
This applies to interaction commands that accept selectors (click, wait, type, etc.).
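A sketch of clicking a close button that lives inside a shadow root; 'my-widget' (the shadow host) and 'button.close' are hypothetical selectors:

```js
// The first selector targets the shadow host; the second is resolved
// inside that host's shadow root:
click(['my-widget', 'button.close']);
```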
Parser Functions
Overview
Parser code is responsible for extracting and structuring data from HTML content. Scraper Studio Parser code uses the pre-installed Cheerio library, which provides jQuery-like syntax for parsing HTML documents. Below you can find explanations of the available commands within the Parser code for writing a scraper using the IDE: