## Why maintaining anti-blocking yourself is costly

Each detection layer requires a different countermeasure, and each countermeasure requires ongoing maintenance. IP rotation means sourcing, validating, and retiring proxies. TLS fingerprinting means patching your HTTP client to randomize handshake signatures, something most libraries don't support natively. Browser fingerprinting means keeping headless browser patches current as detection systems update weekly. CAPTCHA solving means integrating third-party solvers and handling failures.

In practice, teams that build this in-house find that anti-blocking maintenance consumes a significant share of their engineering effort, often more time than building the actual data pipeline. When a target site updates its defenses, your scraper fails silently, returning empty fields or stale data while appearing to work normally. By the time you notice, your downstream systems have been ingesting bad data for hours or days.

Every Bright Data scraping product shares the same anti-blocking engine. The difference between products is what you get back and how much control you keep.

## How Bright Data handles each detection layer

**IP reputation.** Requests are routed through residential IPs from real ISPs across 195+ countries. Anti-bot systems treat these as regular household traffic. IPs are automatically rotated per request and flagged addresses are retired from the pool.

**TLS fingerprinting.** When your HTTP client opens a connection, it sends a TLS Client Hello that anti-bot systems hash into a JA3 fingerprint. Python `requests`, for example, always produces the same hash, which is a known bot signature. Bright Data generates a unique TLS fingerprint per request, matching the diversity of real browser populations.

**Browser fingerprinting.** Sites collect Canvas renders, WebGL data, installed fonts, screen dimensions, and audio context to build a device profile. Bright Data emulates complete, consistent browser environments that pass checks from Cloudflare Turnstile, Akamai Bot Manager, and other major detection systems.

**Behavioral analysis.** Anti-bot systems track request timing, navigation sequences, and interaction patterns. Bright Data varies request timing, simulates realistic navigation patterns, and manages session state to match human browsing behavior.

**CAPTCHA solving.** CAPTCHAs are solved automatically, including reCAPTCHA, hCaptcha, Cloudflare Turnstile, and others. You never see the challenge. The response arrives as if no CAPTCHA existed.

## Which product fits your situation

Each product inherits the anti-blocking capabilities described above. The difference is what you get back and how much control you need.

**I want full control over my HTTP client and just need IPs that won't get flagged.** Use [Bright Data Proxies](/proxy-networks/introduction). Four types are available: residential (highest trust, best for protected sites), datacenter (fastest, best for unprotected sites), ISP (residential trust at datacenter speed), and mobile (highest trust, best for heavily protected targets). Proxies solve IP reputation but leave the other four detection layers to you.

**I need raw HTML from protected sites without building anti-blocking myself.** Use [Web Unlocker](/scraping-automation/web-unlocker/introduction). Send a URL, get back clean HTML. All five detection layers are handled in a single API call: IP rotation, TLS fingerprinting, browser fingerprinting, behavioral emulation, and CAPTCHA solving.

**I need to navigate, click through pagination, or interact with JavaScript-heavy pages.** Use [Bright Data Browser API](/scraping-automation/scraping-browser/introduction). Full cloud-hosted browsers controlled through Puppeteer or Playwright with all anti-blocking built in. Unlike Web Unlocker, Browser API runs a real GUI browser (not headless), which produces authentic rendering artifacts that detection systems are far less likely to flag.

**I need search engine results from Google, Bing, or other engines without getting blocked.** Use [SERP API](/scraping-automation/serp-api/introduction). Search engines are among the most aggressively protected targets. SERP API handles all the anti-blocking and returns structured JSON with organic results, ads, featured snippets, and knowledge panels already parsed.

**I want structured data from popular websites without writing or maintaining parsers.** Use [Web Scraper API](/datasets/scrapers/scrapers-library/overview). 650+ pre-built scrapers for sites like Amazon, LinkedIn, Instagram, YouTube, TikTok, and Google Maps. You get clean JSON with an average of 220+ data fields per scraper. When target sites change, Bright Data updates the scrapers.

## Common misconceptions

**"Residential proxies alone solve blocking."** They handle IP reputation, but modern systems also check TLS fingerprints, browser fingerprints, and behavioral patterns. Proxies are one layer out of five.

**"Stealth plugins make headless browsers undetectable."** Cloudflare Turnstile and Akamai Bot Manager detect patched headless browsers even with Playwright Stealth applied. Browser API solves this by running a real GUI browser rather than a patched headless one.

**"Slowing down requests prevents blocks."** Rate limiting helps avoid basic IP bans, but sites like LinkedIn, Instagram, and Amazon use session-level fingerprinting that detects automation regardless of speed.

**"Proxies and Web Unlocker do the same thing."** Proxies route your requests through different IPs only. Web Unlocker also manages TLS fingerprints, solves CAPTCHAs, emulates browser behavior, handles JavaScript rendering, and retries with fresh fingerprints on failure.

## FAQs

- `"data_format": "markdown"` to receive Markdown SERP structure.
- `"brd_json=1"` to receive JSON SERP structure.

Use the `response_id` from the webhook to fetch the actual result:
```sh wrap theme={null}
curl --silent --compressed \
"https://api.brightdata.com/unblocker/get_result?response_id=
```
Send a URL, receive structured JSON/CSV. No proxies, browsers, anti-bot systems or parsing to manage. 660+ pre-built scrapers cover LinkedIn, Instagram, TikTok, Amazon, Google Maps and more.
| Flag | Description |
| ---- | ----------- |
| `--country ` | ISO country code for geo-targeting (e.g. `us`, `de`, `jp`) |
| `--zone ` | Web Unlocker zone name |
| `--mobile` | Use a mobile user agent |
| `--async` | Submit async, return a snapshot ID |
| `-o, --output ` | Write output to file |
| `--json` | Force JSON output |
| `--pretty` | Pretty-print JSON output |
```bash Markdown (default) theme={null}
brightdata scrape https://news.ycombinator.com
```
```bash HTML theme={null}
brightdata scrape https://example.com -f html
```
```bash JSON with geo-targeting theme={null}
brightdata scrape https://amazon.com -f json --country us -o product.json
```
```bash Screenshot theme={null}
brightdata scrape https://example.com -f screenshot -o page.png
```
```bash Async mode theme={null}
brightdata scrape https://example.com --async
```
```bash Pipe to markdown reader theme={null}
brightdata scrape https://docs.github.com | glow -
```
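The single-URL commands above extend naturally to a batch. The following is a minimal sketch, assuming `brightdata scrape` exits non-zero when a scrape fails; `scrape_all` and the `page-N.md` naming are hypothetical and not part of the CLI. It uses only the `-o` flag from the table above.

```bash
# Hypothetical helper (not part of the CLI): scrape every URL passed as an
# argument and save each result as <outdir>/page-N.md.
scrape_all() {
  local outdir="$1"
  shift
  mkdir -p "$outdir" || return 1
  local i=0 url
  for url in "$@"; do
    i=$((i + 1))
    # -o writes the output to a file; markdown is the default format.
    # Assumption: a failed scrape returns a non-zero exit status.
    if ! brightdata scrape "$url" -o "$outdir/page-$i.md"; then
      echo "failed: $url" >&2
    fi
  done
}
```

A call such as `scrape_all ./pages https://example.com https://news.ycombinator.com` would leave `page-1.md` and `page-2.md` in `./pages`. Failures are reported on stderr and do not stop the loop.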
***
## `brightdata search `
Search Google, Bing, or Yandex via Bright Data's SERP API. Google returns structured JSON with organic results, ads, People Also Ask, and related searches. Bing and Yandex return markdown by default.
| Flag | Description |
| --------------------- | --------------------------------------------- |
| `--engine ` | `google` (default), `bing`, `yandex` |
| `--country ` | Localized results (e.g. `us`, `de`) |
| `--language ` | Language code (e.g. `en`, `fr`) |
| `--page ` | Page number, 0-indexed (default: `0`) |
| `--type ` | `web` (default), `news`, `images`, `shopping` |
| `--device ` | `desktop`, `mobile` |
| `--zone ` | SERP zone name |
| `-o, --output ` | Write output to file |
| `--json` | Force JSON output |
| `--pretty` | Pretty-print JSON output |
```bash Basic search theme={null}
brightdata search "typescript best practices"
```
```bash Localized search theme={null}
brightdata search "restaurants berlin" --country de --language de
```
```bash News search theme={null}
brightdata search "AI regulation" --type news
```
```bash Pagination theme={null}
brightdata search "web scraping" --page 1
```
```bash Extract URLs with jq theme={null}
brightdata search "open source scraping" --json | jq -r '.organic[].link'
```
```bash Bing search theme={null}
brightdata search "bright data pricing" --engine bing
```
***
## `brightdata discover `
AI-powered web discovery. Submit a query with an optional intent, and Bright Data finds and ranks results, optionally extracting full-page content for each one.
| Flag | Description |
| --------------------------- | ---------------------------------------------------- |
| `--intent ` | AI intent to evaluate and rank result relevance |
| `--country ` | ISO country code (default: `US`) |
| `--city ` | City for localized results (e.g. `"New York"`) |
| `--language ` | Language code (default: `en`) |
| `--num-results ` | Number of results to return |
| `--filter-keywords ` | Comma-separated keywords that must appear in results |
| `--include-content` | Include full page content (markdown) in each result |
| `--no-remove-duplicates` | Keep duplicate results |
| `--start-date ` | Only content updated from date (`YYYY-MM-DD`) |
| `--end-date ` | Only content updated until date (`YYYY-MM-DD`) |
| `--timeout ` | Polling timeout (default: `600`) |
| `-o, --output ` | Write output to file |
| `--json` / `--pretty` | JSON output (raw / indented) |
```bash Basic discovery theme={null}
brightdata discover "AI trends"
```
```bash With AI intent for relevance ranking theme={null}
brightdata discover "AI trends" \
--intent "Prioritize institutional reports for VC research"
```
```bash Include full page content theme={null}
brightdata discover "AI trends" --include-content --num-results 5
```
```bash Geo-targeted with date range theme={null}
brightdata discover "best restaurants" --country US --city "New York" \
--start-date 2025-01-01 --end-date 2025-12-31
```
```bash Filter by keywords theme={null}
brightdata discover "generative AI SaaS" --filter-keywords "revenue,SaaS"
```
```bash JSON output to file theme={null}
brightdata discover "AI trends" --num-results 10 --pretty -o results.json
```
For best results with `--intent`, use a structured formula: describe your persona, what to prioritize, the depth of analysis, and what to exclude. See the [Discover API reference](/api-reference/discover) for detailed guidance.
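One way to keep intents consistent is to assemble them from the four parts named above. This is only a sketch: `build_intent` is a hypothetical helper and the sentence template is an assumption, not a format the Discover API requires.

```bash
# Hypothetical helper: compose an --intent string from the four parts the
# guidance lists. The wording of the template is an assumption.
#   $1 persona   $2 what to prioritize   $3 depth of analysis   $4 exclusions
build_intent() {
  printf 'I am %s. Prioritize %s. Depth: %s. Exclude %s.' "$1" "$2" "$3" "$4"
}
```

It could then be used as `brightdata discover "AI trends" --intent "$(build_intent "a VC analyst" "institutional reports" "executive summary" "vendor blogs")"`.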
***
## `brightdata pipelines [params...] [options]`
Extract structured data from 40+ platforms. Triggers an async collection job, polls until results are ready, and returns the data.
| Flag | Description |
| --------------------- | ------------------------------------------ |
| `--format ` | `json` (default), `csv`, `ndjson`, `jsonl` |
| `--timeout ` | Polling timeout (default: `600`) |
| `-o, --output ` | Write output to file |
| `--json` | Force JSON output |
| `--pretty` | Pretty-print JSON output |
```bash theme={null}
# List all available pipeline types
brightdata pipelines list
```
```bash LinkedIn profile theme={null}
brightdata pipelines linkedin_person_profile "https://linkedin.com/in/username"
```
```bash Amazon product theme={null}
brightdata pipelines amazon_product "https://amazon.com/dp/B09V3KXJPB"
```
```bash Amazon search theme={null}
brightdata pipelines amazon_product_search "laptop" "https://amazon.com"
```
```bash Instagram profile theme={null}
brightdata pipelines instagram_profiles "https://instagram.com/username"
```
```bash YouTube comments theme={null}
brightdata pipelines youtube_comments "https://youtube.com/watch?v=..." 50
```
```bash CSV export theme={null}
brightdata pipelines amazon_product "https://amazon.com/dp/..." --format csv -o product.csv
```
### Supported platforms
| Type | Platform | Parameters |
| ------------------------ | ---------------------- | ------------------------ |
| `amazon_product` | Amazon product page | `` |
| `amazon_product_reviews` | Amazon reviews | `` |
| `amazon_product_search` | Amazon search results | ` ` |
| `walmart_product` | Walmart product page | `` |
| `walmart_seller` | Walmart seller profile | `` |
| `ebay_product` | eBay listing | `` |
| `bestbuy_products` | Best Buy | `` |
| `etsy_products` | Etsy | `` |
| `homedepot_products` | Home Depot | `` |
| `zara_products` | Zara | `` |
| `google_shopping` | Google Shopping | `` |
| `linkedin_person_profile` | LinkedIn person | `` |
| `linkedin_company_profile` | LinkedIn company | `` |
| `linkedin_job_listings` | LinkedIn jobs | `` |
| `linkedin_posts` | LinkedIn posts | `` |
| `linkedin_people_search` | LinkedIn people search | ` ` |
| `crunchbase_company` | Crunchbase | `` |
| `zoominfo_company_profile` | ZoomInfo | `` |
| `instagram_profiles` | Instagram profiles | `` |
| `instagram_posts` | Instagram posts | `` |
| `instagram_reels` | Instagram reels | `` |
| `instagram_comments` | Instagram comments | `` |
| `facebook_posts` | Facebook posts | `` |
| `facebook_marketplace_listings` | Facebook Marketplace | `` |
| `facebook_company_reviews` | Facebook reviews | ` [num_reviews]` |
| `facebook_events` | Facebook events | `` |
| `tiktok_profiles` | TikTok profiles | `` |
| `tiktok_posts` | TikTok posts | `` |
| `tiktok_shop` | TikTok shop | `` |
| `tiktok_comments` | TikTok comments | `` |
| `x_posts` | X (Twitter) posts | `` |
| `youtube_profiles` | YouTube channels | `` |
| `youtube_videos` | YouTube videos | `` |
| `youtube_comments` | YouTube comments | ` [num_comments]` |
| `reddit_posts` | Reddit posts | `` |
| `google_maps_reviews` | Google Maps reviews | ` [days_limit]` |
| `google_play_store` | Google Play | `` |
| `apple_app_store` | Apple App Store | `` |
| `github_repository_file` | GitHub repository files | `` |
| `yahoo_finance_business` | Yahoo Finance | `` |
| `zillow_properties_listing` | Zillow | `` |
| `booking_hotel_listings` | Booking.com | `` |
Run `brightdata pipelines list` in your terminal to see all available types at any time.
***
## `brightdata status `
Check the status of an async snapshot job (from `--async` scrapes or pipeline collections).
| Flag | Description |
| --------------------- | -------------------------------- |
| `--wait` | Poll until the job completes |
| `--timeout ` | Polling timeout (default: `600`) |
| `-o, --output ` | Write output to file |
| `--json` / `--pretty` | JSON output |
```bash theme={null}
brightdata status s_abc123xyz
brightdata status s_abc123xyz --wait --pretty
brightdata status s_abc123xyz --wait --timeout 300
```
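When several async jobs are in flight, the same flags can be applied to each snapshot ID in turn. The sketch below assumes that `brightdata status --wait` exits non-zero when a job fails or times out; `wait_for_snapshots` is a hypothetical helper and the `<id>.json` file naming is arbitrary.

```bash
# Hypothetical helper: wait for each snapshot ID in turn and save its result
# to <id>.json in the current directory.
wait_for_snapshots() {
  local id rc=0
  for id in "$@"; do
    # --wait, --timeout and -o are the flags documented for `status`.
    # Assumption: a failed or timed-out job yields a non-zero exit status.
    if ! brightdata status "$id" --wait --timeout 300 -o "$id.json"; then
      echo "snapshot $id did not complete" >&2
      rc=1
    fi
  done
  return "$rc"
}
```

The function keeps going after a failure and returns non-zero at the end if any job did not complete, so it can gate a later pipeline step.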
***
## `brightdata browser`
Control a real browser session powered by Bright Data's [Scraping Browser](/scraping-automation/scraping-browser). A lightweight local daemon holds the browser connection open between commands, giving you persistent state without reconnecting on every call.
```bash theme={null}
brightdata browser [options]
```
### Global flags
These flags work with every `browser` subcommand:
| Flag | Description |
| --------------------- | ------------------------------------------------------------------------------------ |
| `--session ` | Session name for running multiple isolated sessions in parallel (default: `default`) |
| `--country ` | Geo-target by ISO country code. On `open`, changing country reconnects the browser |
| `--zone ` | Scraping Browser zone (default: `cli_browser`) |
| `--timeout ` | IPC command timeout in milliseconds (default: `30000`) |
| `--idle-timeout ` | Daemon auto-shutdown after idle (default: `600000` / 10 min) |
| `--json` / `--pretty` | JSON output |
| `-o, --output ` | Write output to file |
### Subcommands
Navigate to a URL. Starts the daemon and browser session automatically if not already running.
```bash theme={null}
brightdata browser open
brightdata browser open https://amazon.com --country us --session shop
```
| Flag | Description |
| --------------------- | ----------------------------------------------------------------------------------- |
| `--country ` | Geo-targeting. Reconnects the browser if the country changes on an existing session |
| `--zone ` | Browser zone name |
| `--idle-timeout ` | Daemon idle timeout for this session |
Capture the page as a text accessibility tree. This is the primary way AI agents read page content, and it is far more token-efficient than raw HTML.
Each interactive element gets a `ref` (e.g. `e1`, `e2`) that you pass to `click`, `type`, `fill`, and other interaction commands.
```bash theme={null}
brightdata browser snapshot
brightdata browser snapshot --compact # Interactive elements + ancestors only
brightdata browser snapshot --interactive # Interactive elements as a flat list
brightdata browser snapshot --depth 3 # Limit tree depth
brightdata browser snapshot --selector "main" # Scope to a CSS subtree
```
**Example output:**
```text theme={null}
Page: Example Domain
URL: https://example.com
- heading "Example Domain" [level=1]
- paragraph "This domain is for use in illustrative examples."
- link "More information..." [ref=e1]
```
| Flag | Description |
| ------------------ | ------------------------------------------------------------------------------- |
| `--compact` | Only interactive elements and their ancestors (70-90% fewer tokens) |
| `--interactive` | Only interactive elements, as a flat list |
| `--depth ` | Limit tree depth |
| `--selector ` | Scope snapshot to elements matching a CSS selector |
| `--wrap` | Wrap output in content boundaries (useful for AI agent prompt injection safety) |
Capture a PNG screenshot of the current viewport.
```bash theme={null}
brightdata browser screenshot
brightdata browser screenshot ./result.png
brightdata browser screenshot --full-page -o page.png
brightdata browser screenshot --base64
```
| Flag | Description |
| ------------- | ---------------------------------------------------------- |
| `[path]` | Where to save the PNG (default: temp directory) |
| `--full-page` | Capture the full scrollable page, not just the viewport |
| `--base64` | Output base64-encoded PNG data instead of saving to a file |
Interact with elements using their snapshot `ref` values.
```bash theme={null}
# Click an element
brightdata browser click e3
# Type text into a field (clears first by default)
brightdata browser type e5 "search query"
brightdata browser type e5 " more text" --append # Append to existing value
brightdata browser type e5 "search query" --submit # Press Enter after typing
# Fill a form field directly (no keyboard simulation)
brightdata browser fill e2 "user@example.com"
# Select a dropdown option by visible label
brightdata browser select e4 "United States"
# Check / uncheck a checkbox or radio button
brightdata browser check e7
brightdata browser uncheck e7
# Hover over an element
brightdata browser hover e2
```
| Flag (for `type`) | Description |
| ----------------- | ---------------------------------------------------- |
| `--append` | Append to existing value using key-by-key simulation |
| `--submit` | Press Enter after typing |
Scroll the viewport or scroll an element into view.
```bash theme={null}
brightdata browser scroll # Scroll down 300px (default)
brightdata browser scroll --direction up
brightdata browser scroll --direction down --distance 600
brightdata browser scroll --ref e10 # Scroll element into view
```
| Flag | Description |
| ------------------- | ----------------------------------------------------- |
| `--direction ` | `up`, `down`, `left`, `right` (default: `down`) |
| `--distance ` | Pixels to scroll (default: `300`) |
| `--ref ` | Scroll this element into view instead of the viewport |
Get text or HTML content from the page or a specific element.
```bash theme={null}
# Text content
brightdata browser get text # Full page text
brightdata browser get text "h1" # Text of the first h1
brightdata browser get text "#price" # Text inside #price
# HTML content
brightdata browser get html # Full page outer HTML
brightdata browser get html ".product" # innerHTML of .product
```
Inspect session state.
```bash theme={null}
# HTTP requests captured since last navigation
brightdata browser network
# Cookies for the active session
brightdata browser cookies
# Current session state
brightdata browser status
brightdata browser status --session shop --pretty
# List all active sessions
brightdata browser sessions
```
Navigation controls.
```bash theme={null}
brightdata browser back
brightdata browser forward
brightdata browser reload
```
Close a session and stop its daemon.
```bash theme={null}
brightdata browser close # Close the default session
brightdata browser close --session shop # Close a named session
brightdata browser close --all # Close all active sessions
```
Element `ref` values (e.g. `e1`, `e3`) are re-assigned on every `snapshot` call. After navigating or clicking, take a fresh snapshot before using refs again.
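Because refs go stale after each interaction, it can help to pair every action with a fresh snapshot. A minimal sketch follows; `browser_act` is a hypothetical wrapper, not a CLI subcommand, and it targets the default session only.

```bash
# Hypothetical wrapper: run one browser interaction, then immediately take a
# fresh compact snapshot so the next command works with current refs.
browser_act() {
  # Pass the subcommand and its arguments through unchanged.
  brightdata browser "$@" || return 1
  # Refs are re-assigned on every snapshot, so re-read the page now.
  brightdata browser snapshot --compact
}
```

For example, `browser_act click e3` clicks the element and prints the updated tree, and `browser_act type e5 "search query" --submit` does the same after typing. For a named session, the snapshot call would also need the matching `--session` flag.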
***
## `brightdata zones`
List and inspect Bright Data proxy zones.
```bash theme={null}
brightdata zones # List all active zones
brightdata zones info # Full details for a zone
brightdata zones --json -o zones.json # Export as JSON
brightdata zones info my_zone --pretty # Pretty-print zone info
```
***
## `brightdata budget`
View account balance and per-zone cost/bandwidth. Read-only.
| Subcommand | Description |
| ------------- | -------------------------------------- |
| *(none)* | Quick account balance |
| `balance` | Balance + pending charges |
| `zones` | Cost & bandwidth table for all zones |
| `zone ` | Detailed cost & bandwidth for one zone |
| Flag | Description |
| --------------------- | ------------------------------------------------ |
| `--from ` | Start of date range (e.g. `2024-01-01T00:00:00`) |
| `--to ` | End of date range |
| `--json` / `--pretty` | JSON output |
```bash theme={null}
brightdata budget
brightdata budget balance
brightdata budget zones
brightdata budget zone my_zone
brightdata budget zones --from 2024-01-01T00:00:00 --to 2024-02-01T00:00:00
```
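The date-range flags take timestamps in the `YYYY-MM-DDTHH:MM:SS` form shown above. As a sketch, a calendar month can be turned into a `--from`/`--to` pair with plain shell arithmetic; `month_range` is a hypothetical helper, and whether `--to` is treated as inclusive or exclusive is not stated here, so the first instant of the next month is used to mirror the example above.

```bash
# Hypothetical helper: print the --from/--to pair that covers one calendar
# month, mirroring the 2024-01-01 .. 2024-02-01 example.
#   $1 = year (e.g. 2024)   $2 = month (1-12, a leading zero is accepted)
month_range() {
  local year="$1" month="$2" next_year next_month
  month=${month#0}            # "08" -> "8" so it is not parsed as octal
  if [ "$month" -eq 12 ]; then
    next_year=$((year + 1))
    next_month=1
  else
    next_year=$year
    next_month=$((month + 1))
  fi
  printf '%s %04d-%02d-01T00:00:00 %s %04d-%02d-01T00:00:00\n' \
    --from "$year" "$month" --to "$next_year" "$next_month"
}
```

Leaving the substitution unquoted lets the shell split it into separate arguments: `brightdata budget zones $(month_range 2024 1)`.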
***
## `brightdata config`
View and manage CLI configuration.
| Subcommand | Description |
| ------------------- | ------------------ |
| *(none)* | Show all config |
| `get ` | Get a single value |
| `set ` | Set a value |
| Config Key | Description |
| ----------------------- | ------------------------------------------- |
| `default_zone_unlocker` | Default zone for `scrape` and `search` |
| `default_zone_serp` | Override zone for `search` only |
| `default_format` | Default output format: `markdown` or `json` |
| `api_url` | Override API base URL |
```bash theme={null}
brightdata config
brightdata config set default_zone_unlocker my_zone
brightdata config set default_format json
brightdata config get default_zone_unlocker
```
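The two zone keys suggest a fallback order for `search`. The sketch below illustrates that order only; it is not the CLI's implementation, `resolve_search_zone` is a hypothetical name, and treating an explicit `--zone` flag as the highest priority is an assumption.

```bash
# Illustration of the lookup order implied by the config table:
# explicit --zone value (assumed to win), then default_zone_serp,
# then default_zone_unlocker.
#   $1 = --zone value   $2 = default_zone_serp   $3 = default_zone_unlocker
resolve_search_zone() {
  local candidate
  for candidate in "$1" "$2" "$3"; do
    if [ -n "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1   # nothing configured
}
```

With no flag and both keys set, `resolve_search_zone "" "serp_zone" "unlocker_zone"` prints `serp_zone`, which matches the "override zone for `search` only" description.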
***
## `brightdata init`
Interactive setup wizard. Walks through authentication, zone selection, and default configuration.
| Flag | Description |
| --------------------- | ---------------------------- |
| `--skip-auth` | Skip the authentication step |
| `-k, --api-key ` | Provide API key directly |
```bash theme={null}
brightdata init
```
***
## `brightdata skill`
Install Bright Data AI agent skills into coding agents (Claude Code, Cursor, Copilot, etc.).
| Subcommand | Description |
| ------------ | -------------------------------------------------- |
| `add` | Interactive picker - choose skills + target agents |
| `add ` | Install a specific skill directly |
| `list` | List all available skills |
Available skills: `search`, `scrape`, `data-feeds`, `bright-data-mcp`, `bright-data-best-practices`
```bash theme={null}
brightdata skill add # Interactive picker
brightdata skill add scrape # Direct install
brightdata skill list # See what's available
```
***
## `brightdata add mcp`
Add the Bright Data MCP server to Claude Code, Cursor, or Codex. Uses the API key stored by `brightdata login`.
```bash theme={null}
brightdata add mcp # Interactive agent + scope prompts
brightdata add mcp --agent claude-code --global
brightdata add mcp --agent claude-code,cursor --project
brightdata add mcp --agent codex --global
```
| Flag | Description |
| ------------------ | --------------------------------------------------------- |
| `--agent ` | Comma-separated targets: `claude-code`, `cursor`, `codex` |
| `--global` | Install to the agent's global config file |
| `--project` | Install to the current project's config file |
### Config file locations
| Agent | Global path | Project path |
| ----------- | --------------------------------------------- | ----------------------- |
| Claude Code | `~/.claude.json` | `.claude/settings.json` |
| Cursor | `~/.cursor/mcp.json` | `.cursor/mcp.json` |
| Codex | `$CODEX_HOME/mcp.json` or `~/.codex/mcp.json` | Not supported |
The command writes the MCP server entry under `mcpServers["bright-data"]`. Existing config is preserved; only the `bright-data` key is added or replaced.
`brightdata add mcp` uses the API key stored by `brightdata login`. It does not read `BRIGHTDATA_API_KEY` or the `--api-key` flag, so run `brightdata login` first.
# Usage Examples
Source: https://docs.brightdata.com/cli/examples
Real-world workflows and recipes for the Bright Data CLI (660+ scrapers), from 1-line scrapes returning clean markdown to scheduled pipelines feeding LLMs.
## How to scrape with the CLI
### Get clean content from any website
```bash theme={null}
# Clean markdown - great for reading or feeding to LLMs
brightdata scrape https://news.ycombinator.com
# Save to file
brightdata scrape https://docs.python.org/3/tutorial/ -o python-tutorial.md
# Get raw HTML for parsing
brightdata scrape https://example.com -f html -o page.html
```
### How to scrape with geo-targeting
```bash theme={null}
# See Amazon prices as a US customer
brightdata scrape https://amazon.com/dp/B09V3KXJPB --country us
# Scrape a German news site from Germany
brightdata scrape https://spiegel.de --country de
# Mobile user agent for mobile-optimized pages
brightdata scrape https://example.com --mobile
```
### How to take screenshots
```bash theme={null}
# Full-page screenshot
brightdata scrape https://example.com -f screenshot -o homepage.png
# Screenshot from a specific country
brightdata scrape https://amazon.co.uk -f screenshot --country gb -o uk-amazon.png
```
***
## How to run search workflows
### Basic web search
```bash theme={null}
# Google search with formatted table output
brightdata search "best web scraping tools 2025"
# Get raw JSON for processing
brightdata search "typescript best practices" --json
# Pretty-print for inspection
brightdata search "AI startups" --pretty
```
### Localized and specialized search
```bash theme={null}
# Local restaurant search from Germany, in German
brightdata search "restaurants berlin" --country de --language de
# News-only results
brightdata search "AI regulation 2025" --type news
# Shopping results
brightdata search "wireless headphones" --type shopping
# Image search
brightdata search "mountain landscape wallpaper" --type images
```
### Pagination
```bash theme={null}
# First page (default)
brightdata search "web scraping tutorials"
# Second page
brightdata search "web scraping tutorials" --page 1
# Third page
brightdata search "web scraping tutorials" --page 2
```
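Since `--page` is 0-indexed, fetching the first N pages is a loop from 0 to N-1. A minimal sketch, assuming `--json` and `-o` can be combined on `search` as they are elsewhere in the CLI; `search_pages` and the `page-N.json` naming are hypothetical.

```bash
# Hypothetical helper: fetch the first N result pages for a query and save
# each one as JSON. --page is 0-indexed, so the first page is --page 0.
#   $1 = query   $2 = number of pages   $3 = output directory (default: .)
search_pages() {
  local query="$1" pages="$2" outdir="${3:-.}" p=0
  mkdir -p "$outdir" || return 1
  while [ "$p" -lt "$pages" ]; do
    brightdata search "$query" --page "$p" --json -o "$outdir/page-$p.json" || return 1
    p=$((p + 1))
  done
}
```

`search_pages "web scraping tutorials" 3 ./serp` would write `page-0.json` through `page-2.json` and stop at the first failed request.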
***
## How to run discovery workflows
### How to run AI-powered research
```bash theme={null}
# Basic discovery with formatted table output
brightdata discover "AI trends"
# With AI intent for targeted relevance ranking
brightdata discover "AI trends" \
--intent "Prioritize institutional reports for VC research"
# Include full page content for each result
brightdata discover "AI trends" --include-content --num-results 5
```
### How to discover with geo-targeting
```bash theme={null}
# Local results with date filtering
brightdata discover "best restaurants" --country US --city "New York" \
--start-date 2025-01-01 --end-date 2025-12-31
# Filter results by required keywords
brightdata discover "generative AI SaaS" --filter-keywords "revenue,SaaS"
```
### Export discovery results
```bash theme={null}
# Save as JSON
brightdata discover "AI trends" --num-results 10 --pretty -o results.json
# Pipe-friendly - redirected stdout outputs JSON automatically
brightdata discover "AI trends" --include-content --num-results 3 > results.json
```
***
## How to run browser workflows
### Navigate and read pages
```bash theme={null}
# Open a page and read its content
brightdata browser open https://example.com
brightdata browser snapshot
# Compact snapshot for AI agents (70-90% fewer tokens)
brightdata browser snapshot --compact
# Scope to a specific section
brightdata browser snapshot --selector "main"
```
### Interact with pages
```bash theme={null}
# Open a page
brightdata browser open https://example.com
# Read the page structure
brightdata browser snapshot --compact
# Click, type, and submit using refs from the snapshot
brightdata browser click e3
brightdata browser type e5 "search query" --submit
# Get updated snapshot after interaction
brightdata browser snapshot --compact
# Take a screenshot for visual verification
brightdata browser screenshot ./result.png
# Clean up
brightdata browser close
```
### How to compare multiple sessions
```bash theme={null}
# Open the same page from different countries
brightdata browser open https://amazon.com --session us --country us
brightdata browser open https://amazon.com --session de --country de
# Capture both snapshots
brightdata browser snapshot --session us --json > us.json
brightdata browser snapshot --session de --json > de.json
# Close all sessions
brightdata browser close --all
```
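The same comparison generalizes to any list of countries by using the country code as the session name. This is a sketch built from the commands above; `snapshot_by_country` is a hypothetical helper and the `<country>.json` output naming is arbitrary.

```bash
# Hypothetical helper: open one URL from several countries, one named session
# per country, and save each snapshot to <country>.json.
#   $1 = URL, remaining arguments = ISO country codes
snapshot_by_country() {
  local url="$1" cc
  shift
  for cc in "$@"; do
    brightdata browser open "$url" --session "$cc" --country "$cc" || return 1
    brightdata browser snapshot --session "$cc" --json > "$cc.json" || return 1
  done
  # Clean up every session that was opened.
  brightdata browser close --all
}
```

After `snapshot_by_country https://amazon.com us de jp`, the resulting files can be compared with ordinary tools such as `diff us.json de.json`.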
### Extract content from dynamic pages
```bash theme={null}
# Navigate to a page that requires JavaScript
brightdata browser open https://example.com/dashboard
# Get text content from specific elements
brightdata browser get text "#total-revenue"
brightdata browser get text ".summary-table"
# Get HTML for parsing
brightdata browser get html ".product-grid"
# Check network requests
brightdata browser network
```
***
## Structured data extraction
### How to extract e-commerce data
```bash theme={null}
# Amazon product details
brightdata pipelines amazon_product "https://amazon.com/dp/B09V3KXJPB"
# Amazon product reviews
brightdata pipelines amazon_product_reviews "https://amazon.com/dp/B09V3KXJPB"
# Amazon search - requires keyword + domain
brightdata pipelines amazon_product_search "wireless headphones" "https://amazon.com"
# Walmart product
brightdata pipelines walmart_product "https://walmart.com/ip/123456"
# Export as CSV
brightdata pipelines amazon_product "https://amazon.com/dp/B09V3KXJPB" --format csv -o product.csv
```
### Social media profiles
```bash theme={null}
# LinkedIn person
brightdata pipelines linkedin_person_profile "https://linkedin.com/in/username"
# LinkedIn company
brightdata pipelines linkedin_company_profile "https://linkedin.com/company/bright-data"
# Instagram profile
brightdata pipelines instagram_profiles "https://instagram.com/username"
# TikTok profile
brightdata pipelines tiktok_profiles "https://tiktok.com/@username"
```
### Reviews and comments
```bash theme={null}
# Google Maps reviews - last 7 days
brightdata pipelines google_maps_reviews "https://maps.google.com/maps/place/..." 7
# YouTube comments - top 50
brightdata pipelines youtube_comments "https://youtube.com/watch?v=dQw4w9WgXcQ" 50
# Facebook company reviews - 25 reviews
brightdata pipelines facebook_company_reviews "https://facebook.com/company" 25
# Instagram comments
brightdata pipelines instagram_comments "https://instagram.com/p/ABC123"
```
***
## Piping and automation
The CLI is designed to be pipe-friendly. When stdout is not a TTY, colors and spinners are automatically disabled.
### Chain search → scrape
```bash theme={null}
# Search Google, extract the first URL, then scrape it
brightdata search "best python frameworks 2025" --json \
| jq -r '.organic[0].link' \
| xargs brightdata scrape
```
### Scrape and read in terminal
```bash theme={null}
# Pipe markdown output to a terminal reader
brightdata scrape https://docs.github.com | glow -
# Or use less
brightdata scrape https://docs.github.com | less
```
### Export to CSV for analysis
```bash theme={null}
# Amazon product data to CSV
brightdata pipelines amazon_product "https://amazon.com/dp/B09V3KXJPB" --format csv > product.csv
# LinkedIn jobs to CSV
brightdata pipelines linkedin_job_listings "https://linkedin.com/jobs/view/123" --format csv -o jobs.csv
```
### Extract specific fields with jq
```bash theme={null}
# Get just the titles and prices from Amazon search
brightdata pipelines amazon_product_search "laptop" "https://amazon.com" \
| jq '[.[] | {title, price: .final_price}]'
# Get just organic result URLs from search
brightdata search "web scraping" --json | jq -r '.organic[].link'
```
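When `jq` is not available, the same field selection can be done in Python. The following is a minimal sketch, assuming the JSON shapes implied by the `jq` filters above: a list of product objects carrying `title` and `final_price`, and a search object with an `organic` list whose items carry `link`. The function names are hypothetical and the sample payloads are made up for illustration.

```python
import json


def titles_and_prices(products):
    """Mirror `jq '[.[] | {title, price: .final_price}]'` on a parsed product list."""
    return [
        {"title": item.get("title"), "price": item.get("final_price")}
        for item in products
    ]


def organic_links(search_result):
    """Mirror `jq -r '.organic[].link'` on a parsed search result."""
    return [entry["link"] for entry in search_result.get("organic", [])]


if __name__ == "__main__":
    # Made-up sample payloads; real CLI output is assumed to contain more fields.
    products = json.loads('[{"title": "Laptop A", "final_price": 499.0, "rating": 4.5}]')
    search = json.loads('{"organic": [{"link": "https://example.com/a"}]}')
    print(titles_and_prices(products))
    print(organic_links(search))
```

In a script, the input would typically come from the CLI's stdout (for example via `json.load(sys.stdin)`) rather than from inline strings.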
### Async jobs for heavy workloads
```bash theme={null}
# Submit an async scrape
JOB_ID=$(brightdata scrape https://heavy-page.com --async --json | jq -r '.snapshot_id')
# Do other work...
# Check back later
brightdata status "$JOB_ID" --wait --pretty
```
***
## How to manage your account
### Monitor costs
```bash theme={null}
# Quick balance check
brightdata budget
# Detailed balance with pending charges
brightdata budget balance
# Cost breakdown by zone
brightdata budget zones
# Specific zone in a date range
brightdata budget zone cli_unlocker --from 2024-01-01T00:00:00 --to 2024-02-01T00:00:00
```
### Manage configuration
```bash theme={null}
# View current config
brightdata config
# Set default output to JSON
brightdata config set default_format json
# Use a custom zone for scraping
brightdata config set default_zone_unlocker my_custom_zone
# Override SERP zone
brightdata config set default_zone_serp my_serp_zone
```
***
## AI agent integration
### Add the MCP server to your coding agent
```bash theme={null}
# Interactive - choose agent and scope
brightdata add mcp
# Add to Claude Code globally
brightdata add mcp --agent claude-code --global
# Add to multiple agents at once
brightdata add mcp --agent claude-code,cursor --project
# Add to Codex
brightdata add mcp --agent codex --global
```
### Install skills into coding agents
```bash theme={null}
# Interactive picker - choose skills and target agents
brightdata skill add
# Install the scraping skill into your agent
brightdata skill add scrape
# Install search capabilities
brightdata skill add search
# See all available skills
brightdata skill list
```
Skills are pre-packaged bundles of prompts and configuration that teach AI coding agents how to use Bright Data effectively. See [Skills](/ai/for-agents/skills) for more details.
***
## Supported environment variables
Override any stored configuration with environment variables:
| Variable | Purpose |
| ---------------------------- | ------------------------------------------------------ |
| `BRIGHTDATA_API_KEY` | API key (skips login entirely) |
| `BRIGHTDATA_UNLOCKER_ZONE` | Default Web Unlocker zone |
| `BRIGHTDATA_SERP_ZONE` | Default SERP zone |
| `BRIGHTDATA_POLLING_TIMEOUT` | Polling timeout in seconds |
| `BRIGHTDATA_BROWSER_ZONE` | Default Scraping Browser zone (default: `cli_browser`) |
```bash theme={null}
# Use in CI/CD without login
BRIGHTDATA_API_KEY=your_key brightdata scrape https://example.com
# Override timeout for large pipeline jobs
BRIGHTDATA_POLLING_TIMEOUT=1200 brightdata pipelines amazon_product "https://amazon.com/dp/..."
```
# Bright Data CLI FAQs
Source: https://docs.brightdata.com/cli/faqs
FAQs and troubleshooting for the Bright Data CLI: brightdata and bdata aliases, authentication, installation issues and common errors across 660+ scrapers.
## Frequently asked general questions
`brightdata` and `bdata` are identical - `bdata` is a shorthand alias installed alongside `brightdata`. Use whichever you prefer.
Yes. The CLI connects to Bright Data's infrastructure to handle web requests. You can [sign up for free](https://brightdata.com/?hs_signup=1\&utm_source=docs) and get started with the free tier.
The CLI supports 40+ platforms through the `pipelines` command, including Amazon, LinkedIn, Instagram, TikTok, YouTube, Facebook, Reddit, Google Maps, Walmart, eBay, and many more. Run `brightdata pipelines list` to see the full list.
* **Scrape:** `markdown` (default), `html`, `json`, `screenshot`
* **Search:** Formatted table (default), `json`, `pretty`
* **Pipelines:** `json` (default), `csv`, `ndjson`, `jsonl`
All commands support `-o` followed by a file path to write output to a file.
Yes. The CLI is fully pipe-friendly. When stdout is not a TTY, colors and spinners are automatically disabled. Use `--json` for machine-readable output and the `BRIGHTDATA_API_KEY` environment variable for non-interactive authentication.
## Authentication
Credentials are stored locally on your machine:
| OS | Path |
| ------- | --------------------------------------------------------------- |
| macOS | `~/Library/Application Support/brightdata-cli/credentials.json` |
| Linux | `~/.config/brightdata-cli/credentials.json` |
| Windows | `%APPDATA%\brightdata-cli\credentials.json` |
The file is set to mode `0o600` (owner read/write only).
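A script can locate the credentials file and confirm its permissions. This is a minimal sketch built from the paths in the table above. The helper names `credentials_path` and `is_owner_only` are hypothetical, and the sketch assumes the Linux path is always under `~/.config` (it does not consult `XDG_CONFIG_HOME`).

```python
import os
import platform
import stat
from pathlib import PurePosixPath, PureWindowsPath


def credentials_path(system, home="", appdata=""):
    """Return the documented credentials.json location for an OS name.

    `system` uses the values returned by platform.system():
    "Darwin" (macOS), "Linux", or "Windows".
    """
    if system == "Darwin":
        base = PurePosixPath(home) / "Library/Application Support"
        return str(base / "brightdata-cli" / "credentials.json")
    if system == "Linux":
        base = PurePosixPath(home) / ".config"
        return str(base / "brightdata-cli" / "credentials.json")
    if system == "Windows":
        return str(PureWindowsPath(appdata) / "brightdata-cli" / "credentials.json")
    raise ValueError(f"unsupported system: {system}")


def is_owner_only(mode):
    """True when the permission bits of a stat mode are exactly 0o600."""
    return stat.S_IMODE(mode) == 0o600


if __name__ == "__main__":
    path = credentials_path(
        platform.system(),
        home=os.path.expanduser("~"),
        appdata=os.environ.get("APPDATA", ""),
    )
    if os.path.exists(path):
        print(path, "owner-only:", is_owner_only(os.stat(path).st_mode))
    else:
        print("no credentials file at", path)
```

The permission check is only meaningful on macOS and Linux; Windows does not expose POSIX mode bits in the same way.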
Use the device flow:
```bash theme={null}
brightdata login --device
```
This prints a URL and a verification code. Open the URL on any device with a browser, enter the code, and authentication completes on the server.
Run `brightdata logout` followed by `brightdata login` with the new account. Or pass a different API key directly:
```bash theme={null}
brightdata login --api-key YOUR_API_KEY
```
## Troubleshooting
This means no default zone is configured. Fix it by either:
```bash theme={null}
# Re-run login (auto-creates zones)
brightdata login
# Or set a zone manually
brightdata config set default_zone_unlocker my_custom_zone
```
Your stored API key is no longer valid. Re-authenticate:
```bash theme={null}
brightdata login
```
Your API key does not have permission for the requested zone or operation. Check zone permissions in the [Bright Data control panel](https://brightdata.com/cp).
You've hit the rate limit for your zone. Options:
* Wait a moment and retry
* Use `--async` for large jobs to avoid blocking
* Contact your account manager to increase limits
The default polling timeout is 600 seconds (10 minutes). For large datasets, increase it:
```bash theme={null}
# Per-command
brightdata pipelines amazon_product "https://amazon.com/dp/..." --timeout 1200
# Via environment variable
export BRIGHTDATA_POLLING_TIMEOUT=1200
```
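To illustrate what a polling timeout means in practice, here is a generic sketch of polling with a deadline. It is not the CLI's implementation: `poll_until_done` and its `check` callback are hypothetical names, and the 600-second default simply mirrors the documented value.

```python
import time


def poll_until_done(check, timeout=600.0, interval=5.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Call `check()` until it returns a non-None result or `timeout` seconds pass.

    `clock` and `sleep` are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout
    while True:
        result = check()
        if result is not None:
            return result
        if clock() >= deadline:
            raise TimeoutError(f"job not finished after {timeout} seconds")
        sleep(interval)
```

Raising the timeout only extends how long the caller waits; it does not change how long the job itself takes.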
The CLI auto-detects TTY support. If detection fails, pipe through `cat` to force plain output:
```bash theme={null}
brightdata scrape https://example.com | cat
```
Or use `--json` for clean, parseable output.
## Frequently asked configuration questions
```bash theme={null}
brightdata config set default_format json
```
Valid values: `markdown`, `json`.
Settings are resolved in this order (highest priority first):
1. **CLI flags** - e.g., `--zone my_zone`
2. **Environment variables** - e.g., `BRIGHTDATA_UNLOCKER_ZONE`
3. **config.json** - e.g., `default_zone_unlocker`
4. **Defaults** - built-in fallbacks
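The resolution order can be sketched as a small lookup function. This is an illustration only, not the CLI's code: `resolve_setting` is a hypothetical name, the pairing of config keys with environment variables is inferred from the examples in these docs, and the `cli_unlocker` fallback is an assumption based on the zone that login auto-creates.

```python
# Assumed built-in fallback, for illustration only.
DEFAULTS = {"default_zone_unlocker": "cli_unlocker"}

# Config key -> environment variable, inferred from the documented examples.
ENV_VARS = {
    "default_zone_unlocker": "BRIGHTDATA_UNLOCKER_ZONE",
    "default_zone_serp": "BRIGHTDATA_SERP_ZONE",
}


def resolve_setting(key, cli_flags, env, config, defaults=DEFAULTS):
    """Return the first value found, checking sources from highest to lowest priority."""
    # 1. CLI flags
    if cli_flags.get(key) is not None:
        return cli_flags[key]
    # 2. Environment variables
    env_name = ENV_VARS.get(key)
    if env_name and env.get(env_name):
        return env[env_name]
    # 3. config.json
    if config.get(key) is not None:
        return config[key]
    # 4. Built-in defaults
    return defaults.get(key)


if __name__ == "__main__":
    import os
    print(resolve_setting("default_zone_unlocker", {}, os.environ, {}))
```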
Delete the configuration directory:
```bash theme={null}
# Linux
rm -rf ~/.config/brightdata-cli/
# macOS
rm -rf ~/Library/Application\ Support/brightdata-cli/
```
Then run `brightdata login` to start fresh.
# Installation & Setup
Source: https://docs.brightdata.com/cli/installation
Install the Bright Data CLI (660+ scrapers) via npm or curl and authenticate your account with an API key in 1 minute. Includes setup verification commands.
## Install the CLI
```bash npm (recommended) theme={null}
npm install -g @brightdata/cli
```
```bash yarn theme={null}
yarn global add @brightdata/cli
```
```bash pnpm theme={null}
pnpm add -g @brightdata/cli
```
Verify the installation:
```bash theme={null}
brightdata --version
```
The shorthand alias `bdata` is also available after installation - use whichever you prefer.
## Authenticate
Run the login command to connect your Bright Data account:
```bash theme={null}
brightdata login
```
This opens your browser for secure OAuth authentication. Once complete, the CLI:
1. Validates and stores your API key locally
2. Auto-creates required proxy zones (`cli_unlocker`, `cli_browser`)
3. Sets sensible defaults so you can start immediately
You only need to log in **once**. All subsequent commands authenticate automatically.
### Alternative authentication methods
When no browser is available, use the device flow:
```bash theme={null}
brightdata login --device
```
This prints a URL and a code. Open the URL on any device, enter the code, and the CLI completes authentication.
For CI/CD pipelines or non-interactive environments, pass your API key directly:
```bash theme={null}
brightdata login --api-key YOUR_API_KEY
```
You can find your API key in the [Bright Data control panel](https://brightdata.com/cp/setting).
Set the `BRIGHTDATA_API_KEY` environment variable to skip login entirely:
```bash theme={null}
export BRIGHTDATA_API_KEY=YOUR_API_KEY
```
This is useful for Docker containers, GitHub Actions, and other automated environments.
## Interactive setup wizard
For a guided first-time experience, use the init command:
```bash theme={null}
brightdata init
```
This walks you through authentication, zone selection, and default configuration step by step.
| Flag | Description |
| --------------------- | --------------------------------------------------- |
| `--skip-auth` | Skip the authentication step (if already logged in) |
| `-k, --api-key ` | Provide API key directly |
## Verify your setup
After logging in, confirm everything is working:
```bash theme={null}
# Check your configuration
brightdata config
# Verify API connectivity
brightdata budget
# Try a quick scrape
brightdata scrape https://example.com
```
## Where configuration is stored
The CLI stores credentials and configuration locally:
| OS | Path |
| ------- | ----------------------------------------------- |
| macOS | `~/Library/Application Support/brightdata-cli/` |
| Linux | `~/.config/brightdata-cli/` |
| Windows | `%APPDATA%\brightdata-cli\` |
Two files are created:
| File | Purpose | Permissions |
| ------------------ | --------------------------------- | -------------------- |
| `credentials.json` | API key | `0o600` (owner-only) |
| `config.json` | Zones, output format, preferences | Standard |
**Priority order for configuration:** CLI flags → Environment variables → `config.json` → Defaults. You can always override any setting on a per-command basis.
## Next steps
Explore the full command reference.
Jump into real-world workflows and recipes.
# Bright Data CLI
Source: https://docs.brightdata.com/cli/overview
Scrape websites, search the web, extract structured data from 40+ platforms, and manage your Bright Data account from the terminal.
Install the CLI, log in once, and start scraping - CAPTCHAs, anti-bot protections, and JavaScript rendering are handled automatically.
## What can you do with the CLI?
Get clean markdown, HTML, JSON, or screenshots from any URL - anti-bot bypass and JS rendering included.
Query Google, Bing, or Yandex and get structured results with organic listings, ads, and People Also Ask.
AI-powered web discovery - find, rank, and extract content by intent with relevance scoring.
Pull product details, profiles, reviews, and more from 40+ platforms like Amazon, LinkedIn, Instagram, and TikTok.
Navigate, click, type, screenshot, and read pages using a real remote browser session with persistent state.
Install skills or add the Bright Data MCP server into Claude Code, Cursor, Codex, and other AI coding agents.
List proxy zones, inspect configurations, and monitor account balance and per-zone costs.
JSON output, file export, and pipe-friendly design make it easy to chain commands and build workflows.
## What the CLI looks like
```bash theme={null}
# Scrape a page as clean markdown
brightdata scrape https://news.ycombinator.com
# Search Google and get structured results
brightdata search "best web scraping tools 2025"
# AI-powered discovery with relevance ranking
brightdata discover "AI trends" --intent "Prioritize institutional reports for VC research"
# Extract an Amazon product as JSON
brightdata pipelines amazon_product "https://amazon.com/dp/B09V3KXJPB" --format json
# Control a remote browser session
brightdata browser open https://example.com
brightdata browser snapshot --compact
# Add Bright Data MCP server to your coding agent
brightdata add mcp
```
The CLI alias `bdata` is available as a shorthand - e.g., `bdata scrape https://example.com`.
## How it works
The Bright Data CLI wraps the full Bright Data web data platform into simple terminal commands. Under the hood, it:
1. **Authenticates once** via OAuth, device flow, or API key - credentials are stored locally and never need to be entered again
2. **Auto-provisions zones** (`cli_unlocker`, `cli_browser`) on first login so you can start immediately
3. **Routes requests** through Bright Data's infrastructure, handling CAPTCHAs, bot detection, IP rotation, and JavaScript rendering
4. **Returns clean output** - formatted tables in the terminal, or structured JSON/CSV/markdown for automation
## Explore CLI resources
Install the CLI and authenticate with Bright Data.
Full reference for every command, flag, and option.
Real-world workflows and recipes for common tasks.
Answers to common questions and troubleshooting tips.
**Zero configuration required.** After a one-time `brightdata login`, every command works out of the box - no tokens to manage, no zones to create, no proxies to configure.
# Archive API reference
Source: https://docs.brightdata.com/datasets/archive/api-reference
API reference for Bright Data's Web Archive search, including endpoints for running queries and retrieving archived web data.
# How data range affects delivery time
Source: https://docs.brightdata.com/datasets/archive/data-range-vs-delivery-time
Understand how your selected data range affects delivery time in Bright Data's Web Archive, including S3 Glacier retrieval delays for older data.
If your query matches data from the **last 24 hours**, your snapshot starts processing and delivering immediately.
If some of your matched data is **older than 24 hours**, it must be retrieved from the **S3 Glacier Deep Archive** storage tier before delivery, which may take **up to 72 hours**.
Avoid queries that span the retention boundary (approximately 24 hours from now).
Requests with `max_age` or time ranges that fall within \~24h ± 2h of the current time may include files that have already been migrated to the archive storage tier. Attempting a dump for such queries can cause the dump to stall or remain incomplete while those files transition between storage classes.
**Recommendations:**
* For initial testing: use `max_age` = `24h` to ensure fast delivery.
* For real-time data needs: use `max_age: "24h"` or a narrower window to avoid the retention edge.
* For historical data (older than 24h): use explicit `min_date`/`max_date` filters rather than `max_age`.
* If a dump appears stalled: it is usually retried automatically. If that does not happen, please open a ticket.
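A client-side check can flag time windows that touch the retention boundary before a dump is requested. This is a minimal sketch under stated assumptions: `classify_range` is a hypothetical helper, not part of the Archive API, and it reads the "\~24h ± 2h" guidance literally as a 22h to 26h risk zone. Under that strict reading a window reaching back exactly 24 hours is also flagged, so treat both constants as tunable.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(hours=24)  # approximate boundary stated in the docs
MARGIN = timedelta(hours=2)      # the "± 2h" uncertainty around it (assumed symmetric)


def classify_range(min_date, max_date, now):
    """Classify a query window relative to the ~24h retention boundary.

    Returns "recent" if the whole window is safely newer than the boundary,
    "archived" if it is safely older, and "spans_boundary" otherwise.
    """
    if min_date > max_date:
        raise ValueError("min_date must not be after max_date")
    newest_risky = now - (RETENTION - MARGIN)  # 22 hours ago
    oldest_risky = now - (RETENTION + MARGIN)  # 26 hours ago
    if min_date >= newest_risky:
        return "recent"
    if max_date <= oldest_risky:
        return "archived"
    return "spans_boundary"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(classify_range(now - timedelta(hours=12), now, now))
```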
# FAQ: Archive API
Source: https://docs.brightdata.com/datasets/archive/faqs
FAQs about the Bright Data Archive API: available data, freshness, refresh cadence, pricing model and 4 supported delivery destinations for archive snapshots.
Archive API is a massive, continuously expanding, cached repository by Bright Data, designed to capture and deliver public web data at scale.
It provides full web pages and metadata, making it ideal for AI training, machine learning, and large-scale data analysis.
Unlike traditional web crawls, Archive API prioritizes **relevance**, **freshness**, and **usability**, giving you access to the most important parts of the internet as they are scraped daily.
Bright Data’s Archive API has already collected \~90 PB of data, covering \~500 billion+ unique URLs from \~380 million domains.
We continue to add \~1 PB of new data every week, alongside \~2 billion unique URLs, making Archive the **largest** up-to-date web data repository available and a good fit for AI and data-driven applications.
You can start accessing data immediately through our [Archive API](https://docs.brightdata.com/datasets/archive/overview). The API allows you to search, retrieve and filter data snapshots from Archive seamlessly and efficiently.
* Data from the last 24 hours: delivered within minutes to a few hours (depending on snapshot size)
* Data older than 24 hours: takes up to 72 hours to process and deliver (depending on snapshot size)
Archive offers three delivery options to ensure seamless integration into your existing workflows:
* **Amazon S3 bucket:** Have your Data Snapshot delivered directly to your S3 bucket.
* **Azure Blob Storage:** Deliver the data snapshot directly to your Azure Blob container.
* **Webhook:** Receive the data snapshot via webhook for real-time integration into your systems.
Absolutely! Archive API allows filtering by category, domains, date, languages, and country before retrieving data, ensuring you only get what you need.
When working with large-scale web data, **freshness**, **relevance**, and **accessibility** are key. While Common Crawl provides a broad historical snapshot of the web, Bright Data’s Archive API offers real-time, continuously updated data with advanced filtering and delivery options. Here’s how they compare:
| **Feature** | **Bright Data’s Archive** | **Common Crawl** |
| -------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------- |
| **Data Collection** | Continuously captures public web data in real time, providing results as recent as “now.” | Periodic web crawling (not real-time), updated monthly or bimonthly. Data can be outdated. |
| **Data Volume** | \~90 PB collected in a few years, covering \~500 billion pages across \~380M domains. Adds \~1 PB and \~2 billion unique URLs/week. | 250 billion pages collected over 18 years. |
| **Website Coverage & Relevance** | Focuses on high-value, relevant website data based on real scraping business needs. | Crawls indiscriminately, including outdated or low-quality pages. |
| **Data Types** | Full web pages (JS-rendered) | 98.6% HTML and text |
| **Filtering & Delivery** | Full discovery and delivery platform: filtering by category, domain, language, date, etc. Delivered via Amazon S3, Azure, or webhook. | No built-in filtering or delivery. You need to manually process huge raw WARC files. |
# Web Archive Overview
Source: https://docs.brightdata.com/datasets/archive/overview
The Web Archive gives access to Bright Data's stored web traffic (250+ domains), a growing repository of pages collected through Unlocker and SERP APIs.
## What it does
Instead of running your own crawlers, you search the archive, filter what you need (by time range, domain, URL patterns, language, blocking signals), and export ready-to-use datasets as HTML files + metadata.
## Common use cases
* **LLM training and RAG pipelines**: Build or refresh training corpora from targeted web segments
* **Search and indexing**: Backfill indexes with historical content across large domain sets
* **Search product augmentation**: Improve coverage for sites with advanced blocking, supporting reliable page retrieval at scale
## How it works
Filter by time range, domains, URL patterns, language, or signals (CAPTCHA, robots blocks, etc.)
See matched file count, snapshot size, expected duration, and cost
Export the snapshot as HTML files + metadata (URL, timestamp, collection attributes) to Amazon S3, Azure Blob Storage, or via webhook
# Data validation for Customers
Source: https://docs.brightdata.com/datasets/data-validation/data-validation-for-customers
How Bright Data's automated dataset validation works across 250+ domains for customers: verification checks, error handling and approval flow before delivery.
## Overview
Bright Data’s automated dataset creation platform consists of a verification and approval phase before the dataset delivery. Bright Data facilitates error handling, validation checks, and customization, ensuring data accuracy and reliability. These validation checks are crucial in saving time, reducing errors in the data, and keeping the data quality at the desired level.
## How does it work?
Once the dataset snapshot is ready:
If all tests pass, the user gets the dataset with an indication in the Bright Data Control Panel that all tests passed.
If any test fails, the developer reviews the issues and decides whether to:
1. Fix the dataset according to the failed tests.
2. Deliver the dataset to the user, explaining why the validation test failed but was overridden.
The user can then decide to:
1. Approve the snapshot.
2. Approve the snapshot for this time frame only.
3. Reject the snapshot, and we will fix the scraper accordingly.
When the user approves the dataset snapshot, they proceed to the delivery phase.
## Which validation rules apply
The dataset must contain a certain percentage of unique values.
* Example: In the LinkedIn company profiles dataset, each company's LinkedIn URL should be unique. If duplicate URLs exist, the same company is listed more than once, violating the uniqueness rule.
The dataset must contain a minimum percentage of filled values.
* Example: In a LinkedIn company profiles dataset, at least 90% of the profiles must have the 'Industry' field filled. If more than 10% of the profiles are missing this information (leaving the 'Industry' field blank), the dataset does not meet the required filling rate.
Certain fields must be filled; an error will occur if they remain empty.
* Example: Fields like 'Company Name' and 'Headquarters Location' might be mandatory in the LinkedIn dataset. Any profile lacking this information is flagged as an error.
Compared to previous values collected, the number value must not change by more than X.
* Example: If the dataset is updated regularly, the number of employees for a company shouldn’t drastically change between updates (e.g., a sudden jump from 50 to 5000 employees) unless there's a known reason (like a merger).
Verifies each entry's data type against its field type (e.g., string, number, date) and flags mismatches for correction.
* Example: The dataset should only accept date formats in the 'Foundation Date' field. If a text string like 'unknown' is entered, it should be flagged for correction.
Create a custom rule to validate whether a field exists and its value is valid, such as requiring the size string to be 'S,' 'M,' or 'L.'
* Example: The dataset might have a field for 'Company Size' with accepted values like 'Small,' 'Medium,' and 'Large.' A record must be flagged if it has a value outside these options.
The dataset must have X records (each URL should have X records from the total URL inputs).
* Example: If the dataset aims to represent companies in a specific sector, like technology, it must have a minimum number of company profiles from this sector to be considered complete and representative.
Determines whether dataset size fluctuations are within a +/- X% range.
* Example: For datasets updated monthly, the total number of companies listed shouldn’t fluctuate wildly (e.g., more than 10% increase or decrease) from one month to the next unless a specific event or trend is affecting the industry.
Checks each individual record in the dataset to ensure it doesn't have a high percentage of empty or null fields. If a record has more than a predetermined threshold (e.g., 70%) of its fields empty or null, it triggers an error.
* Example: In the LinkedIn company profiles dataset, if a specific company's profile has more than 70% of its fields (like industry, size, location, description) empty, this rule would flag it as incomplete.
Detects and resolves issues that lead to multiple duplicate records in the dataset due to improper identity assignment or entry errors. It ensures each record is distinct and accurately represents a unique data point.
* Example: In the LinkedIn dataset, this rule would identify cases where the same company is listed multiple times due to errors in assigning unique identifiers. For instance, if slight variations in company profiles (like different spellings of a company name) result in the same company appearing as separate entries, this rule would flag them for correction.
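Several of the rules above reduce to simple percentage checks over the records. The following is a minimal sketch of how such checks might be computed. It is not Bright Data's implementation: the function names are hypothetical, and treating `None` and the empty string as "empty" is an assumption.

```python
def _is_empty(value):
    """Assumed definition of an unfilled field."""
    return value is None or value == ""


def unique_rate(records, field):
    """Percentage of distinct values among the non-empty values of `field`."""
    values = [r.get(field) for r in records if not _is_empty(r.get(field))]
    if not values:
        return 0.0
    return 100.0 * len(set(values)) / len(values)


def fill_rate(records, field):
    """Percentage of records with a non-empty `field`."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if not _is_empty(r.get(field)))
    return 100.0 * filled / len(records)


def mostly_empty_records(records, fields, threshold=70.0):
    """Records where more than `threshold` percent of `fields` are empty."""
    flagged = []
    for r in records:
        empty = sum(1 for f in fields if _is_empty(r.get(f)))
        if 100.0 * empty / len(fields) > threshold:
            flagged.append(r)
    return flagged


def size_within_range(previous_count, current_count, max_change_pct):
    """True when the record count changed by at most +/- `max_change_pct` percent."""
    if previous_count == 0:
        return current_count == 0
    change = 100.0 * abs(current_count - previous_count) / previous_count
    return change <= max_change_pct
```

For example, a fill-rate rule with a 90% threshold on an `industry` field would pass when `fill_rate(records, "industry") >= 90`.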
## Main components and functionality
### Evaluating the validation test results
Once the dataset snapshot validation errors are handled, the user is notified to evaluate and choose whether to:
1. Approve
2. Approve temporarily
3. Reject the snapshot.
### How to act on evaluation results
For each failed validation test, the user has three options:
1. **Set a new threshold**
   1. Set custom values - If the developer doesn't reach the default set value, the user can choose a new threshold. The snapshot is returned to the developer once a new threshold is set.
   2. Set for X% - Accept the success rate the developer reached and set the threshold to the value the developer managed to extract.
2. **Ignore test (one time only)** - Accept the value that the developer extracted just once (the default value will not change for the next dataset snapshot).
3. **Reject** - The user doesn’t accept the adjustments to the failed tests; the issues are returned to the developer to fix. The status is marked as “Rejected” for additional fixes, and the snapshot is later re-sent to the customer for approval.
If all issues are ignored or approved, click “Deliver dataset” to deliver the snapshot.
A snapshot is delivered automatically after it has been pending customer approval for 14 days.
If some or all issues are rejected, click “send back to the developer” to return the snapshot for additional fixes.
## Communications and notifications
Users are notified of status updates through their account in the control panel and via email.
# Data validation for Partners
Source: https://docs.brightdata.com/datasets/data-validation/data-validation-for-partners
How Bright Data's data staging and validation process works across 250+ domains for partners: error handling, approval workflows and quality checks.
## Overview
We are introducing a new data staging process, enabling both developers and owners to verify and approve datasets before delivery.
The system will facilitate error handling, validation checks, and customization, ensuring accuracy and reliability.
It will help everyone save time, reduce the number of open tickets, and maintain the level of quality we want.
## How does it work?
Once the dataset snapshot is ready, we run its validation tests.
If all tests pass, the customer gets the dataset snapshot with an indication in the Control Panel (CP) that all tests passed.
If any test fails, the partner reviews the issues and chooses one of the following:
1. Fix the failed tests.
2. Deliver the snapshot as is to the customer (**with an explanation of why the test failed but was overridden**).
The customer can then decide (per failed test or as a batch) to:
1. Accept it as is.
2. Accept it as is for this specific snapshot only.
3. Reject it and return it to the partner to fix what needs to be fixed.
Once the customer approves, the snapshot goes to the delivery phase.
## The validation tests
Ensures a minimum percentage of unique values in the dataset.
Mandates a minimum percentage for filled values.
Makes a field mandatory once filled; triggers an error if left empty afterward.
The numeric value must not change by more than X compared to previously collected values.
Checks each entry's data type against its field type (e.g., string, number, date) to ensure integrity and flag mismatches for correction before processing.
Establish a custom rule to validate if the specific field exists and the field value is valid, such as requiring the size string to be 'S', 'M', or 'L'; any other value is considered an error.
Requires a minimum of X records for the initial dataset (each URL should yield at least X records from the total URL inputs).
Validates fluctuations within a +/- X% range.
## Main components and functionality
There are three filter options (All results, Passed, Failed) for dataset test view
This will allow you to reparse the cached data
An option to override a test because the data seems healthy, or because it makes sense that the specific dataset does not meet the threshold. Note: if you override a test, you must write an explanation for the customer.
If a test failed, click “show related records” to see examples of the records, then click “View\&Edit code” to open the IDE and start fixing the issues.
Reparse:
Reparse involves reprocessing the existing raw data that has already been collected.
It applies the parsing logic, rules, or transformations to the data without fetching fresh data from the web.
Reparse is useful when changes are needed in data structure, new fields are added to parsing logic, or existing data needs to be restructured or re-extracted.
Recrawl:
Recrawl involves revisiting the source website to collect new or updated data.
It fetches fresh data directly from the source to reflect recent changes or additions on the website.
Recrawling is useful when the content on the source website changes frequently, or when up-to-date data is critical for analysis or reporting.
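The difference can be sketched in a few lines. This is a conceptual illustration only, not the platform's implementation: `reparse`, `recrawl`, `fetch`, `parse`, and `raw_cache` are hypothetical names, with `fetch` and `parse` supplied by the caller.

```python
def reparse(raw_cache, parse):
    """Re-apply parsing logic to raw pages that were already collected.

    No network access: only the cached raw data is read.
    """
    return {url: parse(html) for url, html in raw_cache.items()}


def recrawl(urls, fetch, parse, raw_cache):
    """Fetch fresh pages from the source, refresh the cache, then parse them."""
    results = {}
    for url in urls:
        html = fetch(url)      # the only step that touches the source website
        raw_cache[url] = html  # later reparses will see the fresh content
        results[url] = parse(html)
    return results
```

A change to the parsing logic, such as a new field, only needs `reparse`; a change on the website itself needs `recrawl`.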
Once you finish working on the needed fix, you’ll be able to Reparse or recrawl according to your needs
This will allow you to run the validation tests again if needed
This will allow you to redirect to the IDE in context in case you need to edit the collector and recrawl (e.g in case there are no records at all)
After reviewing the needed data and fixing or ignoring issues, click this button to send the snapshot to the owner for review. Each ignored test should include an explanation of why you chose to ignore it.
If the owner rejects any of the tests, the issues are sent back to you, marked with the “Rejected” label, for additional fixes and re-sending to the owner for approval.
## Communications and notifications
Status changes and additional notifications will be presented to you as a megaphone on the CP.
## Tickets and bugs
With the staging process in place, fixing a collector is no longer the end of a bug or issue.
The process consists of two steps:
1. Fix the collector
2. Fix the snapshot
Hence, the bug processing workflow will be amended to align with the new 2-step process.
Tickets related to validation issues should not be allowed to be marked as “resolved” before the snapshot is delivered to the dataset owner!
## Changes to the flow
We are removing the “Resolve” button from the ticket options.
# Code Examples
Source: https://docs.brightdata.com/datasets/deep-lookup/code-examples
Python, Node.js and shell examples for the Bright Data Deep Lookup API: submit a query, poll for results and parse the structured response end-to-end.
```python Python - Complete Research Flow theme={null}
import requests
import time


class DeepLookupAPI:
    def __init__(self, api_key):
        self.api_key = api_key
        self.base_url = "https://api.brightdata.com/datasets/deep_lookup/v1"
        self.headers = {"Authorization": f"Bearer {api_key}"}

    def research_with_spec(self, query, columns, limit=100):
        # Create detailed specification
        spec = {
            "name": "companies",
            "query": query,
            "title": query.replace("Find all ", ""),
            "columns": columns
        }
        # Trigger research
        trigger_response = requests.post(
            f"{self.base_url}/trigger",
            headers=self.headers,
            json={
                "query": query,
                "spec": spec,
                "result_limit": limit
            }
        ).json()
        request_id = trigger_response["request_id"]
        # Poll for completion
        while True:
            status_response = requests.get(
                f"{self.base_url}/request/{request_id}/status",
                headers=self.headers
            ).json()
            print(f"Progress: {status_response.get('progress', 0)}%")
            if status_response["status"] == "completed":
                break
            elif status_response["status"] == "failed":
                raise Exception("Research failed")
            time.sleep(5)
        # Get results
        results = requests.get(
            f"{self.base_url}/request/{request_id}",
            headers=self.headers
        ).json()
        return results

    def monitor_progress(self, request_id):
        """Monitor detailed progress of a research request"""
        while True:
            result = requests.get(
                f"{self.base_url}/request/{request_id}",
                headers=self.headers
            ).json()
            step = result.get('step', 'unknown')
            if result.get('status') == 'failed':
                # Stop polling instead of looping forever on a failed request
                raise Exception("Research failed")
            if step == 'identifying':
                print("Analyzing your query...")
            elif step == 'generating_schema':
                print("Creating data structure...")
            elif step == 'generating':
                pages = result.get('pages_read', 0)
                matched = result.get('matched_records', 0)
                print(f"Processing data: {pages} pages read, {matched} matches found")
            elif step == 'done':
                print("Research completed!")
                return result
            time.sleep(3)


# Usage
api = DeepLookupAPI("YOUR_API_KEY")
columns = [
    {
        "name": "company_name",
        "description": "Name of the company",
        "type": "enrichment"
    },
    {
        "name": "is_ai_company",
        "description": "Must be an AI/ML focused company",
        "type": "constraint"
    },
    {
        "name": "employee_count",
        "description": "Number of employees",
        "type": "enrichment"
    },
    {
        "name": "min_50_employees",
        "description": "Must have at least 50 employees",
        "type": "constraint"
    }
]
results = api.research_with_spec(
    "Find all AI companies in Israel with more than 50 employees",
    columns,
    limit=100
)
print(f"Found {results['matched_records']} companies")
print(f"Skipped {results['skipped_records']} companies (didn't match all criteria)")
print(f"Total cost: {results['total_cost']}")
```
```javascript Node.js - Preview and Execute with Progress Monitoring theme={null}
const axios = require('axios');

class DeepLookupAPI {
  constructor(apiKey) {
    this.apiKey = apiKey;
    this.baseURL = 'https://api.brightdata.com/datasets/deep_lookup/v1';
    this.headers = {
      'Authorization': `Bearer ${apiKey}`,
      'Content-Type': 'application/json'
    };
  }

  async previewAndExecute(query, limit = 100) {
    // Create preview
    const previewResponse = await axios.post(
      `${this.baseURL}/preview`,
      { query },
      { headers: this.headers }
    );
    const previewId = previewResponse.data.preview_id;
    // Wait for preview to complete
    let previewData;
    do {
      await new Promise(resolve => setTimeout(resolve, 2000));
      const response = await axios.get(
        `${this.baseURL}/preview/${previewId}`,
        { headers: this.headers }
      );
      previewData = response.data;
    } while (previewData.status !== 'completed');
    console.log('Preview ready with', previewData.sample_data.length, 'samples');
    // Trigger full research
    const triggerResponse = await axios.post(
      `${this.baseURL}/trigger`,
      {
        preview_id: previewId,
        result_limit: limit
      },
      { headers: this.headers }
    );
    const requestId = triggerResponse.data.request_id;
    // Monitor detailed progress
    let lastStep = '';
    let result;
    do {
      await new Promise(resolve => setTimeout(resolve, 3000));
      const response = await axios.get(
        `${this.baseURL}/request/${requestId}`,
        { headers: this.headers }
      );
      result = response.data;
      if (result.step !== lastStep) {
        lastStep = result.step;
        switch (result.step) {
          case 'identifying':
            console.log('Analyzing query...');
            break;
          case 'generating_schema':
            console.log('Creating data structure...');
            break;
          case 'generating':
            console.log('Collecting data from sources...');
            break;
          case 'done':
            console.log('Research completed!');
            break;
        }
      }
      if (result.step === 'generating' && result.matched_records) {
        console.log(` Found ${result.matched_records} matches so far...`);
      }
    } while (result.step !== 'done' && result.status !== 'failed');
    return result;
  }

  async enrichResults(requestId, columnName, columnQuery) {
    // Add enrichment column
    const enrichResponse = await axios.post(
      `${this.baseURL}/request/${requestId}/enrich`,
      {
        column_name: columnName,
        query: columnQuery
      },
      { headers: this.headers }
    );
    console.log(`Adding "${columnName}" column...`);
    console.log(`Maximum additional cost: ${enrichResponse.data.max_additional_cost}`);
    // Wait for enrichment to complete
    // (Implementation would depend on actual API behavior)
    return enrichResponse.data;
  }
}

// Usage
const api = new DeepLookupAPI('YOUR_API_KEY');

async function runResearch() {
  const results = await api.previewAndExecute(
    'Find all B2B marketplaces in Europe',
    50
  );
  console.log(`Found ${results.matched_records} marketplaces`);
  console.log(`Skipped ${results.skipped_records} that didn't match criteria`);
  console.log(`Cost: ${results.total_cost}`);
  // Add an enrichment column
  await api.enrichResults(
    results.request_id,
    'ceo_name',
    'CEO or founder name'
  );
}

runResearch();
```
# Error Handling
Source: https://docs.brightdata.com/datasets/deep-lookup/errors
Reference for Deep Lookup API error codes and response formats, including INVALID_QUERY and other common errors with resolution guidance.
## Error Response Format
```json theme={null}
{
"error": {
"code": "INVALID_QUERY",
"message": "Query must start with 'Find all'",
"details": "Please rephrase your query to begin with 'Find all' followed by what you're looking for"
}
}
```
## Common Error Codes
| Code | Description | Resolution |
| :--------------------- | :------------------------------- | :-------------------------- |
| `INVALID_API_KEY` | API key is missing or invalid | Check your API key |
| `RATE_LIMIT_EXCEEDED` | Too many requests | Wait and retry with backoff |
| `INVALID_QUERY` | Query format is incorrect | Follow query guidelines |
| `INSUFFICIENT_CREDITS` | Account has insufficient credits | Add credits to account |
| `REQUEST_NOT_FOUND` | Request ID doesn't exist | Verify request ID |
| `PROCESSING_ERROR` | Internal processing error | Contact support |
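A client might act on these codes roughly as sketched below, assuming responses use the error envelope shown under Error Response Format. The names `DeepLookupError`, `parse_error` and `call_with_backoff`, the `send` callable, and the retry parameters are illustrative and not part of the Deep Lookup API. Only `RATE_LIMIT_EXCEEDED` is treated as retryable, following the table's "Wait and retry with backoff" guidance.

```python
import time

# Resolution hints copied from the table above.
RESOLUTIONS = {
    "INVALID_API_KEY": "Check your API key",
    "RATE_LIMIT_EXCEEDED": "Wait and retry with backoff",
    "INVALID_QUERY": "Follow query guidelines",
    "INSUFFICIENT_CREDITS": "Add credits to account",
    "REQUEST_NOT_FOUND": "Verify request ID",
    "PROCESSING_ERROR": "Contact support",
}
RETRYABLE = {"RATE_LIMIT_EXCEEDED"}


class DeepLookupError(Exception):
    """Hypothetical exception type wrapping an error envelope."""

    def __init__(self, code, message):
        hint = RESOLUTIONS.get(code, "Unknown error code")
        super().__init__(f"{code}: {message} ({hint})")
        self.code = code


def parse_error(body):
    """Return a DeepLookupError if `body` carries an error envelope, else None."""
    error = body.get("error") if isinstance(body, dict) else None
    if not error:
        return None
    return DeepLookupError(error.get("code", "UNKNOWN"), error.get("message", ""))


def call_with_backoff(send, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call `send()` (assumed to return the parsed JSON body) and retry retryable errors."""
    for attempt in range(max_attempts):
        body = send()
        err = parse_error(body)
        if err is None:
            return body
        if err.code not in RETRYABLE or attempt == max_attempts - 1:
            raise err
        sleep(base_delay * 2 ** attempt)  # exponential backoff: 1s, 2s, 4s, ...
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays. In practice `send` would wrap a `requests` call that returns `response.json()`.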
# Deep Lookup Documentation
Source: https://docs.brightdata.com/datasets/deep-lookup/overview
Deep Lookup is an AI-powered research tool for finding companies, professionals, and entities from 1,000+ public sources with 95%+ accuracy.
## Introduction
Deep Lookup is an AI-powered research tool that allows you to search the public web like a database. Find any set of companies or professionals with precision and get table-ready structured data. With Deep Lookup, you can quickly identify real-world entities such as professionals, companies, products, news and articles, locations, and events that meet specific filters and criteria.
## Why Deep Lookup?
Turn weeks of manual research into minutes. Deep Lookup searches 1,000+ sources simultaneously, delivering comprehensive results instantly.
Our advanced AI models validate data across multiple sources, ensuring the highest quality results for critical business decisions.
No complex query builders or SQL knowledge needed. Simply describe what you're looking for in plain English, starting with "Find all..."
Never pay for unmatched results. Our transparent pricing means you only pay for the verified data you receive, not for skipped or filtered-out records.
## Key Features
**Universal Entity Search**\
Find companies, professionals, products, news, locations, and events using specific, measurable criteria.
**Smart Data Enrichment**\
Automatically add valuable data points to your results:
* Contact information and emails
* Revenue and employee counts
* Technology stacks and funding data
* Social profiles and company details
**Preview Mode**\
Get 10 free sample records to validate your approach before running the full research.
**Post-Run Enhancement**\
Add new data columns even after your initial research is complete.
**Source Transparency**\
See exactly where each data point comes from with full source attribution.
**Real-Time Extraction**\
Access the most current data available from the public web, not outdated databases.
## How Deep Lookup Works
Begin with "Find all" followed by what you're looking for.
The more specific your query, the better your results. Deep Lookup works best when you provide clear, measurable criteria.
Get 10 free sample records to validate your approach before running the full research.
## Understanding Column Types
When building your research in Preview Mode, you'll work with two types of columns:
### How enrich columns work
**Purpose:** Add new attributes to each result using data from the public web.
These columns provide additional information about your entities without filtering results. For example:
* CEO name for a company
* Revenue figures
* Contact information
* Social media profiles
* Technology stack used
### How filter columns work
**Purpose:** Limit results by applying specific conditions to your query.
These columns determine which records are included in your dataset. For example:
* Companies with revenue greater than \$10 million
* Professionals with 5+ years experience
* Products priced under \$100
* Articles published in the last 30 days
## Understanding "Skipped" Results
When you see "skipped" in a cell, it means the candidate entity did not match at least one of your filter criteria. This is how Deep Lookup ensures precision in your results.
### Important Points About Skipped Results:
* **More filters = fewer results**: The more filters you apply, the more precise your results become, but you may get fewer total matches
* **Each filter must be satisfied**: All filter conditions must be met for an entity to be included
* **Optimization tip**: If you're getting too many skipped results, consider removing some filters to broaden your search
* **Quality over quantity**: Skipped results ensure you only pay for data that exactly matches your criteria
### Example:
If you search for "Find all SaaS companies in California with >100 employees AND revenue >\$10M AND founded after 2020":
* A company with 150 employees and \$15M revenue but founded in 2019 would be skipped
* Only companies meeting ALL criteria appear in your final results
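
The rule in this example can be sketched as an all-filters-must-pass check. This only illustrates the logic and does not show how Deep Lookup evaluates candidates internally. The field names and the sample company are hypothetical.

```python
# Each filter is a (label, predicate) pair; thresholds mirror the example query.
FILTERS = [
    ("more than 100 employees", lambda c: c["employees"] > 100),
    ("revenue above $10M", lambda c: c["revenue_usd"] > 10_000_000),
    ("founded after 2020", lambda c: c["founded"] > 2020),
]


def evaluate(candidate):
    """Return ("matched", []) or ("skipped", [labels of the filters that failed])."""
    failed = [label for label, passes in FILTERS if not passes(candidate)]
    return ("skipped", failed) if failed else ("matched", [])


company = {"employees": 150, "revenue_usd": 15_000_000, "founded": 2019}
print(evaluate(company))  # ('skipped', ['founded after 2020'])
```

One failed filter is enough to skip the candidate, which is why adding filters tends to reduce the number of matches.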
## Deep Lookup Modes
### Preview Mode (Recommended)
The best way to ensure accurate results before spending credits:
* **Free preview results** - Get up to 10 sample records at no cost to understand expected outcomes
* **Query refinement** - Deep Lookup's AI assistant helps you refine and improve query accuracy
* **Column customization** - Review and adjust column names, types, and specifications before running
* **Specs panel control** - Tweak column settings, reorder fields, change types, or add new columns
* **Runtime estimates** - See processing time and cost estimates before committing
### When to use Instant Mode
For fast, one-shot results when you know exactly what you need:
* Run queries immediately for simple, well-defined searches
* Best for repeat queries with proven parameters
* Skip the preview step when you're confident in your criteria
### When to use Advanced Mode
For complex, multi-step research:
* Build sophisticated queries with multiple refinement steps
* Perfect for deep market research or complex data gathering
* Combine multiple data sources and relationships
## Writing Effective Queries
### The Basic Structure
Start every query with "Find all" and describe exactly what you need:
```text theme={null}
Find all B2B SaaS companies in Texas with revenue greater
than $10 million and less than 100 employees founded after 2020
```
### Key Success Tips
**1. Use Natural Language Comparisons**
* "greater than 50 employees"
* "less than \$5 million in revenue"
* "between 100 and 500 employees"
**2. Specify Geographic Boundaries**
* "in California"
* "headquartered in London"
* "operating in Southeast Asia"
**3. Add Revenue or Size Constraints**
* "with annual revenue greater than \$20 million"
* "having between 50 and 200 employees"
**4. Combine 2-4 Specific Criteria**
The sweet spot for queries is combining a few specific requirements
**5. Include Industry-Specific Terms**
* For tech: "SaaS", "API-first", "cloud-native"
* For finance: "Series B funded", "EBITDA positive"
* For retail: "D2C", "omnichannel", "subscription-based"
## Popular Use Cases
### Sales & Lead Generation
**B2B Decision Maker Targeting** - 100% match rate
```text theme={null}
Find all VPs of Sales at fintech companies in New York with
50-200 employees including their verified email addresses
```
### How to research competitors
**Competitor Pricing Research** - 97% match rate
```text theme={null}
Find all project management software with pricing between
$10-100 per user that offer kanban board features
```
### Recruitment & Talent Acquisition
**Executive Search** - 96% match rate
```text theme={null}
Find all Chief Technology Officers at healthcare companies
in California with more than 500 employees
```
### How to run market research
**Industry Analysis** - 95% match rate
```text theme={null}
Find all electric vehicle manufacturers in the United States
with manufacturing facilities and more than 500 employees
```
### How to research investments
**Deal Flow Discovery** - 90% match rate
```text theme={null}
Find all Series A fintech startups in Europe that raised
between $5M and $20M in the last 18 months
```
## What success looks like
| Metric | Value | Impact |
| :--------------- | :---------------- | :--------------------------------------- |
| **Time Saved** | 20-30 min/record | Weeks of research completed in minutes |
| **Data Sources** | 1,000+ per result | Comprehensive validation |
| **Match Rates** | 90-100% | Industry-leading accuracy |
| **ROI** | 10-20x | Each lead worth \$10-20 in research time |
## What to Avoid
### Overly Broad Queries
**Avoid:** "Find all companies"\
**Better:** "Find all software companies in Boston with 50 to 200 employees"
### Too Many Constraints
**Avoid:** Queries with more than 5-6 different criteria\
**Better:** Focus on 2-4 most important criteria
### Avoid vague terms
**Avoid:** "Find all big companies"\
**Better:** "Find all companies with revenue greater than \$100 million"
### Individual Person Searches Without Context
**Avoid:** "Find all people named John"\
**Better:** "Find all executives named John at Fortune 500 companies"
## FAQ
Deep Lookup provides structured, table-ready data from 1,000+ sources. Unlike search engines that give you links or AI that generates text, we deliver verified, actionable business data you can immediately use.
Skipped results indicate entities that don't match all your filter criteria. To get more results, try removing some filters or making your criteria less restrictive. Remember, you only pay for matched results, not skipped ones.
Yes! In Preview Mode, you can adjust column types, add new columns, and modify specifications before running your full research.
95%+ accuracy with full source transparency. Every data point is validated across multiple sources before delivery.
Yes! Preview Mode gives you 10 free sample records for any query to validate quality before running a full research project.
## Ready to Start Your Research?
Start with 5 free queries to experience the power of Deep Lookup. No credit card required.
[Get Started Free](https://brightdata.com/products/deep-lookup) | [View Pricing](https://docs.brightdata.com/datasets/deep-lookup/pricing) | [API Documentation](https://docs.brightdata.com/api-reference/deep-lookup)
# Deep Lookup - Pricing
Source: https://docs.brightdata.com/datasets/deep-lookup/pricing
Bright Data Deep Lookup pricing is pay-per-result: you pay only for successfully matched records. Skipped or unmatched results are free. Achieves 95% accuracy.
## Pricing Structure
**\$1.00** per matched record
* 10 enrichment columns included
* Additional columns at \$0.05 each
* 95%+ accuracy
* Source transparency
* Post-run enrichment
* Real-time data extraction
* No charge for skipped results
## How Our Pricing Works
Define your research criteria and choose enrichment columns (first 10 included)
View the worst-case cost before running - full transparency, no surprises
You're only charged for successfully matched records, typically 40-60% less than the maximum
## Understanding Skipped Results & Pricing
**Important:** You never pay for skipped results. When Deep Lookup filters out entities that don't match your criteria, those skipped records don't count toward your bill. This ensures you only pay for data that exactly meets your requirements.
## Pricing Calculator
### Example: Finding 500 B2B Leads
**Your Research:**
* Target: 500 SaaS companies
* Enrichments: 5 columns (all included)
**Results:**
* Records searched: 500
* **Actual matches: 300**
* Skipped (not charged): 200
* Maximum estimate: \$500
* **You pay: \$300**
* **You save: \$200 (40%)**
### Example: Deep Competitor Research
**Your Research:**
* Target: 200 competitors
* Enrichments: 15 columns (10 included + 5 extra)
**Results:**
* Records searched: 200
* **Actual matches: 100**
* Skipped (not charged): 100
* Maximum estimate: \$250
* **You pay: \$125**
* **You save: \$125 (50%)**
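
The arithmetic behind both examples can be sketched as follows. It assumes the list prices from the Pricing Structure section (\$1.00 per matched record, 10 columns included, \$0.05 per additional column per matched record) and no volume discount. The function name and signature are illustrative.

```python
from decimal import Decimal

PRICE_PER_RECORD = Decimal("1.00")
INCLUDED_COLUMNS = 10
PRICE_PER_EXTRA_COLUMN = Decimal("0.05")


def estimate_cost(records_searched, matched_records, enrichment_columns):
    """Return (maximum_estimate, actual_cost, savings) in USD."""
    extra_columns = max(0, enrichment_columns - INCLUDED_COLUMNS)
    per_record = PRICE_PER_RECORD + extra_columns * PRICE_PER_EXTRA_COLUMN
    maximum = per_record * records_searched  # worst case: every record matches
    actual = per_record * matched_records    # skipped records are not billed
    return maximum, actual, maximum - actual


for label, args in [("B2B leads", (500, 300, 5)), ("Competitor research", (200, 100, 15))]:
    maximum, actual, savings = estimate_cost(*args)
    print(f"{label}: max ${maximum}, pay ${actual}, save ${savings}")
```

`Decimal` is used instead of floats so that per-column charges such as \$0.05 add up exactly.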
## How volume discounts work
Scale your research with progressive savings:
| Monthly Volume | Discount | Your Price |
| :------------------- | :------- | :------------ |
| 1-1,000 records | - | \$1.00/record |
| 1,001-5,000 records | 20% off | \$0.80/record |
| 5,001-10,000 records | 30% off | \$0.70/record |
| 10,000+ records | Custom | Contact sales |
## How Deep Lookup compares for ROI
| Alternative | Cost | Time | Quality |
| :-------------- | :---------- | :------------ | :---------------- |
| Manual Research | \$30-50/row | 20-30 min/row | Variable |
| **Deep Lookup** | **\$1/row** | **Instant** | **95%+ accuracy** |
### Cost Breakdown Example
**Traditional Manual Research:**
* Analyst hourly rate: \$60-100/hour
* Time per record: 20-30 minutes
* Cost per record: \$30-50
* 100 records = \$3,000-5,000
**Deep Lookup:**
* Cost per record: \$1
* Time per record: Instant
* 100 records = \$100
* **Savings: \$2,900-4,900 (97% cost reduction)**
## What every project includes
### Every Research Project Includes
* Real-time data extraction
* 1,000+ sources per result
* Natural language queries
* CSV/JSON export
* Preview mode (10 free samples)
* Source transparency
* Post-run enrichment
### Which enrichments are popular
**Included (first 10 columns):**
* Company name & website
* Industry & description
* Location & headquarters
* Employee count
* Revenue estimates
* Contact emails
* Phone numbers
* Social profiles
* Technology stack
* Funding information
**Additional enrichments (\$0.05 each):**
* Executive names & titles
* Competitor analysis
* News mentions
* Product details
* And 100+ more options
## Frequently asked billing questions
We show the worst-case scenario for full transparency. You'll only pay for actual matches, typically 40-60% less.
You never pay for skipped results. These are entities that didn't match your filter criteria and are automatically excluded from billing.
Yes! Add new columns anytime at \$0.05 per column per matched record.
Never. You only pay for matched records. No setup fees, minimums, or charges for failed matches or skipped results.
You get 5 free queries to test Deep Lookup. Each query can search up to 100 records.
We offer a satisfaction policy. If results don't meet expectations, contact support for resolution.
## Ready to Start?
* 5 queries included
* Up to 100 records each
* No credit card required

* No minimum commitment
* Cancel anytime
* Usage-based billing

* Volume discounts
* Custom pricing
* Dedicated support
* API access
# Customization and Filtering
Source: https://docs.brightdata.com/datasets/marketplace/customization-and-filtering
Customize fields and apply filters to Bright Data Marketplace datasets (250+ domains) using the UI or API to get the data subset you need.
## How to customize fields
## How to filter a dataset
### Filtering a Dataset Using the UI
1. Start by navigating to the dataset you want to filter.
2. Click on the top right button with the filter icon.
3. A menu will appear where you can add a name for the view to help you identify the query.
4. Choose the appropriate filters under the "Include filters" section, such as country, job title, or date.
5. Click the "create subset" button to create the filtered view.
### Which filter functions are available
#### **Select**
* Use this function to select one or more exact matches from a predefined list, such as countries or regions. This is useful when dealing with a known set of values.
#### **Boolean**
* Filter by boolean values (true or false) to find specific records meeting criteria, such as verified or unverified social media profiles.
#### **Date**
* Use this filter to specify a date range, with start and end dates, useful for trend analysis or filtering events by date.
#### **Number**
* Options include:
  * **Is:** Matches exact numerical values.
  * **Not:** Excludes specified values.
  * **Exist:** Filters non-empty numerical fields.
  * **List (exact match):** Matches exact values from a list.
  * **Lower than / Lower or equal to:** Filters values below, or at and below, a specific number.
  * **Greater than / Greater or equal to:** Filters values above, or at and above, a specific number.
#### **String**
* Filter by string values using options like exact match, exclusion, list matching, or pattern inclusion to refine searches.
#### **Array**
* Use the "Array includes" filter to check if a value exists within an array field, such as specific tags.
### Creating Rule-Based Group Filters
* Create group filters by clicking on the “+Add filter” dropdown and selecting “Add group.”
* Group filters allow you to set rules, such as filtering a dataset where the category is “Electronics” and the brand is “Dell” or “Apple.”
### Limitations
* Groups cannot be nested, and a maximum of two groups per filter is allowed.\
If you require more complex queries, contact your account manager for assistance with custom queries and filters.
* Each filter group accepts up to 4 inputs.\
To filter by more values, upload them as a CSV list instead of adding each one separately. This lets you filter by many values using just one input slot.
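
The group rule and the limits above can be sketched with plain Python data structures. The list-of-tuples layout and the `validate_filter` and `matches` helpers are hypothetical and do not reflect the Filter API's request format. Combining conditions with OR inside a group and AND across groups is only one assumed reading of the “Electronics” and “Dell” or “Apple” example.

```python
MAX_GROUPS = 2            # "a maximum of two groups per filter"
MAX_INPUTS_PER_GROUP = 4  # limit of 4 inputs per filter group


def validate_filter(groups):
    """Raise ValueError if the (hypothetical) filter layout breaks a documented limit."""
    if len(groups) > MAX_GROUPS:
        raise ValueError(f"at most {MAX_GROUPS} groups are allowed, got {len(groups)}")
    for index, group in enumerate(groups):
        if any(isinstance(item, list) for item in group):
            raise ValueError(f"group {index} is nested; groups cannot be nested")
        if len(group) > MAX_INPUTS_PER_GROUP:
            raise ValueError(
                f"group {index} has {len(group)} inputs; limit is {MAX_INPUTS_PER_GROUP}"
            )


def matches(record, groups):
    """AND across groups, OR within a group (an assumed reading of the example)."""
    return all(
        any(record.get(field) == value for field, value in group)
        for group in groups
    )


electronics_dell_or_apple = [
    [("category", "Electronics")],
    [("brand", "Dell"), ("brand", "Apple")],
]
validate_filter(electronics_dell_or_apple)
print(matches({"category": "Electronics", "brand": "Apple"}, electronics_dell_or_apple))  # True
print(matches({"category": "Electronics", "brand": "HP"}, electronics_dell_or_apple))     # False
```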
# Marketplace data delivery
Source: https://docs.brightdata.com/datasets/marketplace/data-delivery-and-export
Receive Bright Data Marketplace datasets in JSON, NDJSON, CSV, XLSX or Parquet via email, API, webhooks or cloud destinations like S3 and Snowflake.
Once you purchase or subscribe to a dataset, Bright Data delivers it directly to your preferred destination. Choose from 9 delivery methods and 5 output formats to fit your existing infrastructure and workflow.
## Output formats
Datasets are available in the following formats:
| Format | Description |
| ------- | ------------------------------------------- |
| JSON | Standard structured format |
| NDJSON | Newline-delimited JSON, ideal for streaming |
| CSV | Spreadsheet-compatible format |
| XLSX | Microsoft Excel format |
| Parquet | Columnar format optimized for analytics |
You can also receive data in compressed (gzip) format to reduce file size.
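
For instance, a gzip-compressed NDJSON delivery can be read one record at a time with the Python standard library. This is a sketch: the file name and sample records are made up for illustration, and actual file names depend on your delivery settings.

```python
import gzip
import json
import os
import tempfile


def iter_ndjson(path):
    """Yield one parsed record per non-empty line of a gzip-compressed NDJSON file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Write a tiny sample file so the example is self-contained.
sample_path = os.path.join(tempfile.mkdtemp(), "sample_snapshot.ndjson.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
    handle.write('{"name": "Example Co", "employees": 120}\n')
    handle.write('{"name": "Sample Ltd", "employees": 45}\n')

for record in iter_ndjson(sample_path):
    print(record["name"], record["employees"])
```

Because each line is an independent JSON document, the file never has to be loaded into memory as a whole, which is what makes NDJSON suitable for streaming.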
## Which delivery methods are supported
Choose how and where your data is delivered:
| Method | Description |
| -------------------- | ---------------------------------------------------- |
| Email | Receive datasets directly in your inbox |
| API download | Download via the Bright Data API using a snapshot ID |
| Webhook | Get data pushed to your endpoint automatically |
| Amazon S3 | Deliver directly to your S3 bucket |
| Google Cloud Storage | Deliver to your GCS bucket |
| Google Cloud Pub/Sub | Stream data via GCP Pub/Sub |
| Microsoft Azure | Deliver to your Azure Blob Storage container |
| Snowflake | Load directly into your Snowflake data warehouse |
| SFTP | Deliver via secure file transfer protocol |
## How to set up delivery
1. After purchasing a dataset, go to **Control Panel → My Datasets**.
2. Select your dataset and click **Delivery Settings**.
3. Pick your preferred delivery method and output format.
4. Enter the destination credentials (bucket name, webhook URL, SFTP host, and so on).
5. Click **Save**. Your data is delivered automatically on each refresh.
## Related
* [Purchase options](/datasets/marketplace/purchase-options)
* [Pricing](/datasets/marketplace/pricing)
* [Filter dataset by API](/datasets/marketplace/filter-dataset-by-api)
# Dataset View
Source: https://docs.brightdata.com/datasets/marketplace/dataset-view
Browse dataset details in the Bright Data Marketplace (250+ domains), including data fields, freshness, sample downloads and purchase options.
## How to preview a data sample
* Data-fields
* Records
* Freshness
* Min order
* Download data sample: JSON, CSV, number of records (the preview displays 30 records, but the download contains 1,000 records)
* CSV/JSON
* Create a subset
* Contact us
* Purchase options
### How to customize a dataset
* Customize - Create a custom subset by hiding/showing and renaming fields and filter the dataset according to your requirements.
* Customize Fields
* Filter
## Where to view the data dictionary
* Search attribute
* Table
* Column name
* Description
* Data type
## When to use this dataset
* Case study
* Related subset
* What category it’s related to
## What this dataset covers
* Description about the dataset
* What domain we collect from
* Use case categories
* Customize dataset CTA
* Delivery options
* FAQ
## Initiate by API
* Ability to submit a request through the API or the UI to collect fresh data based on your own URLs (inputs)
* +Add input
* Upload CSV
* Creating API key to initiate a collection
* Setting the expected records
* Data Collection Choices
* Discover new records
* Specific URL collection
* Purchase options
# Dataset Marketplace FAQs
Source: https://docs.brightdata.com/datasets/marketplace/faqs
FAQs about Bright Data's Dataset Marketplace (250+ domains), including available datasets, delivery options and purchase choices for ready-to-use data.
The following is a partial list of datasets available for immediate download from the Datasets Marketplace:
**Popular Datasets**
* Amazon products
* Crunchbase companies information
* Facebook - Posts by group URL
* Github repository
* Glassdoor companies overview information
* LinkedIn company information
* LinkedIn people profiles
* LinkedIn posts
* Zillow properties listing information
The datasets are broken up by category. Here are the main categories and some of the datasets:
**eCommerce Data**
* amazon.com, amazon.co.uk, amazon.de, amazon.es, amazon.fr, amazon.in, amazon.it
* homedepot.com, homedepot.ca
* lazada.com.my, lazada.sg, lazada.vn
**Real Estate Data**
These datasets include information regarding housing data, real estate prices, rent prices etc.
* Bayut UAE Property Listings
* Booking.com Property Listings
* Dubizzle UAE Property Listings
* PropertyFinder Property Listings
* US Consumer Property
* ZoomProperty UAE Property Listings
* infocasas.com.uy
* inmuebles24.com
* metrocuadrado.com
* otodom.pl
* properati.com.co
* realestate.com.au
* toctoc.com
* zillow\.com
* zonaprop.com.ar
* zoopla.co.uk
**Social Media Data**
* facebook.com
* instagram.com
* linkedin.com
* pinterest.com
* reddit.com
* tiktok.com
* unashamedcataddicts.quora.com
* vimeo.com
* x.com
* youtube.com
**Travel Data**
* Booking.com Hotel Room Pricing and Availability
* Deliveroo Restaurant Listings
* OpenTable Restaurant Listings
* Short-Term Rental Occupancy & Pricing Dataset
* Talabat Restaurant Listings
* Tripadvisor Restaurant Listings
* Zomato UAE Restaurant Listings
* airbnb.com
**B2B Data**
* Business Contacts Dataset
* Business Firmographic Data
* Business Intelligence Dataset
* Business Location (POI) Dataset
* Companies Hierarchy Dataset
* Online Intent Data
* Politically Exposed Persons List
* Tech Install base Data Feed
* US B2B Employees
* US Consumer Demographics
* crunchbase.com
* g2.com
* glassdoor.com
* google.com
* indeed.com
* linkedin.com
* manta.com
* owler.com
* slintel.com
* stackoverflow\.com
* trustpilot.com
* ventureradar.com
* xing.com
* yelp.com
The Datasets Marketplace is continuously updated with fresh datasets. For the complete list, click on "Web Data" on the sidebar, and then on "Datasets Marketplace" on the top bar.
If the domains you need don't exist in the Marketplace, you can request them through the Custom Dataset (CDS).
Yes! You can download a few free datasets:
* espn.com - NBA data
* goodreads.com
* imdb.com
* worldpopulationreview\.com
The Datasets Marketplace is continuously updated with fresh datasets. For the complete list, click on “Web Data” on the sidebar, and then on “Datasets Marketplace” on the top bar.
The schedule run is designed to ensure timely delivery.
The delivery deadline is calculated based on previous collection cycles and the estimated refresh duration.
Therefore, the collection may start earlier than the delivery date to guarantee that the data is delivered on time.
You can find your data snapshots under the "My datasets" tab. There, you'll see a table with information about each snapshot, including its status: ready, failed, or in building.
A Snapshot ID is a unique identifier assigned to a specific data snapshot, formatted as "snap\_XXXXXX".
You should use the Snapshot ID whenever there is an issue with a particular data snapshot. Including this ID in your support ticket helps the support team quickly identify the exact snapshot in question, leading to faster issue resolution.
The Snapshot ID ensures that both you and the support team are referring to the same data set, reducing confusion and delays in addressing your problem.
You can set a record limit in two ways:
**Using the control panel:** Before purchasing a dataset, click “Proceed to purchase.” On the “Choose delivery frequency” page, select the “Too pricey? Limit dataset records” option to specify your desired record limit.
**Through the Filter API:** Add a parameter to limit the number of records returned by the API. For reference, see: [Dataset Filter API - records\_limit](https://docs.brightdata.com/api-reference/marketplace-dataset-api/filter-dataset#body-records-limit).
Currently, there is no monthly commitment or minimum order of \$250 required when using the Dataset Filter API. You only pay based on your actual record consumption.
When you submit a Dataset Filter API request, compute resources are used to identify records matching your filter criteria. If matching records are found, you are charged based on the number of matched records. If no matching records are found, you are not charged. To avoid charges while exploring your filter criteria, you can test filters through the dataset preview table in the control panel, which offers up to 10 free filters per day.
Some fields may have lower fill rates due to limitations or gaps in the publicly available source data. Fill rates vary depending on dataset type and source quality - which can result in partial coverage for specific attributes. We provide detailed fill rates and statistics for each dataset to help you evaluate completeness before purchasing.
Bright Data offers several services for accessing and managing datasets:
1. **Dataset Marketplace**: This is a centralized platform where you can discover, customize, and purchase high-quality datasets from over 250 domains. You can browse pre-built datasets across multiple domains, examine data samples, and apply advanced filters. [Explore the Dataset Marketplace here](https://brightdata.com/datasets/marketplace/browse).
2. **Dataset APIs**: These APIs allow you to request, initiate, and manage data collections. You can define parameters for new dataset collections, check the status of your requests, and download datasets using snapshot IDs. [Learn more about Dataset APIs here](https://brightdata.com/api-reference/marketplace-dataset-api/request-a-collection).
3. **Deep Lookup**: This service provides a more granular and streamlined way to request and manage data collections, facilitating effective dataset generation according to your specific needs. [Explore Deep Lookup here](https://docs.brightdata.com/datasets/deep-lookup/overview).
* By default, standard LinkedIn profile records **do not** include email addresses or phone numbers. This information is not publicly available on LinkedIn.
* However, Bright Data offers an **enriched business contact** solution (in partnership with RevenueBase) that adds business emails and phone numbers for many LinkedIn people profiles, fully GDPR-compliant and sourced via third-party validation.
* Contact data coverage may vary by profile and use case.
* In the Dataset Marketplace, after selecting "LinkedIn People Profiles", use the **Contact filters** button (on the right side of the Data sample view) to choose your contact data options:
* **Standard LinkedIn profile data:** No contact info.
* **Enriched business contact info:** Select “Standard Profiles + Enriched with Business Contact Info” or “Only Profiles with Business Contact Info” to receive available business emails and phone numbers (where provided via RevenueBase partnership and in accordance with GDPR/compliance).
* Click “Apply filter” to preview and purchase the dataset with your chosen contact enrichment.
Yes. All provided business contact data is sourced and processed according to GDPR and other compliance requirements, using approved partners such as RevenueBase.
Yes. Bright Data’s [Deep Lookup](https://brightdata.com/cp/deep-lookup) can search for people and return available business contact details (email/phone), where legally sourced and compliant. Specify your entity and required columns in the query (e.g., email, phone).
* Go to Control Panel → Dataset Marketplace → LinkedIn People Profiles.
* Click “Preview sample” to review all available fields.
* For enriched datasets, use the **Contact filters** panel as described above, and preview sample rows before placing your order.
For full details on compliance, permissible usage, and supported geographies, speak directly with your Bright Data account manager or reach out via [Support](https://brightdata.com/cp/support).
**Summary:**
* **Standard LinkedIn profiles** don’t include emails/phones.
* **Enriched business contact info** (email/phone) is available: just use the Contact filters button in the Dataset Marketplace view.
* **Deep Lookup** is another route for contact discovery.
* **Always review** the filtered sample before purchasing, and contact support for custom requirements.
The Filter API rate limit is 120 requests per hour. This applies to all API calls and snapshot triggers within the specified time frame.
Note: Plan your API calls accordingly to stay within the hourly limit. Consider implementing retry logic with exponential backoff for optimal performance.
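The retry advice above could be sketched as follows. This is a minimal illustration, not Bright Data client code: `backoffDelayMs` and `withRetry` are hypothetical helper names, and the base delay, cap, and attempt count are assumed values that you would tune against the 120 requests/hour limit.

```javascript
// Hypothetical helper: delay before retry number `attempt` (0-based).
// The delay doubles on each attempt and is capped at `capMs`.
function backoffDelayMs(attempt, baseMs = 1000, capMs = 60000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Hypothetical wrapper: runs `fn` and retries with backoff when it throws.
// `fn` is any async function you supply, for example one that sends a filter request.
async function withRetry(fn, maxAttempts = 5) {
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Sleep only if another attempt will follow.
      if (attempt < maxAttempts - 1)
        await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
    }
  }
  throw lastError;
}
```

With these assumed defaults the waits are 1 s, 2 s, 4 s, and 8 s, so a failing call gives up after roughly 15 seconds while adding at most five requests to the hourly budget.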
You can send up to 10,000 input lines in a single API request when using list filters or include filters.
Best Practice: For large datasets, consider batching your requests to stay within the 10,000 line limit while maintaining efficient processing.
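One way to batch a long list of filter values is to split it into chunks of at most 10,000 before building each request. The helper name `chunkValues` below is hypothetical, and how each chunk is placed into a request body is left out because it depends on your filter.

```javascript
const MAX_INPUT_LINES = 10000; // per-request limit stated above

// Hypothetical helper: splits `values` into arrays of at most `size` items.
function chunkValues(values, size = MAX_INPUT_LINES) {
  const batches = [];
  for (let i = 0; i < values.length; i += size)
    batches.push(values.slice(i, i + size));
  return batches;
}

// Example: 25,000 placeholder values become three batches.
const exampleValues = Array.from({ length: 25000 }, (_, i) => `value-${i}`);
const exampleBatches = chunkValues(exampleValues);
```

Each batch counts as a separate API call, so large inputs also consume more of the hourly rate limit.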
The maximum input file size is 200 MiB for any single API request.
Warning: Files exceeding 200 MiB will be rejected. Compress your data or split large files into smaller chunks before submission.
You can download snapshots up to 5 GB as a single file.
For snapshots larger than 5 GB, the API will automatically provide chunked download options or streaming capabilities to handle the data efficiently.
### Quick Reference
| Limit Type | Value | Description |
| :---------------- | :------- | :------------------------------------- |
| Rate Limit | 120/hour | Maximum API calls per hour |
| Input Lines | 10,000 | Maximum values in list/include filters |
| Input File Size | 200 MiB | Maximum size for uploaded files |
| Snapshot Download | 5 GB | Maximum size for single-file download |
**Need Higher Limits?**\
Contact our Enterprise team for custom rate limits and increased capacity options tailored to your business needs.
Snapshot IDs are returned when you [Trigger Collection](https://docs.brightdata.com/api-reference/rest-api/scraper/asynchronous-requests) (POST `/datasets/v3/trigger`), [Filter Dataset](https://docs.brightdata.com/api-reference/marketplace-dataset-api/filter-dataset-with-csv-json-files) (POST `/datasets/filter`), or via a dataset subscription. You can also [list all your snapshots](https://docs.brightdata.com/api-reference/marketplace-dataset-api/get-dataset-list) with GET `/datasets/v3/snapshots`.
> See "[Before you begin](https://docs.brightdata.com/api-reference/marketplace-dataset-api/deliver-snapshot#before-you-begin)" for more details.
The `id` in the response is a delivery job ID. Use it to track delivery progress by calling GET `/datasets/v3/delivery/{delivery_id}`. Poll until status is "done".
> See [Tracking delivery status](https://docs.brightdata.com/api-reference/marketplace-dataset-api/deliver-snapshot#tracking-delivery-status) for more details.
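
The polling step described above might look like the sketch below. It is written against an injected `fetchStatus` function rather than a specific HTTP client: `fetchStatus`, `pollUntilDone`, the 5-second interval, and the 60-attempt ceiling are all assumptions for illustration. In practice `fetchStatus` would call GET `/datasets/v3/delivery/{delivery_id}` and return the status string from the response.

```javascript
// Hypothetical helper: calls `fetchStatus` until it resolves to "done".
// Returns the number of attempts used, or throws after `maxAttempts`.
async function pollUntilDone(fetchStatus, { intervalMs = 5000, maxAttempts = 60 } = {}) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const status = await fetchStatus();
    if (status === 'done')
      return attempt;
    // Wait before the next check, but not after the final attempt.
    if (attempt < maxAttempts)
      await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Delivery not done after ${maxAttempts} attempts`);
}
```

Keep the interval generous if these status checks count toward your rate limit.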
Yes. The snapshot must be in `ready` status. Check with GET `/datasets/snapshots/{id}` before calling deliver.
> Possible statuses: `scheduled`, `building`, `ready`, `failed`.
Yes. Call this endpoint multiple times with different delivery configurations for the same snapshot ID.
json, jsonl, and csv.
Use the `batch_size` parameter to set the number of records per file. Each file (batch) must stay under 5GB. Calculate the right value by dividing 5GB by your average record size. See "Calculating batch size" above.
The most common cause is that your `batch_size` produces a file larger than 5GB. For example, if your average record size is \~5KB, a `batch_size` of 1,000,000 produces a \~5GB file that may exceed the limit. Lower your `batch_size` (e.g., to 100,000) and retry.
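The batch size arithmetic can be sketched as a small helper. The name `suggestBatchSize` and the 0.8 headroom factor are assumptions, and 5GB is read here as decimal gigabytes (5,000,000,000 bytes).

```javascript
const FILE_LIMIT_BYTES = 5e9; // 5GB per delivered file, read as decimal bytes (assumption)

// Hypothetical helper: largest batch_size that keeps a file under the limit.
// `headroom` leaves room for records that are larger than the average.
function suggestBatchSize(avgRecordBytes, headroom = 0.8) {
  return Math.floor((FILE_LIMIT_BYTES * headroom) / avgRecordBytes);
}

const atLimit = suggestBatchSize(5000, 1);  // 1,000,000 records: right at the limit
const withHeadroom = suggestBatchSize(5000); // 800,000 records: leaves 20% spare
```

A value computed with no headroom sits exactly at the threshold, which is why a `batch_size` of 1,000,000 at about 5KB per record can fail.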
Yes. Set `compress: true` to receive gzip-compressed files.
5GB. This is a hard limit per delivered file. Use `batch_size` to control how many records go into each file and ensure each stays under this threshold.
# Marketplace fill rates
Source: https://docs.brightdata.com/datasets/marketplace/fill-rates-and-statistics
Review per-field fill rates, record counts and data freshness on the Statistics tab of any Bright Data Marketplace dataset (250+ domains) before purchase.
Every Bright Data Marketplace dataset page exposes a Statistics tab with field-level fill rates, the total record count and a freshness indicator. Review these before purchase to confirm the dataset meets your requirements.
## What are fill rates?
A fill rate is the percentage of records in a dataset where a specific field is populated. A fill rate of 100 percent means every record contains a value for that field. A lower fill rate means some records are missing values for that field.
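As a rough illustration of the definition, the sketch below computes a fill rate over a small sample. The `fillRate` helper and the sample records are hypothetical, and treating `null`, `undefined`, and empty strings as "not populated" is an assumption; the Marketplace may count populated values differently.

```javascript
// Hypothetical helper: percentage of records where `field` has a value.
function fillRate(records, field) {
  if (records.length === 0)
    return 0;
  const populated = records.filter((record) => {
    const value = record[field];
    return value !== null && value !== undefined && value !== '';
  }).length;
  return (populated / records.length) * 100;
}

// Made-up sample: every record has a name, half have an email.
const sampleRecords = [
  { name: 'Ada', email: 'ada@example.com' },
  { name: 'Bob', email: null },
  { name: 'Cy' },
  { name: 'Dee', email: 'dee@example.com' },
];
const nameFill = fillRate(sampleRecords, 'name');   // 100
const emailFill = fillRate(sampleRecords, 'email'); // 50
```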
Fill rates vary depending on:
* The dataset type (social media, B2B, eCommerce)
* The source quality and what is publicly available
* The specific field being measured (email, phone number, location)
## Where to find fill rates
Fill rates and statistics are shown on each dataset's page in the Marketplace.
1. In the Bright Data Control Panel, go to **Dataset Marketplace**.
2. Open the dataset whose coverage you want to review.
3. Navigate to the **Statistics** tab to view field-level fill rates and record counts.
## What statistics are available?
For each dataset you can review:
* **Fill rate per field**: percentage of records with a value
* **Record count**: total number of available records
* **Freshness**: how recently the data was collected
## Why do some fields have low fill rates?
Some fields have lower fill rates because of limitations or gaps in the publicly available source data. For example, phone numbers and email addresses are not always publicly listed, so contact fields commonly show lower fill rates than name or location fields.
Always review fill rates before purchasing to confirm the data meets your specific requirements.
# Filter dataset by API
Source: https://docs.brightdata.com/datasets/marketplace/filter-dataset-by-api
Filter Bright Data Marketplace datasets programmatically using the Filter Dataset API. See the full API reference for parameters and examples.
See [Filter Dataset](/api-reference/marketplace-dataset-api/filter-dataset) API Reference page
# Dataset Marketplace Overview
Source: https://docs.brightdata.com/datasets/marketplace/overview
Browse and buy 350+ ready-to-use datasets from 250+ domains. Updated on a schedule, customizable by field and delivered in your preferred format.
## Introduction to the Dataset Marketplace
The Dataset Marketplace is a one-stop platform for discovering, customizing and purchasing high-quality datasets from over 250 domains. Datasets are regularly updated and ethically sourced from publicly available information, ensuring both accuracy and compliance. Whether you need ready-to-use or on-demand solutions covering social media, real estate, B2B data or AI training, flexible purchase options and field-level customization give you the exact data you need.
## Get started
Browse a wide range of pre-built, validated datasets across multiple domains.
Examine data samples, data fields and customization options before purchase.
Refine datasets with advanced filters, custom field selection and rule-based subsets.
Preview and download a sample of any dataset in multiple formats.
Review data completeness and field coverage to ensure accuracy.
Choose between one-time purchases, subscriptions and flexible data delivery models.
Understand per-record costs, compute fees and available pricing plans.
Learn how to receive datasets via email, API, webhooks and cloud services including S3, Azure, Google Cloud, Snowflake and SFTP.
# Pricing
Source: https://docs.brightdata.com/datasets/marketplace/pricing
Understand how Bright Data Marketplace datasets are priced, including per-record costs and compute fees for pre-collected and fresh records.
* Pre-collected records
* Fresh records
* Pricing is determined based on compute cost and record cost to reach and collect the data.
For delivery methods and output formats, see [Data delivery and export](/datasets/marketplace/data-delivery-and-export).
# Purchase Options
Source: https://docs.brightdata.com/datasets/marketplace/purchase-options
Compare 2 Bright Data Marketplace purchase options: one-time or subscription, pre-collected or fresh data, with supported delivery destinations.
* Purchase options → Flexibility
* One-time purchase
* Subscription
* Select data type
* Pre-collected data - Access data that was collected recently and is ready for use.
* Fresh data - Access up-to-date data, available immediately after it is collected.
* Data distribution - 544M records available spanning from Apr 12 2021, to Feb 27 2024
* Specific time range - Access data from a specific time range (counted back from today).
* Available only for pre-collected.
* Select Period
* Collect fresh data for the rest of the records
* Records outside of the specified time range will be scraped when the request is processed and will incur an additional fee. The refresh time depends on the size of the dataset. You'll get an email when your snapshot is ready.
* As the data changes, freshly scraped records may no longer match your filters. Records that no longer match are removed from your dataset, but you are still charged for the initial scrape.
* Set record limit - Enter the number of records you would like. Records are selected randomly. Notice that the price will change according to the record count.
* Order
* Dataset name - “LinkedIn people profile dataset”
* Total amount of records - 544M records
* “Records” - cost of total records
* Number of records - 544M records
* Cost per record - \$0.0006
* Subtotal - (before VAT or sales tax)
* Total
* Price adjustment (max snapshot cost)
* Small dataset fee (min snapshot cost)
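The records line in the order summary is the number of records multiplied by the cost per record. A minimal sketch of that arithmetic, using the example figures above; `recordSubtotal` is a hypothetical helper, and the price adjustment, small dataset fee, and taxes are not modelled because their values are not stated here.

```javascript
// Hypothetical helper: record cost before tax, fees, and price adjustments.
function recordSubtotal(recordCount, costPerRecord) {
  // Round to cents to avoid floating-point noise in the product.
  return Math.round(recordCount * costPerRecord * 100) / 100;
}

const fullDataset = recordSubtotal(544e6, 0.0006); // 326400
const limitedOrder = recordSubtotal(1000, 0.0006); // 0.6
```

Setting a record limit lowers the records line proportionally, although a small order may then be subject to the minimum snapshot cost.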
# Build a scraper with the AI Agent
Source: https://docs.brightdata.com/datasets/scraper-studio/ai-agent
Use Bright Data Scraper Studio's AI Agent to build a custom web scraper from a natural-language prompt in about 10 minutes, no coding required.
Bright Data Scraper Studio's AI Agent builds a custom web scraper from a natural-language description. Give it a target URL and describe the data you want; the agent generates an output schema, writes the scraper code, and hands you a runnable scraper you can trigger by API, manually, or on a schedule. This tutorial walks through the full flow.
**Time to complete:** about 10 minutes
## Prerequisites
* A Bright Data account ([sign up free](https://brightdata.com/?hs_signup=1\&utm_source=docs))
* The URL of the website you want to scrape
## Build your first scraper with the AI Agent
In the Bright Data control panel, click **Scrapers** in the left menu and open **Scraper Studio**.
Paste the URL of the page you want to scrape into the chat input. Along with the URL, add any context that helps the AI build a more accurate scraper on the first try. The more context you provide, the better the generated code.
Useful context to include:
* **Specific fields** you need: "I need price, title, and stock status"
* **Where the data lives** on the page: "prices are in the product detail panel, not the listing page"
* **Actions required** to reach the data: "click 'Show more' to load full descriptions"
* **CSS selectors**, if you know them: `.product-price span.amount`
* **Page load behavior**, if the site is slow or lazy-loads content: "results load dynamically, give it extra time"
> **Expected result:** the AI Agent acknowledges the URL and may ask one or two clarifying questions about the data you want.
Respond in plain language.
> **Expected result:** the AI Agent generates a schema, a structured list of fields with data types that will become your scraper's output.
Read through the generated schema. You have four options:
* **Approve:** click Approve to accept the schema as-is
* **Decline:** type feedback in the chat (for example, "Remove the image field and add a rating field") and the AI regenerates the schema
* **Edit inline:** modify the schema directly without going back to the chat
* **Upload your own schema:** bring your own schema file; download the example file to see the correct format
Inline editing options:
* **Edit a field** (pencil icon): change a field's name or data type
* **Delete a field** (trash icon): remove fields you do not need
* **Add a field** (plus button): add new fields to the schema
Additional controls:
* **Start from scratch:** clears every field so you can build the schema manually from an empty state
* **Reset the schema:** discards inline changes and returns to the original AI-generated schema
Click **Approve** when you are ready.
> **Expected result:** once approved, the AI Agent starts generating the scraper code.
The AI writes the full scraper, including extraction logic, navigation handling, data validation, and error handling. This takes a few minutes.
> **Expected result:** a confirmation popup appears indicating your scraper is ready.
Click **Try it out** to open the Initiate Manually page. Review the collection settings and click **Start** to begin data collection.
You can also choose an alternative initiation method:
* **[Initiate by API](/api-reference/scraper-studio-api/ai-flow/overview):** trigger the scraper programmatically without opening the control panel
* **Schedule:** run the scraper on a daily, weekly, or custom interval
> **Expected result:** the scraper collects data. Monitor progress from the Runs dashboard and download results in JSON, NDJSON, CSV, or XLSX once the job completes.
After the scraper is ready, use the [Self-Healing tool](/datasets/scraper-studio/self-healing-tool) to make plain-language changes, or open the code in the Bright Data Scraper Studio IDE to edit it directly.
## What can the AI Agent build?
The AI Agent creates scrapers based on a specific input type and collection goal. It does not crawl an entire domain: passing a homepage URL and asking it to "scrape everything" will not produce useful results. Choose the scraper type that matches your data shape.
### 1. Product page (PDP) scraper
You provide a list of product page URLs. The scraper visits each URL and extracts product-level data (title, price, description, images).
> **Use when:** you already have the URLs of the specific pages you want to scrape.
### 2. Discovery scraper
You provide a category page or listing page URL. The scraper collects data directly from the listing (titles, prices, ratings), without visiting individual product pages.
> **Use when:** you need an overview of items from a category or search results page, and you do not need full product-page detail.
### 3. Discovery + PDP scraper
You provide a category or listing page URL. The scraper first discovers all product URLs on the page, then visits each product page to collect full detail.
> **Use when:** you need complete product data from an entire category, not just the fields visible on the listing page.
### 4. Search scraper
You provide a search keyword. The AI Agent creates either a Discovery or Discovery + PDP scraper based on your stated requirements: it first finds results for the keyword, then collects data from them.
> **Use when:** you do not have specific URLs and want to collect data from a search term.
## Frequently asked questions
Yes. Every scraper the AI Agent generates can be opened in the Bright Data Scraper Studio IDE and edited directly. If you prefer not to write code, use the [Self-Healing tool](/datasets/scraper-studio/self-healing-tool) to request changes in plain language.
The AI Agent generates scrapers that run on Bright Data's proxy and unblocking infrastructure, which handles most anti-bot defenses. For sites that require a logged-in session, build the scraper in the IDE and use `set_session_cookie()` or the authentication pattern that matches the target site.
The AI Agent relies on the context you give it. If the output is off, decline the schema and add more specifics about field names, selectors, or the exact page section where the data lives. You can also use the [Self-Healing tool](/datasets/scraper-studio/self-healing-tool) to refine a generated scraper after the fact.
## Related
Build a scraper by writing JavaScript directly
Update a generated scraper with plain-language prompts
# Web scraping basics in Scraper Studio
Source: https://docs.brightdata.com/datasets/scraper-studio/basics-of-web-scraping
Learn core web scraping concepts used in Bright Data Scraper Studio: navigation, parsing, 2 worker types, stages and handling scale-related blocking.
Every scraper you build in Bright Data Scraper Studio is made of two parts: **interaction code** that navigates the target site and **parser code** that extracts structured data from the resulting HTML. This page walks through the core concepts so you can read, write, and debug scrapers with confidence.
## Prerequisites
* Basic JavaScript familiarity (variables, functions, async control flow)
* An active [Bright Data account](https://brightdata.com/)
## What are the two phases of a scraper?
A Bright Data Scraper Studio scraper runs in two phases per page:
1. **Interaction** moves through the site to reach the data. That means sending GET or POST requests, following links, handling pagination, submitting forms, and, on a Browser worker, clicking, typing, scrolling, and waiting for elements to appear.
2. **Parsing** reads the page HTML (or a captured JSON payload) and returns a structured record.
Call `parse()` from interaction code once the target page has loaded. That runs the parser code and returns its result. Then call `collect()` to append the record to the final dataset:
```js theme={null}
let data = parse();
collect({
  url: new URL(location.href),
  title: data.title,
  links: data.links,
});
```
The parser code itself uses Cheerio, a jQuery-like API, to extract fields:
```js theme={null}
return {
  title: $('h1').text().trim(),
  links: $('a').toArray().map(e => new URL($(e).attr('href'))),
};
```
## How do I structure a multi-stage scraper?
Many scrapes need more than one hop, for example "visit a search page, then follow each result URL, then extract each product". Bright Data Scraper Studio handles this with stages. Each stage is a separate browser session, and `next_stage({...})` queues a new input for the next stage.
The example below scrapes an ecommerce search across all result pages, following each listing to its detail page.
**Stage 1, fan out search result pages:**
```js theme={null}
// Encode the keyword so spaces and characters like & do not break the query string
let search_url = `https://example.com/search?q=${encodeURIComponent(input.keyword)}`;
navigate(search_url);
let max_page = parse().max_page;
for (let i = 1; i <= max_page; i++) {
  let search_page = new URL(search_url);
  if (i > 1)
    search_page.searchParams.set('page', i);
  next_stage({search_page});
}
```
**Stage 2, fan out listing URLs from each result page:**
```js theme={null}
navigate(input.search_page);
let listings = parse().listings;
for (let listing_url of listings)
  next_stage({listing_url});
```
**Stage 3, collect the final product record:**
```js theme={null}
navigate(input.listing_url);
collect(parse());
```
The flow:
1. Stage 1 navigates to the search page and parses out the total number of pages.
2. Stage 1 calls `next_stage({search_page})` once per result page. Each call becomes a new stage-2 input.
3. Stage 2 navigates to each result page and parses out all listing URLs.
4. Stage 2 calls `next_stage({listing_url})` once per listing. Each call becomes a new stage-3 input.
5. Stage 3 navigates to each product page and calls `collect(parse())` to add the record to the dataset.
Bright Data Scraper Studio parallelizes stages across workers automatically, so fanning out with `next_stage()` is much faster than walking pagination serially inside one stage.
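To see what Stage 1 queues, the same loop can be run as plain JavaScript outside Scraper Studio. The sketch below is only an illustration: `buildSearchPages` is a hypothetical helper that returns the URLs as strings, whereas the real scraper passes each one to `next_stage({search_page})`.

```javascript
// Plain JavaScript version of the Stage 1 loop: one URL per result page.
function buildSearchPages(searchUrl, maxPage) {
  const pages = [];
  for (let i = 1; i <= maxPage; i++) {
    const page = new URL(searchUrl);
    // Page 1 is the original search URL; later pages add ?page=N.
    if (i > 1)
      page.searchParams.set('page', i);
    pages.push(page.href);
  }
  return pages;
}

const searchPages = buildSearchPages('https://example.com/search?q=shoes', 3);
// [
//   'https://example.com/search?q=shoes',
//   'https://example.com/search?q=shoes&page=2',
//   'https://example.com/search?q=shoes&page=3'
// ]
```

Each entry becomes an independent stage-2 input, which is what allows the result pages to be processed in parallel.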
## Which worker type should I use?
Bright Data Scraper Studio offers two worker types:
* **Browser worker**: a real headless browser. Needed when the page renders data with JavaScript, or when you need to click, scroll, type, or capture network traffic.
* **Code worker**: raw HTTP requests. Faster and cheaper, but cannot run JavaScript or interact with the page.
Start with Code worker. Switch to Browser worker only if the data you need is not in the raw HTTP response. You can change worker type on the same scraper at any time, but browser-only functions (`wait`, `click`, `scroll_*`, `tag_*`, `type`, and more) will throw errors if you run them on a Code worker. See [Worker types](/datasets/scraper-studio/worker-types) for the full list.
## How does Scraper Studio handle blocking and CAPTCHAs?
Scraping at scale runs into the same defenses every time: IP blocks, rate limits, CAPTCHAs, fingerprinting, and bot detection. Bright Data Scraper Studio runs every request through Bright Data's [proxy infrastructure](/proxy-networks/introduction) and [Web Unlocker API](/scraping-automation/web-unlocker), so the scraper:
* Rotates through residential, ISP, datacenter, or mobile IPs based on scraper settings
* Retries blocked requests with a fresh peer session automatically
* Solves common CAPTCHAs when you call `solve_captcha()`
* Mimics real browser fingerprints on Browser worker
You do not manage proxies, sessions, or retries yourself. Focus the scraper code on extracting the data you need and let Bright Data Scraper Studio handle access.
## Related
Choose between Browser worker and Code worker
Full reference for interaction and parser commands
Step-by-step walkthrough of building a scraper in the IDE
Recommended patterns for fast, reliable scrapers
# Scraper Studio IDE best practices
Source: https://docs.brightdata.com/datasets/scraper-studio/best-practices
Bright Data Scraper Studio IDE best practices for dead page detection, request batching, pagination, popups, timeouts, retries and parser code in 8 sections.
This guide shows the coding patterns the Bright Data Scraper Studio team recommends for writing fast, reliable scrapers in the IDE. Each section contrasts a common mistake with the preferred pattern and explains the reason.
## How do I detect dead pages reliably?
When using `navigate()`, add a `dead_page()` condition so the scraper does not retry pages that do not exist. Bright Data Scraper Studio automatically marks HTTP 404 responses as dead, but many sites return 200 with a "not found" template, so you must check for that yourself.
Do not wrap `wait()` in a `try/catch` and call `dead_page()` from the `catch` block. A thrown `wait()` only tells you the selector did not appear within the timeout, not that the page is actually dead.
```js Bad theme={null}
try {
  // Waits 30 seconds for 'ok-selector' even if the page is already dead
  wait('ok-selector');
} catch (e) {
  // You cannot prove the page is dead from a wait timeout alone
  dead_page("Page doesn't exist");
}
```
```js Good theme={null}
wait('ok-selector, 404-selector');
if (el_exists('404-selector'))
  dead_page();
```
## How do I minimize requests to the browser?
Interaction commands like `click`, `type`, `el_exists`, `el_is_visible`, `wait`, and `wait_visible` each send a request to the browser. Combine selectors into a single call instead of chaining several calls.
```js Bad theme={null}
if (!(el_exists('#price1') || el_exists('#price2')
    || el_exists('#price3') || el_exists('#discount')))
{
  dead_page('No price found');
}
```
```js Good theme={null}
if (!el_exists('#price1, #price2, #price3, #discount'))
  dead_page('No price found');
```
## How do I paginate without blocking parallelization?
When a site has paginated results and you want data from every page, call `rerun_stage()` once from the root page for every page you need. Do not call `rerun_stage()` from inside each page as you walk the pagination: that serializes the work and Bright Data Scraper Studio cannot parallelize the requests.
```js Bad theme={null}
navigate(input.url);
let $ = html_load(html());
let next_page_url = $('.next_page').attr('href');
rerun_stage({url: next_page_url});
```
```js Good theme={null}
let url = new URL(input.url);
if (input.page)
  url.searchParams.set('page', input.page);
navigate(url);
// input.page only exists when this stage was re-run for a specific page.
// On the root page it's undefined, so we fall through to fan out.
if (input.page)
  return;
let $ = html_load(html());
let total_products = +$('.total_products').text();
// Assumes 20 products per page, capped at 50 pages
let total_pages = Math.ceil(total_products / 20);
total_pages = Math.min(total_pages, 50);
for (let page = 2; page <= total_pages; page++)
  rerun_stage({url: input.url, page});
```
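The page arithmetic in the fan-out above can be checked on its own. In this sketch `pagesToFanOut` is a hypothetical helper, and the 20-items-per-page and 50-page cap mirror the values in the example rather than anything Scraper Studio enforces.

```javascript
// Hypothetical helper: page numbers the root page would pass to rerun_stage().
// Page 1 is the root page itself, so the list starts at 2.
function pagesToFanOut(totalProducts, perPage = 20, maxPages = 50) {
  const totalPages = Math.min(Math.ceil(totalProducts / perPage), maxPages);
  const pages = [];
  for (let page = 2; page <= totalPages; page++)
    pages.push(page);
  return pages;
}

const smallCatalog = pagesToFanOut(95);   // [2, 3, 4, 5]
const largeCatalog = pagesToFanOut(5000); // pages 2 through 50 (capped)
const singlePage = pagesToFanOut(10);     // []
```

Because every page number is known up front, all `rerun_stage()` calls are issued from the root page and none depends on another finishing first.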
## How do I close popups without waiting for them?
Use `close_popup('popup_selector', 'close_button_selector')` to register a background watcher that closes popups whenever they appear. Do not poll for a popup with `wait_visible()` before each interaction: popups can appear at any time, and explicit waits add latency on every step.
```js Bad theme={null}
navigate('https://example.com');
try {
  wait_visible('.cky-btn-accept', {timeout: 5000});
  click('.cky-btn-accept');
} catch (e) {
  console.log('Accept cookies button does not exist, continue');
}
```
```js Good theme={null}
// Runs in the background with no per-step latency.
// The watcher checks for the popup before every interaction automatically.
close_popup('.cky-btn-accept', '.cky-btn-accept');
navigate('https://example.com');
click('.open-product-full-info');
```
## How do I wait for a tagged response before parsing?
When you use `tag_response()` to capture a background API call, follow it with `wait_for_parser_value()` to make sure the request has finished before you read `parser`. Without the wait, the parser may run before the response has arrived, and the tagged field on `parser` will be `undefined`.
```js Bad theme={null}
tag_response('product', /api\/product/);
navigate('https://example.com');
// Parser code:
// The request may not have finished yet; product could be undefined
let {product} = parser;
return product.data;
```
```js Good theme={null}
tag_response('product', /api\/product/);
navigate('https://example.com');
wait_for_parser_value('product');
// Parser code:
let {product} = parser;
return product.data;
```
```js Good (chained navigation) theme={null}
tag_response('product', /api\/product/);
navigate('https://example.com');
// wait_for_parser_value returns the value so you can use it in interaction code
let product = wait_for_parser_value('product');
navigate(product.reviews_url);
tag_html('reviews_html');
// Parser code:
let {product, reviews_html} = parser;
let $ = html_load(reviews_html);
let reviews = $('.review').toArray().map(v => $(v).text());
return {
  ...product.data,
  reviews,
};
```
## Should I throw custom error messages?
No. Let built-in errors from Bright Data Scraper Studio bubble up. They include the selector, the timeout, and the stage, which is more useful than a hand-written "Page not loaded properly". Only throw a custom error when you are checking a domain-specific condition that Bright Data Scraper Studio cannot detect on its own, such as a missing product title.
```js Bad theme={null}
try {
  wait('selector1');
  // some code
  wait('selector2');
  // some code
} catch (e) {
  throw "Page not loaded properly";
}
```
```js Good theme={null}
// Crawler error: waiting for selector "selector1" failed: timeout 30000ms exceeded
wait('selector1');
// some code
wait('selector2');
// some code
```
```js Good (domain-specific check) theme={null}
if (!el_exists('.product-title'))
throw new Error('Failed to load product page');
```
## How do I handle slow websites without over-extending timeouts?
Keep the default 30-second timeout for most waits. If a specific page is consistently slow, raise it to 45 or 60 seconds. Do not push beyond 60 seconds: a slower peer is usually the cause, and Bright Data Scraper Studio automatically retries with a fresh peer session when a page reports a timeout error.
```js Bad theme={null}
// 120 seconds is too long; the platform cannot recycle a stuck peer
wait('selector', {timeout: 120000});
```
```js Good theme={null}
wait('selector'); // default 30 seconds
wait('selector', {timeout: 45000}); // 45 seconds for slightly slow pages
wait('selector', {timeout: 60000}); // 60 seconds for consistently slow pages
```
## Should I build my own retry loop?
No. Bright Data Scraper Studio handles retries at the job level with a new peer session. A custom retry loop inside your scraper reuses the same session, which is the reason the first attempt failed. Report the error and let Bright Data Scraper Studio retry.
```js Bad theme={null}
let counter = input.counter || 5;
while (counter > 1) {
  try {
    wait('selector', {timeout: 500});
    click('selector');
    type('selector');
    // some code
    break;
  } catch (e) {
    // rerun_stage creates a new session but this pattern spends extra CPM
    return rerun_stage({...input, counter: --counter});
  }
}
```
```js Good theme={null}
navigate('https://example.com');
wait('h1');
```
## Should I wrap parser expressions in try/catch?
No. Use optional chaining (`?.`) and nullish coalescing (`??`) instead. A silent `try/catch` around a property access hides real bugs, and a `try/catch` around a `wait()` wastes browser time.
```js Bad theme={null}
try {
  const example = obj.prop;
} catch (e) {}
```
```js Bad theme={null}
// Wasting browser time for no reason
try { wait('selector'); } catch (e) {}
try { wait_network_idle({timeout: 8000}); } catch (e) {}
try { wait_page_idle(); } catch (e) {}
```
```js Good theme={null}
// `obj` may be null or undefined; ?. returns undefined instead of throwing
const example = obj?.prop;
const example2 = obj?.prop ?? undefined;
const example3 = obj?.prop ? obj.prop : undefined;
```
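A slightly fuller, self-contained sketch of the same idea, using made-up product objects. `readAmount` and the object shapes are hypothetical; the operators themselves are standard JavaScript.

```javascript
// Made-up inputs: one complete record, one with a missing branch.
const withPrice = { offer: { price: { amount: 19.99 } } };
const withoutOffer = { offer: null };

// ?. stops at the first null or undefined link and yields undefined;
// ?? then supplies the fallback without replacing valid falsy values like 0.
function readAmount(product) {
  return product?.offer?.price?.amount ?? null;
}

const found = readAmount(withPrice);       // 19.99
const missing = readAmount(withoutOffer);  // null
const noRecord = readAmount(undefined);    // null
const free = readAmount({ offer: { price: { amount: 0 } } }); // 0
```

Using `??` rather than `||` matters for the last case: `||` would turn a legitimate price of `0` into the fallback.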
## How do I extract values from a set of elements in parser code?
Use `toArray().map()` instead of `each()`. It is shorter, returns a real array, and reads as a single expression.
```js Bad theme={null}
const links = [];
$('.card.product-wrapper').each(function (i, el) {
  links.push({url: $(this).find('h4 a').attr('href')});
});
return links;
```
```js Good theme={null}
const links = $('.card.product-wrapper').toArray().map(v => ({
url: $(v).find('h4 a').attr('href'),
}));
```
## How do I normalize text in parser code?
Call `$(selector).text_sane()`. Bright Data Scraper Studio adds this custom method to the Cheerio prototype: it collapses every run of whitespace to a single space and trims the result. For numeric extraction, strip non-digits with a regex.
```js Bad theme={null}
$.prototype.clearText = function () {
return this.text().replace(/\s+/g, ' ').trim();
}
```
```js Good theme={null}
let name = $('a').text_sane();
// For digits-only extraction:
let value = +$('a').text().replace(/\D+/g, '');
```
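To see what these expressions do to real strings, here is a plain-JavaScript sketch that runs outside Scraper Studio. `normalize_text` mirrors the behavior described for `text_sane()`, and `parse_decimal` is a hypothetical variant; neither is a platform function. Note that the digits-only regex also removes the decimal point.

```js
// Plain-JavaScript equivalent of the normalization described above:
// collapse every run of whitespace to one space, then trim.
function normalize_text(raw) {
  return raw.replace(/\s+/g, ' ').trim();
}

// Digits-only extraction, as in the "Good" example. /\D+/g removes every
// non-digit, including the decimal point.
function digits_only(raw) {
  return +raw.replace(/\D+/g, '');
}

// Hypothetical variant that keeps the decimal point. It assumes "." is the
// decimal separator and "," is a thousands separator.
function parse_decimal(raw) {
  return parseFloat(raw.replace(/[^\d.]+/g, ''));
}

const name = normalize_text('  Desk \n\t lamp  '); // 'Desk lamp'
const without_point = digits_only('$1,299.99');    // 129999
const price = parse_decimal('$1,299.99');          // 1299.99
```

Use the digits-only form for integers such as review counts, and a decimal-aware form when the source text contains fractional values.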
# Develop a scraper with the IDE
Source: https://docs.brightdata.com/datasets/scraper-studio/develop-a-scraper
Build a custom web scraper in the Bright Data Scraper Studio IDE in 5 steps: write interaction and parser code in JavaScript, preview output and save.
This guide walks through building a custom web scraper in the Bright Data Scraper Studio IDE from scratch. You will write interaction code that navigates the target site, parser code that extracts structured fields, and then save the scraper to production and configure delivery. By the end, you will have a runnable scraper you can trigger by API, manually, or on a schedule.
**Time to complete:** about 15 to 30 minutes per scraper, depending on site complexity.
## Prerequisites
* An active [Bright Data account](https://brightdata.com/?hs_signup=1\&utm_source=docs) with access to Scraper Studio
* Basic JavaScript familiarity (variables, functions, control flow)
* A target URL you want to scrape
If you prefer describing the scraper in plain language instead of writing code, use the [Scraper Studio AI Agent](/datasets/scraper-studio/ai-agent). The agent generates the same kind of scraper the IDE would produce.
## How do I build a scraper in the IDE?
Go to [brightdata.com/cp/scrapers](https://brightdata.com/cp/scrapers), click **Scraper Studio**, then click **Develop a web scraper (IDE)** to open an empty scraper.
Choose a template from the **Templates** panel if your target site has a matching starter, or start from a blank scraper. Templates are pre-built scrapers for common patterns and sites; they are a fast way to learn the idioms Bright Data Scraper Studio expects.
Interaction code navigates the target site and hands each loaded page to the parser code. Write it in the **Interaction code** editor on the left.
A minimal interaction script:
```js theme={null}
navigate(input.url);
wait('.product-title');
let data = parse();
collect(data);
```
For a multi-page scrape, fan out with `next_stage()`:
```js theme={null}
navigate(input.url);
wait('.listing');
let listings = parse().listings;
for (let url of listings)
next_stage({url});
```
See [Scraper Studio functions](/datasets/scraper-studio/functions) for every interaction command.
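Listing pages often contain relative or repeated links, so it can help to normalize the list before fanning out. The sketch below is one possible approach, not a platform feature: `unique_absolute_urls` is a hypothetical helper, and `next_stage` is replaced by a local stand-in so the code runs outside Scraper Studio.

```js
// Hypothetical helper: resolve relative hrefs against the page URL and
// drop empty values and duplicates before queuing the next stage.
function unique_absolute_urls(hrefs, base_url) {
  const seen = new Set();
  for (const href of hrefs) {
    if (!href) continue;                     // skip missing attributes
    seen.add(new URL(href, base_url).href);  // resolves relative links
  }
  return [...seen];
}

// Stand-in for the platform's next_stage(), so the sketch runs anywhere.
const queued = [];
const next_stage = input => queued.push(input);

const listings = ['/p/1', '/p/2', '/p/1', undefined, 'https://example.com/p/3'];
for (const url of unique_absolute_urls(listings, 'https://example.com/category'))
  next_stage({url});
```

With the sample list above, three inputs are queued instead of five, and each one is an absolute URL.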
Parser code reads the HTML of the loaded page and returns a structured record. Use Cheerio's jQuery-like `$` selector.
```js theme={null}
return {
title: $('h1').text_sane(),
price: new Money(+$('.price').text().replace(/\D+/g, ''), 'USD'),
image: new Image($('img.product').attr('src')),
listings: $('.listing a').toArray().map(el => $(el).attr('href')),
};
```
Parser code returns data to whichever interaction function called `parse()`. See [Scraper Studio functions](/datasets/scraper-studio/functions#parser-functions) for the parser helpers Bright Data Scraper Studio provides.
In the **Settings** panel, pick the worker type:
* **Code worker** (faster, cheaper): for static HTML pages and public JSON endpoints
* **Browser worker**: for JavaScript-rendered pages, clicks, scrolling, popups, or captured background traffic
Start with Code worker. Switch to Browser worker if you need any function from the [browser-only list](/datasets/scraper-studio/worker-types#which-functions-are-browser-worker-only).
Click the **Preview** button to run the scraper against a single test input. The results appear in the **Output** tab. Use the **Run log** and **Browser network** tabs to debug failed runs.
> **Expected result:** the Output tab shows a structured record with the fields defined in your parser code.
Click **Save to Production** in the top-right corner. The scraper appears under **My Scrapers** in the control panel and can be triggered by API, manually, or on a schedule.
Open the scraper in **My Scrapers**, click **Delivery preferences**, and choose a destination (API download, webhook, S3, GCS, Azure, SFTP, or email) and a format (JSON, NDJSON, CSV, XLSX). See [Initiate collection and delivery](/datasets/scraper-studio/initiate-collection-and-delivery-options) for every option.
Trigger the first production run. Pick the initiation method that matches your workflow:
* [Initiate by API](/datasets/scraper-studio/initiate-collection-and-delivery-options#how-do-i-trigger-a-scraper-run)
* [Initiate manually](/datasets/scraper-studio/initiate-collection-and-delivery-options#how-do-i-trigger-a-scraper-run)
* [Schedule a scraper](/datasets/scraper-studio/initiate-collection-and-delivery-options#how-do-i-trigger-a-scraper-run)
## Frequently asked questions
Open the scraper in the Bright Data Scraper Studio IDE and check the **Last errors** tab. Every failed input is stored with its exact error message and error code (up to the most recent 1,000 failures). Re-run the failing input from the IDE to reproduce the problem locally, fix the interaction or parser code, and save a new production version.
Every scraper in Bright Data Scraper Studio, regardless of how it was created, can be opened and edited in the IDE. You can change extraction logic, tweak selectors, add or remove output fields, and change the worker type.
To add output fields, either click **Edit Schema** in the IDE's output schema panel and add them there, or return them from parser code; Bright Data Scraper Studio then prompts you to update the schema when you save to production.
Use `collect()` to append one record at a time; it is the default way to emit data. Use `set_lines()` when you are collecting records progressively and want the most recent snapshot delivered even if a later step throws an error. Every call to `set_lines()` overrides the previous one. See [collect](/datasets/scraper-studio/functions#collect-append-a-record-to-the-dataset) and [set\_lines](/datasets/scraper-studio/functions#set-lines-set-output-lines-overriding-previous-calls).
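The append-versus-override difference can be modeled with a few lines of plain JavaScript. The stand-ins below only mimic the semantics described above; they are not the platform implementation, and the assumption that `set_lines()` receives the full array of records collected so far is ours.

```js
// In-memory stand-ins that mimic the documented semantics only.
function make_output() {
  const collected = []; // records appended with collect()
  let lines = null;     // latest snapshot passed to set_lines()
  return {
    collect(record) { collected.push(record); },
    set_lines(records) { lines = [...records]; }, // overrides earlier calls
    collected: () => collected,
    lines: () => lines,
  };
}

// collect(): one record per call, and each call adds to the output.
const appended = make_output();
appended.collect({id: 1});
appended.collect({id: 2});

// set_lines(): pass everything gathered so far; the last call made before
// an error is the snapshot that remains.
const snapshot = make_output();
const so_far = [];
try {
  for (const id of [1, 2, 3]) {
    if (id === 3) throw new Error('page 3 failed'); // simulated failure
    so_far.push({id});
    snapshot.set_lines(so_far);
  }
} catch (e) {
  // Caught only so this sketch keeps running; in a real scraper the error
  // would propagate to the platform.
}
```

After the simulated failure, `snapshot.lines()` still holds the two records gathered before the error.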
# Scraper Studio FAQs
Source: https://docs.brightdata.com/datasets/scraper-studio/faqs
FAQs about Bright Data Scraper Studio: 2 scraper types, inputs and outputs, billing, limits, snapshots and reporting issues for custom scraper builds.
This page answers the questions the Bright Data support team hears most often about Scraper Studio. If you need a walkthrough instead of a quick answer, start with [Understanding Scraper Studio](/datasets/scraper-studio/introduction).
## Frequently asked general questions
A Bright Data web scraper is an automated script that collects public web data at scale through Bright Data's proxy and unblocking infrastructure. It returns the collected data in a structured format (JSON, NDJSON, CSV, XLSX) and can deliver it to an API endpoint, webhook, cloud storage, or SFTP. Bright Data maintains hundreds of pre-built scrapers for popular sites in the [Scrapers Library](https://brightdata.com/products/web-scraper).
Bright Data Scraper Studio is a cloud-hosted environment for building custom scrapers. It offers two modes: an AI Agent that generates a scraper from a natural-language description, and an IDE where you write JavaScript directly. Both modes run on the same Bright Data proxy and unblocking infrastructure. See [Understanding Scraper Studio](/datasets/scraper-studio/introduction).
The [Scrapers Library](https://brightdata.com/products/web-scraper) contains pre-built scrapers Bright Data maintains for popular sites such as Amazon, LinkedIn, and Instagram. Bright Data Scraper Studio is the environment you use to build custom scrapers when the site you need is not in the library.
A single scraper can navigate to any URL you pass in as input, so it can cover more than one site. If you need different extraction logic per site, use multiple stages (`next_stage()`) or build one scraper per site.
## Inputs, outputs, and schemas
An input is the parameter set Bright Data Scraper Studio passes into the scraper for a single run. Typical inputs are a URL, a search keyword, a product ID or ASIN, a profile handle, or a date range. Multiple inputs can be passed in one job via CSV upload or the API.
The output is the structured data the scraper returns for an input. Bright Data Scraper Studio delivers output in JSON, NDJSON, CSV, XLSX, or Parquet based on the scraper's delivery preferences.
One input can generate multiple records. For example, if you submit 5 product listing URLs and each listing page contains 20 products, the scraper returns 100 records from 5 inputs. The statistics page counts records, not inputs.
A search scraper takes a keyword as input instead of a URL. Bright Data Scraper Studio runs a search on the target site and extracts data from the result pages. Use a search scraper when you do not have specific URLs to scrape.
A discovery scraper collects data from listing pages such as search results, category pages, or directories. It extracts fields that appear directly on the listing (titles, prices, ratings) and can also collect product URLs or IDs for a follow-up product-page scrape.
When the input or output schema changes, the scraper must be updated to match. If you trigger the scraper before Bright Data has updated it, you will see an `input(output)_schema_incompatible` error.
To trigger the scraper anyway and ignore the schema mismatch, click **Trigger anyway** in the UI or add a parameter to your API request:
* Output schema incompatible: `override_incompatible_schema=1`
* Input schema incompatible: `override_incompatible_input_schema=1`
```bash theme={null}
curl "https://api.brightdata.com/dca/trigger?scraper=ID_COLLECTOR&queue_next=1&override_incompatible_schema=1" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer API_KEY" \
-d '[{"url":"https://targetwebsite.com/product_id/"}]'
```
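The same request can be assembled in JavaScript. The sketch below only builds the URL and request options and does not send anything; `build_trigger_request` is a hypothetical helper, and `ID_COLLECTOR` and `API_KEY` are placeholders exactly as in the curl command.

```js
// Builds the same request as the curl command above without sending it.
function build_trigger_request({
  scraper_id,
  api_key,
  inputs,
  override_output = false, // adds override_incompatible_schema=1
  override_input = false,  // adds override_incompatible_input_schema=1
}) {
  const params = new URLSearchParams({scraper: scraper_id, queue_next: '1'});
  if (override_output) params.set('override_incompatible_schema', '1');
  if (override_input) params.set('override_incompatible_input_schema', '1');
  return {
    url: `https://api.brightdata.com/dca/trigger?${params}`,
    options: {
      method: 'POST', // curl -d sends a POST
      headers: {
        'Content-Type': 'application/json',
        Authorization: `Bearer ${api_key}`,
      },
      body: JSON.stringify(inputs),
    },
  };
}

const trigger = build_trigger_request({
  scraper_id: 'ID_COLLECTOR',
  api_key: 'API_KEY',
  inputs: [{url: 'https://targetwebsite.com/product_id/'}],
  override_output: true,
});
// To send it (Node 18+ or a browser):
//   const res = await fetch(trigger.url, trigger.options);
```

Keeping the override flags as explicit options makes it harder to leave a schema override switched on by accident.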
## Building and editing scrapers
You have three options in Bright Data Scraper Studio:
* **Build with the AI Agent**: describe the data you want in plain language. See [Scraper Studio AI Agent](/datasets/scraper-studio/ai-agent).
* **Build in the IDE**: write JavaScript directly. See [Develop a scraper with the IDE](/datasets/scraper-studio/develop-a-scraper).
* **Request a managed scraper**: Bright Data's data team builds and maintains the scraper for you.
Every scraper, including ones the AI Agent generated, can be opened and edited in the Bright Data Scraper Studio IDE. You can change extraction logic, modify selectors, add or remove output fields, and tune performance. If you prefer not to write code, use the [Self-Healing tool](/datasets/scraper-studio/self-healing-tool) to request changes in plain language.
Pass the AI Agent a target URL and an optional description of the data you want. The agent asks clarifying questions, generates an output schema for your review, and writes the full scraper code once you approve the schema. You can then run the scraper immediately or schedule recurring runs. See [Scraper Studio AI Agent](/datasets/scraper-studio/ai-agent) for a full walkthrough.
## Running and triggering scrapers
Bright Data Scraper Studio supports three trigger methods:
* **By API**: regular request, queue request, or replace request
* **Manually**: from the control panel
* **On a schedule**: hourly, daily, weekly, or custom
See [Initiate collection and delivery](/datasets/scraper-studio/initiate-collection-and-delivery-options).
A queued request tells Bright Data Scraper Studio to wait until the previous request for the same scraper finishes before starting the next one. Use it when you want serial execution instead of running multiple jobs in parallel.
Bright Data Scraper Studio runs up to **1,000 batch jobs in parallel** per scraper. Additional jobs queue automatically and start as capacity frees up. See [Scraper Studio specifications](/datasets/scraper-studio/specifications) for full limits.
In the Bright Data Scraper Studio dashboard, click the **Bug** icon under **Failed crawls** to open the scraper in the IDE. Failed inputs appear in the **Last errors** tab with the exact error message and error code. Bright Data stores the last 1,000 errors per virtual job so you can re-run failed inputs and diagnose the issue.
## Billing and limits
CPM stands for "cost per mille": the price of 1,000 page loads. Bright Data Scraper Studio bills page loads in CPM units. Current rates are on the [pricing page](https://brightdata.com/pricing/web-scraper/custom).
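As a rough worked example of CPM arithmetic, the sketch below estimates a job's page-load cost. `estimate_cost` is a hypothetical helper and the price of 1.5 per CPM is an illustrative number, not an actual rate; media downloads, which are billed per GB, are not included.

```js
// Hypothetical estimator: page loads are billed per 1,000 (one CPM unit).
function estimate_cost(inputs, page_loads_per_input, price_per_cpm) {
  const page_loads = inputs * page_loads_per_input;
  return {page_loads, cost: (page_loads / 1000) * price_per_cpm};
}

// 5,000 inputs at 2 page loads each, priced at an illustrative 1.5 per CPM:
// 10,000 page loads = 10 CPM units = 15.
const estimate = estimate_cost(5000, 2, 1.5);
```

The average page loads per input is the main lever: a scraper that needs two page loads per input costs twice as much per input as one that needs a single load.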
A billable event is any function that causes Bright Data Scraper Studio to load a page or perform a network request:
* `navigate()`
* `request()`
* `load_more()`
* Media file download (billed per GB, separate from CPM)
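The free trial includes **100 records**. A record is one row of output, not one page load, so the number of page loads the trial covers depends on the scraper: a scraper that returns several records per page load reaches 100 records in fewer than 100 page loads, while one that needs several page loads per record uses more.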
## Snapshots and data retention
Snapshot retention depends on the collection type:
* **Batch collections**: 16 days
* **Real-time collections**: 7 days
After that, snapshots are permanently deleted. Bright Data does not recover expired data. Download or export your data before the retention window closes, or configure the scraper to deliver results automatically via webhook, API download, or cloud storage.
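If you track snapshots in your own system, a small helper can flag the deadline for downloading them. This is a sketch under an assumption: it treats the retention window as starting at the snapshot's creation time, which this page does not state explicitly. `snapshot_expiry` and the `batch`/`realtime` keys are hypothetical names.

```js
// Retention windows from the list above, in days.
const RETENTION_DAYS = {batch: 16, realtime: 7};
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Hypothetical helper: returns the Date after which a snapshot should be
// considered gone, assuming the window starts at created_at.
function snapshot_expiry(created_at, collection_type) {
  const days = RETENTION_DAYS[collection_type];
  if (days === undefined)
    throw new Error(`Unknown collection type: ${collection_type}`);
  return new Date(new Date(created_at).getTime() + days * MS_PER_DAY);
}

const batch_expiry = snapshot_expiry('2024-01-01T00:00:00Z', 'batch');
const realtime_expiry = snapshot_expiry('2024-01-01T00:00:00Z', 'realtime');
```

In practice, schedule the download well before the computed date rather than on it.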
## How to report issues
Open the scraper in the Bright Data Scraper Studio control panel and select **Report an issue** from the three-dots menu. Bright Data routes the ticket to a different team based on the issue type:
* **Data** (missing fields, missing records, parsing errors): routed to the scraper engineer. Available only for managed scrapers.
* **Collection and delivery** (incomplete delivery, slow scraper): routed to the support team.
* **Other** (UI issue, product question): routed to the account manager.
Include the affected job ID, a description of the problem, and a screenshot or file if it helps show the issue.
Include:
* The issue category (wrong data, missing records, delivery problem, IDE issue, other)
* A description of the exact problem
* The affected job ID
* A screenshot or file that shows the issue, if possible
Bright Data opens a ticket automatically and the R\&D team handles it.
You receive an email when a Bright Data engineer starts building the scraper and another email when it is ready. You can also track the status on the Scrapers dashboard.
# Scraper Studio dashboard features
Source: https://docs.brightdata.com/datasets/scraper-studio/features
Reference for the Bright Data Scraper Studio dashboard across 5 panels: scraper action menu, properties, delivery preferences, batch limits and statistics.
The Bright Data Scraper Studio dashboard is where every production scraper lives. Use it to inspect scraper properties, trigger runs, configure delivery, manage versions, and review collection statistics. This page is a reference for every control you see on the **My Scrapers** page and the statistics drill-down.
## The Scrapers dashboard
Every scraper you create in Bright Data Scraper Studio, whether generated by the AI Agent, built from a template, or written from scratch, appears under **My Scrapers** on the Scrapers dashboard.
### What the dashboard shows
| Control | Description |
| ------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| **Free trial** | The 7-day free trial includes 1,000 page loads |
| **Update available** | A new scraper version is published. If the Update button is hidden, you are on the latest version |
| **Delivery preferences** | Choose file format, delivery destination, and notification settings. See [Initiate collection and delivery](/datasets/scraper-studio/initiate-collection-and-delivery-options) |
| **Parallel job limit** | Bright Data Scraper Studio runs up to 1,000 batch jobs in parallel. Extra jobs queue automatically. See [Specifications](/datasets/scraper-studio/specifications) |
## What does each scraper action menu option do?
Every scraper has a three-dots action menu that exposes the following commands.
| Action | Description |
| --------------------- | --------------------------------------------------------------------------- |
| **Initiate by API** | Trigger a collection through the REST API without opening the control panel |
| **Initiate manually** | Trigger a collection from the control panel |
| **Run on schedule** | Configure a recurring run: hourly, daily, weekly, or custom |
| **Versions** | Review and roll back to an earlier saved version of the scraper |
| **Report an issue** | Submit a platform, scraper, or data quality ticket |
| **Copy link** | Copy a shareable URL for the scraper |
| **Tickets** | View existing support tickets for this scraper |
| **Edit the code** | Open the scraper in the Bright Data Scraper Studio IDE |
| **Disable scraper** | Temporarily deactivate the scraper; can be reactivated later |
| **Delete scraper** | Permanently delete the scraper |
## What properties does each scraper expose?
Every scraper carries metadata that helps you identify it at a glance on the dashboard.
| Property | Values |
| ----------------------------- | ------------------------------------------------------------------------------------------------------ |
| **Maintainer** | `Self-serve` (you maintain the scraper) or `Full-service` (Bright Data maintains it) |
| **Type** | `Search` (keyword input), `PDP` (product page URL input), `Discovery` (category URL input), or `Other` |
| **Use case** | Category tag: Social media, eCommerce, Travel, etc. |
| **Last modified** | Timestamp of the last code edit or schema change |
| **Price per CPM** | Cost for 1,000 page loads |
| **Avg. page-loads per input** | Average number of page loads the scraper uses to process one input |
## How do I trigger a scraper and get results?
To trigger a scraper, use the three initiation options documented on [Initiate collection and delivery](/datasets/scraper-studio/initiate-collection-and-delivery-options):
* [Initiate by API](/datasets/scraper-studio/initiate-collection-and-delivery-options#how-do-i-trigger-a-scraper-run)
* [Initiate manually](/datasets/scraper-studio/initiate-collection-and-delivery-options#how-do-i-trigger-a-scraper-run)
* [Schedule a scraper](/datasets/scraper-studio/initiate-collection-and-delivery-options#how-do-i-trigger-a-scraper-run)
To retrieve results, click the three-dots icon on a scraper row and select **Statistics** to open the collection statistics page, where you can download the data.
Real-time job inputs and outputs are not stored on Bright Data's side and cannot be downloaded from the statistics page. Capture them in your own application when the request returns.
## What does the collection statistics page show?
The statistics page shows per-job metrics for every run of a scraper.
| Column | Description |
| ----------------------- | ----------------------------------------------------------------------- |
| **Job ID** | Unique identifier for the collection job |
| **Trigger** | Who triggered the job and how (API, manual, or schedule) |
| **Inputs** | Number of inputs submitted to the collection |
| **Records** | Number of records the scraper returned |
| **Failed** | Number of pages that failed to crawl |
| **Success rate** | Percentage of inputs that produced successful records |
| **Queued at** | Timestamp when the job entered the queue |
| **Started at** | Timestamp when the scraper started collecting |
| **Finished at** | Timestamp when the scraper finished collecting |
| **Job time** | Total job duration |
| **Estimated time left** | Time remaining for an in-progress job |
| **Queue** | Name of the queue the job is running in (from the trigger's queue name) |
| **Usage** | Total page loads consumed by the job |
The three-dots menu on each job row exposes download, re-run, and report actions.
# Scraper Studio functions reference
Source: https://docs.brightdata.com/datasets/scraper-studio/functions
Reference for Bright Data Scraper Studio interaction and parser functions: navigation, waits, selectors and parser helpers with parameters and 30+ examples.
This reference documents every function available in Bright Data Scraper Studio's IDE: the interaction code that controls a browser session, and the parser code that turns HTML into structured records. Each function lists its parameters, return value, and a runnable example.
Functions marked with **⭐** work only with the Browser worker and throw an error when called from a Code worker. See [Browser-only functions](#browser-only-functions) for the full list.
## How is Scraper Studio code organized?
A Bright Data Scraper Studio scraper uses two code types:
| Code type | Role | Language and libraries |
| ---------------- | ----------------------------------------------------------------------------------------------- | ----------------------------------------- |
| Interaction code | Navigates the target site: URL requests, clicks, scrolls, waits, and background traffic capture | JavaScript + Bright Data browser commands |
| Parser code | Extracts and structures data from the HTML returned by interaction code | JavaScript + Cheerio (`$`) |
You move data from one to the other with `parse()` (which runs the parser) and `collect()` (which appends a record to the final dataset).
## Interaction functions
Interaction functions run in the scraper's main JavaScript context and drive the browser or HTTP client. Use them to navigate, wait for elements, interact with the page, capture network traffic, and hand off data to the parser.
### Global objects
| Name | Type | Description |
| ---------- | ------ | -------------------------------------------------------------------------------------------------------------------------- |
| `input` | object | Input for the current stage, set by the trigger or by a previous `next_stage()`/`run_stage()`/`rerun_stage()` call. |
| `job` | object | Metadata about the current job (for example `job.created`, the job start timestamp). |
| `location` | object | Info about the current browser location. Field: `href`. |
| `parser` | object | Values captured by `tag_response`, `tag_script`, and related tagging functions, available after `wait_for_parser_value()`. |
```js theme={null}
navigate(input.url);
let {created} = job;
console.log('current url', location.href);
```
### Navigation
#### `navigate`, Load a URL in the browser
Navigates the browser to a URL. A 404 status throws a `dead_page` error by default; override with `allow_status`.
**Parameters**
| Parameter | Type | Required | Default | Description |
| ------------------ | ------------- | -------- | ------- | -------------------------------------------------------------- |
| `url`              | string or URL | Yes      | none    | Target URL                                                     |
| `opt.wait_until`   | string        | No       | `load`  | `load`, `domcontentloaded`, `networkidle0`, or `networkidle2`  |
| `opt.timeout`      | number        | No       | `30000` | Navigation timeout in milliseconds                             |
| `opt.referer`      | string        | No       | none    | `Referer` header to send                                       |
| `opt.allow_status` | number\[]     | No       | `[]`    | HTTP status codes to accept without throwing                   |
| `opt.fingerprint`  | object        | No       | none    | Override browser fingerprint (`screen.width`, `screen.height`) |
```js theme={null}
navigate(input.url);
navigate('https://example.com');
navigate('https://example.com', {wait_until: 'domcontentloaded'});
navigate('https://example.com', {referer: 'https://google.com'});
navigate('https://example.com', {timeout: 45000});
navigate('https://example.com', {allow_status: [404]});
navigate('https://example.com', {
fingerprint: {screen: {width: 400, height: 400}},
});
```
#### `request`, Make a direct HTTP request
Sends an HTTP request without using a browser. Use on Code worker, or on Browser worker when you want to bypass the browser.
**Parameters**
| Parameter | Type | Required | Description |
| ------------------ | ---------------- | -------- | ---------------------------------------------------------------- |
| `url` \| `options` | string or object | Yes | URL string, or an object with `url`, `method`, `headers`, `body` |
```js theme={null}
let res = request('https://www.example.com');
let post_res = request({
url: 'https://www.example.com',
method: 'POST',
headers: {'Content-type': 'application/json'},
body: {hello: 'world'},
});
```
#### `next_stage`, Queue input for the next stage
Runs the next stage of the scraper in a new browser session with the given input.
**Parameters**
| Parameter | Type | Required | Description |
| --------- | ------ | -------- | ------------------------------------- |
| `input` | object | Yes | Input object passed to the next stage |
```js theme={null}
next_stage({url: 'https://example.com', page: 1});
```
#### `run_stage`, Run a specific stage
Runs a specific stage of the scraper, identified by its index, in a new browser session.
**Parameters**
| Parameter | Type | Required | Description |
| --------- | ------ | -------- | --------------------------------- |
| `stage` | number | Yes | Stage index (starts at 1) |
| `input` | object | Yes | Input object passed to that stage |
```js theme={null}
run_stage(2, {url: 'https://example.com', page: 1});
```
#### `rerun_stage`, Re-run the current stage with new input
Runs this stage again with a new input. Use it to fan out work (for example, one re-run per page in a pagination).
```js theme={null}
rerun_stage({url: 'https://example.com/other-page'});
```
#### `load_sitemap`, Read URLs from an XML sitemap
Loads a sitemap XML file and returns the URL list. Supports sitemap indexes and gzip-compressed sitemaps.
**Parameters**
| Parameter | Type | Required | Description |
| ------------- | ------ | -------- | ----------- |
| `options.url` | string | Yes | Sitemap URL |
```js theme={null}
let {pages} = load_sitemap({url: 'https://example.com/sitemap.xml.gz'});
let {children} = load_sitemap({url: 'https://example.com/sitemap-index.xml'});
```
#### `resolve_url`, Follow a URL through redirects
Returns the final URL that the given URL argument leads to.
**Parameters**
| Parameter | Type | Required | Description |
| --------- | ------------- | -------- | -------------- |
| `url` | string or URL | Yes | URL to resolve |
```js theme={null}
let {href} = parse().anchor_elem_data;
collect({final_url: resolve_url(href)});
```
#### `redirect_history`, Get the redirect chain
Returns the history of URL redirects since the last `navigate()` call.
```js theme={null}
navigate('http://google.com');
let redirects = redirect_history();
// ['http://google.com', 'http://www.google.com', 'https://www.google.com/']
```
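A redirect chain is often more useful once summarized. The sketch below assumes, based on the example above, that `redirect_history()` returns an array of URL strings starting with the requested URL; `summarize_redirects` is a hypothetical helper, not a platform function.

```js
// Hypothetical helper: condenses a redirect chain into a few facts.
function summarize_redirects(chain) {
  if (chain.length === 0) return null;
  const first = new URL(chain[0]);
  const last = new URL(chain[chain.length - 1]);
  return {
    hops: chain.length - 1, // number of redirects followed
    final_url: last.href,
    upgraded_to_https:
      first.protocol === 'http:' && last.protocol === 'https:',
    changed_host: first.hostname !== last.hostname,
  };
}

// The chain from the example above.
const summary = summarize_redirects([
  'http://google.com',
  'http://www.google.com',
  'https://www.google.com/',
]);
```

Inside a scraper, the result could be attached to the record, for example `collect({...summarize_redirects(redirect_history())})`.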
#### `response_headers`, Read the last response headers
Returns the response headers from the last page load.
```js theme={null}
let headers = response_headers();
console.log('content-type', headers['content-type']);
```
#### `status_code`, Read the last response status
Returns the HTTP status code of the last page load.
```js theme={null}
collect({status_code: status_code()});
```
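Combining `allow_status` with `status_code()` lets a scraper record that a page is gone instead of failing the input. The following sketch shows the pattern; the stand-in functions at the top exist only so it runs outside Scraper Studio, and they imitate just the behavior described in this reference (a 404 throws `dead_page` unless allowed). The `available` field and the `scrape_with_status` name are made up for illustration.

```js
// --- Stand-ins so the sketch runs outside Scraper Studio (hypothetical) ---
const records = [];
let last_status = 200;
const fake_site = {
  'https://example.com/ok': 200,
  'https://example.com/gone': 404,
};
function navigate(url, opt = {}) {
  last_status = fake_site[url] ?? 200;
  const allowed = opt.allow_status ?? [];
  if (last_status === 404 && !allowed.includes(404))
    throw new Error('dead_page');
}
function status_code() { return last_status; }
function collect(record) { records.push(record); }

// --- The pattern itself ---
function scrape_with_status(url) {
  // Accept 404 so the stage can report it instead of throwing.
  navigate(url, {allow_status: [404]});
  const code = status_code();
  collect({url, status_code: code, available: code !== 404});
}

scrape_with_status('https://example.com/ok');
scrape_with_status('https://example.com/gone');
```

This keeps removed pages visible in the dataset, which is useful when downstream systems need to know that an item disappeared.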
### Waiting on the page ⭐
The wait functions marked with ⭐ below are Browser worker only.
#### ⭐ `wait`, Wait for an element to appear
**Parameters**
| Parameter | Type | Required | Default | Description |
| ------------- | ------- | -------- | ------- | ---------------------------------------------------- |
| `selector`    | string  | Yes      | none    | CSS selector to wait for                             |
| `opt.timeout` | number  | No       | `30000` | Timeout in milliseconds                              |
| `opt.hidden`  | boolean | No       | `false` | Wait for the element to be hidden instead of visible |
| `opt.inside`  | string  | No       | none    | Selector of an iframe to look inside                 |
```js theme={null}
wait('#welcome-splash');
wait('.search-results .product');
wait('[href^="/product"]');
wait('#welcome-splash', {timeout: 5000});
wait('#welcome-splash', {hidden: true});
wait('#welcome-splash', {inside: '#iframe_id'});
```
#### ⭐ `wait_any`, Wait for any of several conditions
Waits for any matching condition to succeed. Returns when the first selector resolves.
```js theme={null}
wait_any(['#title', '#notfound']);
```
#### ⭐ `wait_visible`, Wait for an element to be visible
**Parameters**
| Parameter | Type | Required | Default | Description |
| ------------- | ------ | -------- | ------- | ----------------------- |
| `selector`    | string | Yes      | none    | CSS selector            |
| `opt.timeout` | number | No | `30000` | Timeout in milliseconds |
```js theme={null}
wait_visible('#welcome-splash');
wait_visible('#welcome-splash', {timeout: 5000});
```
#### ⭐ `wait_hidden`, Wait for an element to disappear
**Parameters**
| Parameter | Type | Required | Default | Description |
| ------------- | ------ | -------- | ------- | ----------------------- |
| `selector`    | string | Yes      | none    | CSS selector            |
| `opt.timeout` | number | No | `30000` | Timeout in milliseconds |
```js theme={null}
wait_hidden('#welcome-splash');
wait_hidden('#welcome-splash', {timeout: 5000});
```
#### ⭐ `wait_for_text`, Wait for text content
Waits for an element on the page to contain the given text.
**Parameters**
| Parameter | Type | Required | Description |
| ---------- | ------ | -------- | ---------------- |
| `selector` | string | Yes | CSS selector |
| `text` | string | Yes | Text to wait for |
```js theme={null}
wait_for_text('.location', 'New York');
```
#### `wait_for_parser_value`, Wait for a parser field to be populated
Use after `tag_response()` or `tag_script()` to wait until the captured data is available.
**Parameters**
| Parameter | Type | Required | Description |
| ------------- | -------- | -------- | ---------------------------------------------------------- |
| `field` | string | Yes | Parser field path to wait on |
| `validate_fn` | function | No | Optional callback returning `true` when the value is valid |
| `opt.timeout` | number | No | Timeout in milliseconds |
```js theme={null}
wait_for_parser_value('profile');
wait_for_parser_value('listings.0.price', v => parseInt(v) > 0, {timeout: 5000});
```
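The validation callback is ordinary JavaScript, so it can be written and tested on its own. One thing to watch: `parseInt('$19.99')` is `NaN`, so a `parseInt`-based check never passes when the captured value carries a currency symbol. The sketch below is a stricter alternative; `is_positive_amount` is a hypothetical name, and it assumes "." is the decimal separator.

```js
// Hypothetical validator: true only when the value contains a number
// greater than zero, ignoring currency symbols and thousands separators.
function is_positive_amount(value) {
  const cleaned = String(value ?? '').replace(/[^\d.]/g, '');
  const amount = Number(cleaned);
  return Number.isFinite(amount) && amount > 0;
}

// Inside a scraper it would be passed as the second argument:
//   wait_for_parser_value('listings.0.price', is_positive_amount, {timeout: 5000});
```

Because the function returns `false` for empty, missing, and zero values, the wait continues until a usable price has actually been captured or the timeout expires.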
#### ⭐ `wait_network_idle`, Wait until the browser network settles
Waits until the browser network has been idle for a given period.
**Parameters**
| Parameter | Type | Required | Default | Description |
| ------------- | ------ | -------- | ------- | --------------------------------------------------- |
| `opt.timeout` | number | No | `500` | Milliseconds of idleness required |
| `opt.ignore` | array | No | `[]` | Patterns (string or RegExp) for requests to exclude |
```js theme={null}
wait_network_idle();
wait_network_idle({
timeout: 1e3,
ignore: [/long_request/, 'https://example.com'],
});
```
#### ⭐ `wait_page_idle`, Wait until DOM mutations stop
Waits until no changes are made to the DOM tree for a given period.
**Parameters**
| Parameter | Type | Required | Description |
| ------------------ | ------ | -------- | --------------------------------------------- |
| `opt.idle_timeout` | number | No | Milliseconds of stability required |
| `opt.ignore` | array | No | Selectors to exclude from mutation monitoring |
```js theme={null}
wait_page_idle();
wait_page_idle({
  ignore: ['.live-clock', '.carousel'],
  idle_timeout: 1000,
});
```
### Element interaction ⭐
All interaction functions require the Browser worker.
#### ⭐ `click`, Click an element
Clicks an element, waiting for it to appear first.
**Parameters**
| Parameter | Type | Required | Description |
| ----------------- | --------------- | -------- | ------------------------------------------------- |
| `selector` | string or array | Yes | CSS selector or Shadow DOM selector path |
| `opt.coordinates` | `{x, y}` | No | Click the closest match to given page coordinates |
```js theme={null}
click('#show-more');
$('#show-more').click();
// Click the map pin closest to the center of a map
let box = bounding_box('#map');
let center = {x: (box.left + box.right) / 2, y: (box.top + box.bottom) / 2};
click('.map-pin', {coordinates: center});
```
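The "closest match" behavior of `opt.coordinates` can be pictured as a nearest-neighbor pick among all elements matching the selector. The sketch below models that idea in plain JavaScript. It is not the platform's implementation: `box_center` and `closest_to` are hypothetical helpers, and measuring Euclidean distance from each element's center is an assumption.

```js
// Illustrative model only; not the platform's implementation of opt.coordinates.
// Boxes use the {left, right, top, bottom} shape seen in the example above.
function box_center(box) {
  return {x: (box.left + box.right) / 2, y: (box.top + box.bottom) / 2};
}

// Returns the index of the box whose center is nearest to `point`,
// or -1 when the list is empty.
function closest_to(boxes, point) {
  let best = -1;
  let best_dist = Infinity;
  boxes.forEach((box, i) => {
    const c = box_center(box);
    // Assumption: straight-line (Euclidean) distance between centers
    const dist = Math.hypot(c.x - point.x, c.y - point.y);
    if (dist < best_dist) {
      best = i;
      best_dist = dist;
    }
  });
  return best;
}
```

With a 400 by 200 map, the center is `{x: 200, y: 100}`, so a pin whose box sits around that point wins over pins near the corners.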
#### ⭐ `right_click`, Right-click an element
Same as `click` but uses the right mouse button.
```js theme={null}
right_click('#item');
```
#### ⭐ `hover`, Hover over an element
Moves the cursor over an element, waiting for it to appear first.
```js theme={null}
hover('#item');
```
#### ⭐ `mouse_to`, Move the cursor to a coordinate
Moves the cursor to the given page coordinates.
**Parameters**
| Parameter | Type | Required | Description |
| --------- | ------ | -------- | ----------------- |
| `x` | number | Yes | Target X position |
| `y` | number | Yes | Target Y position |
```js theme={null}
mouse_to(0, 0);
```
#### ⭐ `type`, Enter text into an input
Waits for the input to appear, then types the given text.
**Parameters**
| Parameter | Type | Required | Description |
| ------------- | --------------- | -------- | ----------------------------------------------------- |
| `selector` | string | Yes | CSS selector |
| `text` | string or array | Yes | Text to type, or an array of strings and special keys |
| `opt.replace` | boolean | No | Clear existing text before typing |
```js theme={null}
type('#location', 'New York');
type('#location', 'New York', {replace: true});
type('[id$=input-box]', 'search term');
type('#search', ['Some text', 'Enter']);
type('#search', ['Backspace']);
```
#### ⭐ `press_key`, Press a special key
Sends a special key, such as Enter or Backspace, to the currently focused input.
```js theme={null}
press_key('Enter');
press_key('Backspace');
```
#### ⭐ `select`, Pick a value from a select element
**Parameters**
| Parameter | Type | Required | Description |
| ---------- | ------ | -------- | ------------------------------------ |
| `selector` | string | Yes | CSS selector of a `