Build a scraper with the Bright Data CLI

The Bright Data CLI builds Bright Data Scraper Studio scrapers from your terminal in three commands: bdata login, bdata scraper create and bdata scraper run. Run it on demand with npx, so there is nothing to install. This tutorial walks you through building a Hacker News top-stories scraper end to end.

The CLI runs unchanged inside the embedded terminal of any coding agent, such as Claude Code, Cursor or Codex.

Time to complete: about 10 minutes (AI generation runs in the background).

Prerequisites

A Bright Data account (sign up free, no card required)
A terminal. Any embedded terminal works too: Claude Code, Cursor, Codex, VS Code

Run the Bright Data CLI with npx

You do not install anything. Run the CLI on demand with npx, which fetches the latest version each time:

npx -p @brightdata/cli bdata --version

To keep the rest of this tutorial’s commands short, alias bdata for your shell session. Otherwise, prefix any bdata command below with npx -p @brightdata/cli:

alias bdata="npx -p @brightdata/cli bdata"

The CLI is published as @brightdata/cli on npm. The bdata and brightdata commands are interchangeable. Prefer a permanent command instead of npx? Install it globally with npm install -g @brightdata/cli.

Build your first scraper from the terminal

Run bdata login. The CLI opens a browser tab so you can authorize it against your Bright Data account, then stores your API key locally. You do not paste or copy a key.

bdata login

Expected result:

Opening browser for Bright Data authentication...
Logged in successfully. Key: 2e75****12bf
Checking for required zones...
Zone "cli_unlocker" already exists.
Zone "cli_browser" already exists.

The two zones (cli_unlocker and cli_browser) are the Web Unlocker API and Browser API endpoints the CLI uses when running scrapers. Bright Data creates them automatically on first login.

Create the scraper

Pass a target URL and one sentence describing the data you want. Bright Data’s AI Agent generates the output schema, writes the scraper code and returns a Collector ID.

bdata scraper create https://news.ycombinator.com \
  "Extract top stories: title, url, points, author, comment count"

The AI pipeline runs in eight stages, printed live: user_intent_analyzer, planner, collector_maintainer, output_schema_generator, code_generator, input_schema_generator, preview_runner and preview_picker. Typical wall-clock time is 5 to 15 minutes; complex targets can take up to 25 minutes.

Expected result:

Template created: c_mpohus372o5tmid1jk
Triggering AI generation...
Generating scraper...
Step: user_intent_analyzer — polling (attempt 1/600)
...
Done in 280 poll attempts.
{"status":"done","completed_steps":[...],"step":"preview_picker"}

Save the Collector ID (the c_* string). It is the stable handle for every subsequent run, schedule or API call on this scraper.
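If you script the create step, you may want to capture the Collector ID from the CLI output instead of copying it by hand. The following is a minimal sketch, assuming the ID always looks like the sample above (c_ followed by lowercase letters and digits). The function name and the pattern are illustrative, not part of the CLI.

```python
import re
from typing import Optional

# Assumption: Collector IDs are "c_" followed by lowercase letters and digits,
# as in the sample output. Adjust the pattern if your IDs look different.
COLLECTOR_ID_PATTERN = re.compile(r"\bc_[a-z0-9]+\b")


def extract_collector_id(cli_output: str) -> Optional[str]:
    """Return the first c_* token found in the CLI output, or None."""
    match = COLLECTOR_ID_PATTERN.search(cli_output)
    return match.group(0) if match else None


sample_output = """Template created: c_mpohus372o5tmid1jk
Triggering AI generation...
Generating scraper..."""

print(extract_collector_id(sample_output))  # c_mpohus372o5tmid1jk
```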

Run the scraper

Pass the Collector ID and a URL. Use --pretty to format the JSON output.

bdata scraper run c_mpohus372o5tmid1jk https://news.ycombinator.com --pretty

The CLI tries realtime mode first. If the scraper triggers more pages than the realtime limit allows, the CLI automatically falls back to batch mode (POST /dca/trigger, then poll GET /dca/dataset), prints a notice and continues. No flag is needed.

Expected result: a JSON array, one row per result.

[
  {
    "title": "Last.fm is now independent",
    "url": "https://support.last.fm/t/last-fm-is-now-independent/118591",
    "points": 447,
    "author": "twistslider",
    "comment_count": 131
  },
  {
    "title": "DuckDuckGo search saw 28% more visits after Google said people love AI mode",
    "url": "https://www.pcgamer.com/hardware/duckduckgos-ai-free-search-saw-nearly-28-percent-more-visits-in-the-week-following-googles-insistence-that-people-love-ai-mode/",
    "points": 418,
    "author": "HelloUsername",
    "comment_count": 212
  }
]

How do I use this from Claude Code, Cursor or Codex?

The Bright Data CLI runs inside any embedded terminal as-is. The coding agent is not building the scraper itself; the CLI calls Bright Data’s AI Agent, and the coding agent calls the CLI on your behalf. Two integrations make the CLI feel native inside a coding agent:

Pin the Collector ID in the agent’s rules file so the agent re-uses your scraper across sessions instead of building a fresh one every time:

CLAUDE.md or CODEX.md:

SCRAPER_STUDIO_COLLECTOR_ID=c_mpohus372o5tmid1jk
HACKER_NEWS_SCRAPER_USAGE="bdata scraper run $SCRAPER_STUDIO_COLLECTOR_ID <url> --pretty"

Wire Bright Data’s MCP server into your agent with brightdata add mcp. The MCP server is separate from the Scraper Studio CLI but gives the agent additional scraping tools (scrape_as_markdown, search_engine and others) it can call directly:

brightdata add mcp                # interactive: pick Claude Code, Cursor or Codex

See the Bright Data MCP server quickstart for what the MCP exposes.

What just happened?

Three CLI commands mapped to six Bright Data Scraper Studio API endpoints. Use this table to translate the CLI flow into raw HTTP calls when you are ready to integrate without the CLI:

You ran	Bright Data API endpoint behind it
`bdata login`	Local credential store. Stores the API key from Account Settings.
`bdata scraper create`	`POST /dca/collector` then `POST /dca/collectors/{c_*}/automate_template`
`bdata scraper run` (small input)	`POST /dca/trigger_immediate` then `GET /dca/get_result`
`bdata scraper run` (large input)	`POST /dca/trigger` then poll `GET /dca/dataset?id=j_*`

For a worked example of the underlying API in cURL, Python and Node.js, see the Bright Data Scraper Studio API quickstart. For every endpoint, see the Scraper Studio API reference.

How do I fix a scraper when the site changes?

When a target site is redesigned and a scraper starts returning null or missing fields, fix it in place with bdata scraper heal. Self-healing keeps the same Collector ID, so every trigger, schedule and integration that references the scraper keeps working. The flow is run, inspect, heal, approve, re-run.

Heal the scraper

Pass the Collector ID and a plain-language description of what broke. Keep the prompt under 1,000 characters.

bdata scraper heal c_mpohus372o5tmid1jk \
  "The points and comment_count fields return null since the site redesign. Re-capture them from the new markup." \
  --url https://news.ycombinator.com

By default, heal stops at an approval gate so you can review the proposed fix before it goes live.

Expected result: the command returns an envelope with status: "awaiting_approval" and a preview_result showing sample output from the proposed fix, plus a next_step hint with the command to run next.

Approve or reject the fix

Review the preview. If it looks right, approve it. The fix commits to the existing scraper and the Collector ID does not change.

bdata scraper approve c_mpohus372o5tmid1jk --url https://news.ycombinator.com

To discard the proposed fix and try a sharper prompt instead, reject it:

bdata scraper approve c_mpohus372o5tmid1jk --reject

Expected result: on approval, status advances to done. On reject, the scraper is left unchanged so you can run heal again with a clearer prompt.
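The review step can be partly scripted. The sketch below decides whether a heal envelope looks safe to approve by checking that the previously broken fields are populated in the preview. It assumes preview_result is a list of row objects; the source only says it shows sample output, so adjust the access to the real shape. The function name and the sample envelope are hypothetical.

```python
from typing import Any


def review_heal_envelope(envelope: dict[str, Any], required_fields: list[str]) -> str:
    """Return "approve", "reject" or "wait" for a heal result envelope."""
    if envelope.get("status") != "awaiting_approval":
        return "wait"  # nothing to review yet
    # Assumption: preview_result is a list of row dicts.
    rows = envelope.get("preview_result") or []
    if not rows:
        return "reject"
    for row in rows:
        # A field that is missing or null means the fix did not take.
        if any(row.get(field) is None for field in required_fields):
            return "reject"
    return "approve"


# Hypothetical envelope, for illustration only.
envelope = {
    "status": "awaiting_approval",
    "preview_result": [{"title": "Example story", "points": 447, "comment_count": 131}],
}
print(review_heal_envelope(envelope, ["points", "comment_count"]))  # approve
```

A check like this complements a manual look at the preview rather than replacing it: it only confirms that the fields are present, not that their values are right.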

Re-run to verify

Run the healed scraper on the same URL and confirm the previously broken fields are populated.

bdata scraper run c_mpohus372o5tmid1jk https://news.ycombinator.com --pretty

For unattended workflows, add --auto-approve to the heal command. It approves the fix automatically and polls through to done in one step. Use it only when you trust the heal without a manual review.

Frequently asked questions

Why did `bdata scraper create` take longer than 10 minutes?

AI generation timing depends on target complexity. Simple single-page scrapers finish in 5 to 10 minutes. Pages with lazy-load, pagination or anti-bot challenges can take 15 to 25 minutes. The CLI polls Bright Data’s AI Flow API every five seconds and prints the current stage, so you can leave it running and check back. No action is needed while you wait.
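As a rough worked example, the poll count printed by the CLI converts to wall-clock time as follows. This assumes polls are evenly spaced at five seconds and ignores the time each request takes, so real runs will be somewhat longer. Reading 600 as the attempt cap is an inference from the "attempt 1/600" line in the sample output.

```python
POLL_INTERVAL_SECONDS = 5  # from the text: the CLI polls every five seconds
MAX_ATTEMPTS = 600         # inferred from "attempt 1/600" in the sample output


def elapsed_minutes(poll_attempts: int, interval_seconds: int = POLL_INTERVAL_SECONDS) -> float:
    """Approximate minutes spent polling, ignoring per-request latency."""
    return poll_attempts * interval_seconds / 60


print(round(elapsed_minutes(280), 1))  # 23.3
print(elapsed_minutes(MAX_ATTEMPTS))   # 50.0
```

Under these assumptions, the 280 attempts in the sample run correspond to roughly 23 minutes, and the polling budget runs out after about 50 minutes.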

Why did the CLI switch from realtime to batch mode mid-run?

Realtime mode caps the number of page loads per request. When a scraper triggers more pages than the realtime limit allows, the CLI prints Realtime page limit exceeded, switching to batch mode..., submits the same inputs to POST /dca/trigger, and polls GET /dca/dataset?id=j_* until the snapshot is ready. The switch is automatic and the final JSON shape is identical. See Scraper Studio specifications for the page-load limits.
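If you reimplement this flow against the API, the fallback can be sketched as a try-then-retry wrapper. This is an illustration of the pattern, not the CLI's actual code: RealtimeLimitExceeded, run_realtime and run_batch are hypothetical names standing in for your own realtime and batch calls.

```python
from typing import Callable

Rows = list[dict]


class RealtimeLimitExceeded(Exception):
    """Hypothetical error for a run that exceeds the realtime page limit."""


def run_with_fallback(
    inputs: Rows,
    run_realtime: Callable[[Rows], Rows],
    run_batch: Callable[[Rows], Rows],
) -> Rows:
    """Try realtime first; resubmit the same inputs in batch mode on overflow."""
    try:
        return run_realtime(inputs)
    except RealtimeLimitExceeded:
        print("Realtime page limit exceeded, switching to batch mode...")
        return run_batch(inputs)


# Fakes for illustration: realtime always overflows, batch echoes the inputs.
def fake_realtime(inputs: Rows) -> Rows:
    raise RealtimeLimitExceeded()


def fake_batch(inputs: Rows) -> Rows:
    return [{"url": item["url"], "mode": "batch"} for item in inputs]


rows = run_with_fallback([{"url": "https://news.ycombinator.com"}], fake_realtime, fake_batch)
print(rows[0]["mode"])  # batch
```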

Why are some rows missing fields like `points` or `comment_count`?

The AI Agent’s generated schema is per-row best-effort, not strict. Job posts, “Show HN” entries and very new submissions on Hacker News do not always have points or a comment count yet, so the scraper returns the row with those fields omitted rather than inventing a value. Treat missing fields as null in your own code. To enforce a stricter schema, open the scraper in Scraper Studio or rewrite the schema with the Self-Healing tool.
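One way to treat missing fields as null is to normalize every row to a fixed set of keys before using it. The sketch below uses the five fields requested in this tutorial's create prompt; the helper name and the sample row are illustrative.

```python
import json

# Field names taken from the create prompt used in this tutorial.
EXPECTED_FIELDS = ("title", "url", "points", "author", "comment_count")


def normalize_row(row: dict) -> dict:
    """Return a row with every expected field present; missing ones become None."""
    return {field: row.get(field) for field in EXPECTED_FIELDS}


# Hypothetical row with omitted fields, as a job post might produce.
raw = '[{"title": "Example job post", "url": "https://example.com/job"}]'
rows = [normalize_row(row) for row in json.loads(raw)]
print(rows[0]["points"])  # None
```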

Can I trigger this scraper from my own code instead of the CLI?

Yes. The Collector ID returned by bdata scraper create (the c_* string) is the same handle the Bright Data Scraper Studio API uses. Pass it to POST /dca/trigger from any HTTP client. See the Bright Data Scraper Studio API quickstart for cURL, Python and Node.js examples.
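As a rough sketch, a trigger request could be assembled with the Python standard library as shown below. Only the POST /dca/trigger path comes from this page. The base URL, the collector query parameter, the Bearer authorization header and the list-of-inputs body are assumptions here, so confirm them against the API quickstart before relying on this.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.brightdata.com"  # assumption: not stated on this page


def build_trigger_request(api_key: str, collector_id: str, inputs: list[dict]) -> urllib.request.Request:
    """Build (but do not send) a POST /dca/trigger request."""
    # Assumption: the Collector ID travels in a "collector" query parameter.
    query = urllib.parse.urlencode({"collector": collector_id})
    return urllib.request.Request(
        url=f"{API_BASE}/dca/trigger?{query}",
        data=json.dumps(inputs).encode("utf-8"),  # assumption: body is a JSON list of inputs
        headers={
            "Authorization": f"Bearer {api_key}",  # assumption: Bearer token auth
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_trigger_request(
    "your_api_key_here",
    "c_mpohus372o5tmid1jk",
    [{"url": "https://news.ycombinator.com"}],
)
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request)
```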

How do I fix the scraper when the target site changes?

Fix it in place with bdata scraper heal, which keeps the same Collector ID. See How do I fix a scraper when the site changes? above for the full run, heal, approve, re-run flow. Two alternatives:

Control panel: use the Self-Healing tool to describe the fix in plain language.
Direct API: the heal and approve commands wrap a three-call loop. POST /dca/collectors/{c_*}/refactor_template with the prompt, poll GET .../refactor_template/progress until status is pending_answer, then POST .../resume_automation_job to approve or reject. See Trigger Self-Healing and Resume Self-Healing Job. For a Node.js implementation, see the Scraper Studio Self-Healing demo.
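The three-call loop described under Direct API can be sketched as follows. The transport is injected so the control flow can be shown without committing to an HTTP client, and the request bodies are passed through untouched because their fields are not documented on this page. The sketch assumes the abbreviated paths share the /dca/collectors/{c_*} prefix; the "running" status and the polling defaults are placeholders.

```python
import time
from typing import Callable, Optional

# Hypothetical transport signature: (method, path, json_body) -> parsed JSON reply.
Send = Callable[[str, str, Optional[dict]], dict]


def heal_via_api(
    send: Send,
    collector_id: str,
    heal_body: dict,
    resume_body: dict,
    poll_seconds: float = 5.0,  # placeholder default
    max_polls: int = 600,       # placeholder default
) -> dict:
    """Trigger a heal, wait for the approval gate, then resume the job."""
    # Assumption: the abbreviated paths share this collector prefix.
    base = f"/dca/collectors/{collector_id}"
    send("POST", f"{base}/refactor_template", heal_body)
    for _ in range(max_polls):
        progress = send("GET", f"{base}/refactor_template/progress", None)
        if progress.get("status") == "pending_answer":
            # resume_body carries the approve or reject decision.
            return send("POST", f"{base}/resume_automation_job", resume_body)
        time.sleep(poll_seconds)
    raise TimeoutError("heal did not reach pending_answer")


# Fake transport for illustration: reports "pending_answer" on the second poll.
calls: list[tuple[str, str]] = []


def fake_send(method: str, path: str, body: Optional[dict]) -> dict:
    calls.append((method, path))
    if path.endswith("/progress"):
        polls = sum(1 for _, p in calls if p.endswith("/progress"))
        return {"status": "pending_answer" if polls >= 2 else "running"}
    return {"ok": True}


result = heal_via_api(fake_send, "c_mpohus372o5tmid1jk", {}, {}, poll_seconds=0)
print(len(calls))  # 4
```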

How do I use the CLI in a headless environment?

The bdata login command requires a browser callback. For headless environments, export your API key as BRIGHTDATA_API_KEY and the CLI uses it directly without a login step:

export BRIGHTDATA_API_KEY="your_api_key_here"
bdata scraper run c_mpohus372o5tmid1jk https://news.ycombinator.com

Copy the key from Account Settings.
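In a job runner or CI script, the same headless invocation could be wrapped in Python roughly as follows. This sketch assumes npx is on the PATH and that the command writes only the JSON array to stdout. The helper names are illustrative, and run_scraper is defined but not called here because it needs a real API key and network access.

```python
import json
import os
import subprocess


def build_run_command(collector_id: str, url: str) -> list[str]:
    """Build the npx form of `bdata scraper run` used in this tutorial."""
    return ["npx", "-p", "@brightdata/cli", "bdata", "scraper", "run", collector_id, url]


def run_scraper(collector_id: str, url: str, api_key: str) -> list[dict]:
    """Run the scraper without a login step and return the parsed rows."""
    # BRIGHTDATA_API_KEY replaces the interactive browser login.
    env = {**os.environ, "BRIGHTDATA_API_KEY": api_key}
    completed = subprocess.run(
        build_run_command(collector_id, url),
        env=env,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    # Assumption: stdout holds only the JSON array, with no log lines mixed in.
    return json.loads(completed.stdout)


print(" ".join(build_run_command("c_mpohus372o5tmid1jk", "https://news.ycombinator.com")))
```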

Related

Build with the AI Agent

Build the same scraper from the Bright Data control panel instead of the terminal

Scraper Studio API quickstart

Trigger an existing scraper from cURL, Python or Node.js

Self-Healing tool

Fix a scraper with a plain-language prompt when a target site changes

Bright Data CLI overview

Every bdata command, with examples