The Web Archive API lets you access and retrieve Data Snapshots from Bright Data’s cached data collections seamlessly and efficiently.

To access this API, you will need a Bright Data API token.

To initiate a search of our Web Archive, use the following /search endpoint.
Endpoint: POST api.brightdata.com/webarchive/search

Request
POST api.brightdata.com/webarchive/search
{
    filters: {
        max_age?: Duration,
        min_date?: yyyy-mm-dd,
        max_date?: yyyy-mm-dd,
        domain_whitelist?: ['example.com'],
        domain_blacklist?: ['example.com'],
        domain_regex_whitelist?: ['.*.example..*'],
        domain_regex_blacklist?: ['.*example..*'],
        category_whitelist?: ['Automotive'],
        category_blacklist?: ['Automotive'],
        path_regex_whitelist?: ['.*/products/.*'],
        path_regex_blacklist?: ['.*/products/.*'],
        language_whitelist?: ['eng'], //ISO 639-3 letter language codes
        language_blacklist?: ['eng'],
        ip_country_whitelist?: ['us', 'ie', 'in'],
        ip_country_blacklist?: ['mx', 'ae', 'ca']
    }
}
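
As a rough illustration of the request above, the following Python sketch builds (but does not send) a search request using only the standard library. The `https://` scheme, the `Authorization: Bearer` header format, and the helper name `build_search_request` are assumptions made for this example and are not confirmed by this page.

```python
import json
import urllib.request

# Assumption: the API is served over HTTPS at this base URL.
API_BASE = "https://api.brightdata.com/webarchive"


def build_search_request(api_token: str, filters: dict) -> urllib.request.Request:
    """Build a POST /search request carrying the given filters (hypothetical helper)."""
    body = json.dumps({"filters": filters}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/search",
        data=body,
        method="POST",
        headers={
            # Assumption: the API token is sent as a Bearer token.
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
    )


req = build_search_request(
    "YOUR_API_TOKEN",
    {
        "min_date": "2025-01-01",
        "max_date": "2025-01-07",
        "domain_whitelist": ["example.com"],
        "language_whitelist": ["eng"],
    },
)
# To send the request: urllib.request.urlopen(req)
```

All filter fields are optional, so the dictionary only needs the keys relevant to your search.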

Your search cannot cover a date range of more than 7 days. If you need to query a longer period, please contact your account manager.
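
To avoid submitting a search that exceeds this limit, the date range can be checked on the client side first. This is a minimal sketch; the helper name `range_within_limit` is hypothetical, and treating the limit as a difference of at most 7 days between `min_date` and `max_date` is an assumption about how the API counts boundary days.

```python
from datetime import date, timedelta

# Assumption: the limit is a difference of at most 7 days between the two dates.
MAX_RANGE = timedelta(days=7)


def range_within_limit(min_date: str, max_date: str) -> bool:
    """Return True if the yyyy-mm-dd range spans at most MAX_RANGE."""
    start = date.fromisoformat(min_date)
    end = date.fromisoformat(max_date)
    if end < start:
        raise ValueError("max_date is earlier than min_date")
    return end - start <= MAX_RANGE


ok = range_within_limit("2025-01-01", "2025-01-07")
too_long = range_within_limit("2025-01-01", "2025-01-31")
```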

You can run 5 searches per day without triggering a dump. Once you trigger a dump for a search, that search no longer counts against your limit.

Get Search Status

Check the status of a specific search using its search_id.
Endpoint: GET api.brightdata.com/webarchive/search/<search_id>

A successful response returns:

  • The number of entries for your query
  • The estimated size and cost of the full Data Snapshot

GET api.brightdata.com/webarchive/search/<search_id>
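
A status check could be sketched in Python as follows. The `https://` scheme, the Bearer token header, and the helper names are assumptions; the response field names are not listed on this page, so the sketch only parses the JSON body without interpreting it.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the API is served over HTTPS at this base URL.
API_BASE = "https://api.brightdata.com/webarchive"


def search_status_request(api_token: str, search_id: str) -> urllib.request.Request:
    """Build a GET /search/<search_id> request (hypothetical helper)."""
    # Percent-encode the ID so it is safe to place in the URL path.
    safe_id = urllib.parse.quote(search_id, safe="")
    return urllib.request.Request(
        f"{API_BASE}/search/{safe_id}",
        method="GET",
        # Assumption: the API token is sent as a Bearer token.
        headers={"Authorization": f"Bearer {api_token}"},
    )


def fetch_json(request: urllib.request.Request):
    """Send the request and return the decoded JSON body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


status_req = search_status_request("YOUR_API_TOKEN", "abc123")
# To send the request: status = fetch_json(status_req)
```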

Get All Search Statuses

Check the status of all current searches.
Endpoint: GET api.brightdata.com/webarchive/searches

GET api.brightdata.com/webarchive/searches

Deliver Snapshot to Amazon S3 Storage

To use S3 storage delivery, you will first need to do the following:

  • Create an AWS role that gives Bright Data access to your system.
    • During this setup, Amazon will ask you for an “external ID” to use with the role.
    • Your external ID for S3 is your Bright Data Account ID, which can be found in Account Settings.
  • Once the role is created, allow our system delivery role to assume it (AssumeRole).
    • Our system delivery role is: arn:aws:iam::422310177405:role/brd.ec2.zs-dca-delivery
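
The role setup above corresponds to an IAM trust policy on your role. The sketch below generates one in Python using the standard AWS trust-policy layout; the helper name `build_trust_policy` and the placeholder account ID are illustrative. The role would also need a permissions policy granting write access to your target bucket, which is not covered here.

```python
import json

# Bright Data's system delivery role, as given above.
BRIGHT_DATA_DELIVERY_ROLE = "arn:aws:iam::422310177405:role/brd.ec2.zs-dca-delivery"


def build_trust_policy(account_id: str) -> dict:
    """Return a trust policy letting the delivery role assume your role (hypothetical helper)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": BRIGHT_DATA_DELIVERY_ROLE},
                "Action": "sts:AssumeRole",
                # The external ID is your Bright Data Account ID.
                "Condition": {"StringEquals": {"sts:ExternalId": account_id}},
            }
        ],
    }


policy = build_trust_policy("YOUR_BRIGHT_DATA_ACCOUNT_ID")
policy_json = json.dumps(policy, indent=2)  # paste into the role's trust relationship
```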

To deliver a specific Snapshot from a specific search_id to S3 storage, use the following /dump endpoint.
Endpoint: POST api.brightdata.com/webarchive/dump

Collect Snapshot via Webhook

Collect a Data Snapshot via webhook from a specific search_id.
Endpoint: POST api.brightdata.com/webarchive/dump

{
    search_id: <search_id>,
    max_entries?: 1000000,
    delivery: {
        strategy: 'webhook',
        settings: {
            url: string(),
            auth?: string() // will be added as an Authorization header
        }
    }
}
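
The request body above could be assembled in Python as shown in this sketch. The helper name `build_webhook_dump_payload` and the example values are hypothetical; optional fields are left out of the payload when they are not provided.

```python
import json
from typing import Optional


def build_webhook_dump_payload(
    search_id: str,
    url: str,
    auth: Optional[str] = None,
    max_entries: Optional[int] = None,
) -> dict:
    """Build the body for POST /dump with webhook delivery (hypothetical helper)."""
    settings = {"url": url}
    if auth is not None:
        # Sent by Bright Data as the Authorization header on the webhook call.
        settings["auth"] = auth
    payload = {
        "search_id": search_id,
        "delivery": {"strategy": "webhook", "settings": settings},
    }
    if max_entries is not None:
        payload["max_entries"] = max_entries
    return payload


payload = build_webhook_dump_payload(
    "abc123",
    "https://example.com/hooks/webarchive",  # placeholder receiver URL
    auth="Bearer my-webhook-secret",         # placeholder credential
    max_entries=1000,
)
body = json.dumps(payload)  # request body for POST /webarchive/dump
```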

Get Status of Data Snapshot

Check the status of a specific Data Snapshot (dump) using the dump_id.
Endpoint: GET api.brightdata.com/webarchive/dump/<dump_id>

GET api.brightdata.com/webarchive/dump/<dump_id>

Get Status of All Data Snapshots

Endpoint: GET api.brightdata.com/webarchive/dumps

200 OK
[
    {
        dump_id: 'ID',
        status: 'in_progress',
        batches_total: 130,
        batches_uploaded: 29,
        files_total: 1241241251,
        estimate_finish: Date
    },
    {
        dump_id: 'ID',
        status: 'done',
        batches_total: 130,
        files_total: 1241241251,
        files_uploaded: 2412515,
        completed_at: Date
    }
    // ... rest of the dumps
]
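
A response like the one above can be reduced to a per-dump progress figure. This sketch assumes the field names shown in the example response and that `in_progress` and `done` are the statuses of interest; other status values may exist and are treated here like `in_progress`. The dump IDs in the sample data are placeholders.

```python
def batch_progress(dump: dict) -> float:
    """Return upload progress for one dump as a fraction between 0.0 and 1.0."""
    if dump.get("status") == "done":
        return 1.0
    total = dump.get("batches_total") or 0
    uploaded = dump.get("batches_uploaded", 0)
    # Guard against a missing or zero total to avoid dividing by zero.
    return uploaded / total if total else 0.0


# Sample data mirroring the example response (IDs are placeholders).
dumps = [
    {"dump_id": "dump-a", "status": "in_progress",
     "batches_total": 130, "batches_uploaded": 29},
    {"dump_id": "dump-b", "status": "done", "batches_total": 130},
]

for dump in dumps:
    print(f"{dump['dump_id']}: {batch_progress(dump):.0%} ({dump['status']})")
```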

High-level process flow diagram