Learn how to use our Archive API for accessing and retrieving data snapshots from Bright Data’s cache, with delivery options to Amazon S3 or via webhook.
Our Archive API lets you access and retrieve Data Snapshots from Bright Data’s cached data collections.
To access this API, you will need a Bright Data API key.
To initiate a search of our Archive, use the following /search endpoint.
Endpoint: POST api.brightdata.com/webarchive/search
Here is a brief explanation of each parameter you can use in your requests:
Parameter | Description |
---|---|
max_age | Limits results to records collected within a specified time range. |
min_date | Returns records collected on or after the specified date. |
max_date | Returns records collected on or before the specified date. |
domain_whitelist | Includes results only from listed domains. |
domain_blacklist | Excludes results from listed domains. |
category_whitelist | Includes results only from specified categories. |
category_blacklist | Excludes results from specified categories. |
path_regex_whitelist | Includes results only matching the specified path regex. |
path_regex_blacklist | Excludes results matching the specified path regex. |
language_whitelist | Includes results only for specific language codes (ISO 639-3). |
language_blacklist | Excludes results for specific language codes. |
ip_country_whitelist | Includes results collected through IPs or peers only from specified countries. |
ip_country_blacklist | Excludes results collected through IPs or peers from specified countries. |
captcha | Returns only results where a captcha was triggered. |
robots_block | Returns only results with a robots block. |
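As a rough illustration of how the parameters above might be combined, the following sketch builds a search request without sending it. The `filters` wrapper in the JSON body, the `Bearer` authorization header, and the example values are assumptions, not confirmed by this page; check the request schema before relying on them.

```python
import json
import urllib.request

SEARCH_URL = "https://api.brightdata.com/webarchive/search"


def build_search_request(api_key, filters):
    """Build (but do not send) a POST request for the /search endpoint."""
    # Assumption: parameters are sent as JSON under a "filters" key.
    body = json.dumps({"filters": filters}).encode("utf-8")
    return urllib.request.Request(
        SEARCH_URL,
        data=body,
        method="POST",
        headers={
            # Assumption: the API key is passed as a Bearer token.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


request = build_search_request(
    "YOUR_API_KEY",
    {
        "max_age": "1d",
        "domain_whitelist": ["example.com"],
        "language_whitelist": ["eng"],  # ISO 639-3 code
    },
)
# To actually run the search: urllib.request.urlopen(request)
```

Building the request separately from sending it makes it easy to inspect the body before it counts against your daily search limit.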
You can run up to 100 searches per day without triggering a dump. Once you trigger a dump, that search no longer counts against your limit.
Check the status of a specific search that was made.
Endpoint: GET api.brightdata.com/webarchive/search/<search_id>
When successful, it will retrieve:
The status code of all three of the following responses is 200 OK.
Check the status of all current searches.
Endpoint: GET api.brightdata.com/webarchive/searches
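The two status endpoints above differ only in their path, so one helper can cover both. This is a minimal sketch: the function name `build_status_request` is hypothetical, and the `Bearer` authorization header is an assumption. The response body is not parsed because its fields are not described here.

```python
import urllib.parse
import urllib.request

BASE_URL = "https://api.brightdata.com/webarchive"


def build_status_request(api_key, search_id=None):
    """Build a GET request for one search, or for all searches if no ID is given."""
    if search_id is None:
        path = "searches"
    else:
        # Percent-encode the ID so unexpected characters cannot alter the path.
        path = "search/" + urllib.parse.quote(search_id, safe="")
    return urllib.request.Request(
        f"{BASE_URL}/{path}",
        method="GET",
        # Assumption: the API key is passed as a Bearer token.
        headers={"Authorization": f"Bearer {api_key}"},
    )


single = build_status_request("YOUR_API_KEY", "my search id")
everything = build_status_request("YOUR_API_KEY")
# To send either request: urllib.request.urlopen(single)
```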
If your query matches only data collected within the last 72 hours, your snapshot will start processing and delivering immediately.
If some of your matched data is older than 72 hours, it must be retrieved from a colder archive before delivery, which may take up to 72 hours.
We recommend using max_age = 1d for initial testing.
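The 72-hour rule above can be expressed as a small check on the age of the oldest matched record. This is only an illustration of the rule as stated: the function name and return strings are made up for the example, and treating exactly 72 hours as still "within" the window is an assumption.

```python
from datetime import datetime, timedelta, timezone

HOT_WINDOW = timedelta(hours=72)


def delivery_expectation(oldest_record, now):
    """Describe the expected delivery timing from the oldest matched record."""
    if now - oldest_record <= HOT_WINDOW:
        return "starts immediately"
    # Older data must first be fetched from the colder archive.
    return "may take up to 72h"


now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
recent = delivery_expectation(now - timedelta(days=1), now)   # e.g. max_age = 1d
older = delivery_expectation(now - timedelta(days=5), now)
```

A search limited with max_age = 1d always falls in the first case, which is why it suits initial testing.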
To use S3 storage delivery, you will first need to do the following:
AssumeRole
that role.
arn:aws:iam::422310177405:role/brd.ec2.zs-dca-delivery
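In AWS, permission to assume a role is granted through the role's trust policy. The sketch below generates such a policy, assuming the ARN listed above is the principal that must be allowed to assume the role you created. Any extra conditions your setup may require are not covered.

```python
import json

BRIGHT_DATA_DELIVERY_ROLE = "arn:aws:iam::422310177405:role/brd.ec2.zs-dca-delivery"


def build_trust_policy(principal_arn):
    """Return an IAM trust policy that lets principal_arn assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": "sts:AssumeRole",
            }
        ],
    }


policy = build_trust_policy(BRIGHT_DATA_DELIVERY_ROLE)
print(json.dumps(policy, indent=2))  # paste the output into the role's trust relationship
```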
To deliver a specific Snapshot from a specific search_id to S3 storage, use the following /dump endpoint.
Endpoint: POST api.brightdata.com/webarchive/dump
Collect a Data Snapshot via webhook from a specific search_id.
Endpoint: POST api.brightdata.com/webarchive/dump
If you’re running a Linux or macOS machine, you can simulate one of our delivery webhooks with the code on this page.
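As a stand-in for testing your own handling code, the following sketch starts a local HTTP receiver and posts a sample body to it. It is not Bright Data's simulation code: the `/webhook` path and the payload are placeholders, since the real delivery format is not shown here.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

received = []  # raw request bodies captured by the handler


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        received.append(self.rfile.read(length))
        self.send_response(200)
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the console quiet


def simulate_delivery(payload):
    """Start a local receiver, POST payload to it, and return the HTTP status."""
    server = HTTPServer(("127.0.0.1", 0), WebhookHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        request = urllib.request.Request(
            f"http://127.0.0.1:{server.server_port}/webhook",
            data=json.dumps(payload).encode("utf-8"),
            method="POST",
            headers={"Content-Type": "application/json"},
        )
        # Bypass any configured proxy so the request stays on this machine.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(request, timeout=5) as response:
            return response.status
    finally:
        server.shutdown()
        server.server_close()


status = simulate_delivery({"example": "placeholder payload"})
```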
Check the status of a specific Data Snapshot (dump) using the dump_id.
Endpoint: GET api.brightdata.com/webarchive/dump/<dump_id>
The status code of all three following responses is 200 OK
Endpoint: GET api.brightdata.com/webarchive/dumps