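The Web Archive API gives you smooth, efficient access to the cached data collected by Bright Data and lets you retrieve data snapshots from it.
To access this API, you need a Bright Data API token (API key).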
Run a search request
To start searching our web archive, use the /search endpoint below.
Endpoint: POST api.brightdata.com/webarchive/search
POST api.brightdata.com/webarchive/search
{
filters: {
max_age?: Duration,
min_date?: yyyy-mm-dd,
max_date?: yyyy-mm-dd,
domain_whitelist?: ['example.com'],
domain_blacklist?: ['example.com'],
domain_regex_whitelist?: ['.*example..*'],
domain_regex_blacklist?: ['.*example..*'],
category_whitelist?: ['Automotive'],
category_blacklist?: ['Automotive'],
path_regex_whitelist?: ['.*/products/.*'],
path_regex_blacklist?: ['.*/products/.*'],
language_whitelist?: ['eng'], //ISO 639-3 letter language codes
language_blacklist?: ['eng'],
ip_country_whitelist?: ['us', 'ie', 'in'],
ip_country_blacklist?: ['mx', 'ae', 'ca']
}
}
curl -X POST https://api.brightdata.com/webarchive/search \
-H "Authorization: Bearer $API_KEY" \
-H 'Content-Type: application/json' \
--data '{"filters": {"max_age": "1d", "domain_whitelist": ["example.com"]}}'
The table below briefly explains each parameter you can use in a request:
Parameter | Description |
---|---|
max_age | Limits results to records collected within a specified time range. |
min_date | Returns records collected on or after the specified date. |
max_date | Returns records collected on or before the specified date. |
domain_whitelist | Includes results only from listed domains. |
domain_blacklist | Excludes results from listed domains. |
category_whitelist | Includes results from specified categories. |
category_blacklist | Excludes results from specified categories. |
path_regex_whitelist | Includes results matching the specified path regex. |
path_regex_blacklist | Excludes results matching the specified path regex. |
language_whitelist | Includes results for specific language codes (ISO 639-3). |
language_blacklist | Excludes results for specific language codes. |
ip_country_whitelist | Includes results collected through IPs or peers from specified countries. |
ip_country_blacklist | Excludes results collected through IPs or peers from specified countries. |
A search cannot cover a date range of more than 7 days. If you need to query a longer period, please contact your account manager.
You can run 5 searches per day without triggering a dump.
Once you trigger a dump for a search, that search no longer counts against your limit.
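The 7-day limit above can be checked on the client before a request is sent. The following is a minimal sketch, not an official client: the helper name `build_search_body` is hypothetical, and how the API counts the boundary days is an assumption (the sketch uses the plain difference between the two dates).

```python
import json
from datetime import date

MAX_RANGE_DAYS = 7  # limit stated above for a single search


def build_search_body(min_date: date, max_date: date, **filters) -> dict:
    """Return a /search request body, rejecting date ranges over the limit.

    Hypothetical helper: boundary-day counting is an assumption.
    """
    if max_date < min_date:
        raise ValueError("max_date must not be earlier than min_date")
    span_days = (max_date - min_date).days
    if span_days > MAX_RANGE_DAYS:
        raise ValueError(
            f"date range of {span_days} days exceeds {MAX_RANGE_DAYS} days"
        )
    body = {"min_date": min_date.isoformat(), "max_date": max_date.isoformat()}
    body.update(filters)  # any other filter, e.g. domain_whitelist=[...]
    return {"filters": body}


body = build_search_body(
    date(2024, 1, 1), date(2024, 1, 7), domain_whitelist=["example.com"]
)
print(json.dumps(body, indent=2))
```

The resulting dictionary can be serialized with `json.dumps` and sent as the body of the POST request shown in the curl example.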
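Get search status
Check the status of a specific search you have run.
Endpoint: GET api.brightdata.com/webarchive/search/<search_id>
On success, the call returns:
- The number of entries matching your query
- An estimate of the size and cost of the full data snapshot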
GET api.brightdata.com/webarchive/search/<search_id>
Every response returns status code 200 OK, whatever the state of the search. A search that is still running returns:
{
search_id: "ID",
status: "in_progress"
}
curl https://api.brightdata.com/webarchive/search/$SEARCH_ID \
-H "Authorization: Bearer $API_KEY"
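Because a search starts out as "in_progress", a client usually has to poll this endpoint. The sketch below shows one way to do that with the standard library. The function names, the polling interval, and the poll limit are assumptions for illustration; the only status values taken from this page are "in_progress" and "done".

```python
import json
import time
import urllib.request

BASE_URL = "https://api.brightdata.com/webarchive"


def fetch_search(search_id: str, api_key: str) -> dict:
    """GET the status document for one search."""
    req = urllib.request.Request(
        f"{BASE_URL}/search/{search_id}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def wait_for_search(search_id, api_key, fetch=fetch_search,
                    interval=30.0, max_polls=20):
    """Poll until the status is no longer "in_progress".

    `fetch` is injectable so the loop can be exercised without network access.
    Returns the last response, or raises TimeoutError after `max_polls` tries.
    """
    for _ in range(max_polls):
        result = fetch(search_id, api_key)
        if result.get("status") != "in_progress":
            return result
        time.sleep(interval)
    raise TimeoutError(
        f"search {search_id} still in progress after {max_polls} polls"
    )
```

Calling `wait_for_search(search_id, api_key)` blocks until the search leaves the "in_progress" state; a shorter `interval` may be reasonable for small queries.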
Get the status of all searches
Check the status of all your current searches.
Endpoint: GET api.brightdata.com/webarchive/searches
GET api.brightdata.com/webarchive/searches
[
{
search_id: "ID",
status: "in_progress"
},
{
search_id: "ID",
status: "done"
},
// ... + rest of the searches and status
]
curl https://api.brightdata.com/webarchive/searches \
-H "Authorization: Bearer $API_KEY"
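Once the list has been fetched, it is often convenient to group the search IDs by status, for example to find the searches that are ready for a dump. A minimal sketch, assuming the response has already been parsed into a list of dictionaries shaped like the example above (the IDs `s1` to `s3` are made up):

```python
from collections import defaultdict


def group_ids_by_status(searches):
    """Map each status value to the list of search IDs that have it."""
    grouped = defaultdict(list)
    for item in searches:
        grouped[item["status"]].append(item["search_id"])
    return dict(grouped)


# Hypothetical parsed response from GET /webarchive/searches.
sample = [
    {"search_id": "s1", "status": "in_progress"},
    {"search_id": "s2", "status": "done"},
    {"search_id": "s3", "status": "done"},
]
print(group_ids_by_status(sample))
```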
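Deliver a snapshot to Amazon S3 Storage
To have data delivered to your S3 storage, you first need to complete the required setup, including creating the role referenced as role_arn in the request below.
To deliver a specific snapshot from a given search_id to S3 storage, use the /dump endpoint below.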
Endpoint: POST api.brightdata.com/webarchive/dump
POST api.brightdata.com/webarchive/dump
{
search_id: <search_id>,
max_entries?: 1000000, // (optional) limit how many files you purchase
delivery: {
strategy: 's3',
settings: {
bucket: <your_bucket_name>,
assume_role: {
role_arn: <role_you_created_above>,
},
},
},
}
curl -X POST https://api.brightdata.com/webarchive/dump \
-H "Authorization: Bearer $API_KEY" \
-H 'Content-Type: application/json' \
--data @- <<EOF
{
"search_id": "$SEARCH_ID",
"max_entries": 1000000,
"delivery": {
"strategy": "s3",
"settings": {
"bucket": "$YOUR_BUCKET_NAME",
"assume_role": {
"role_arn": "$YOUR_DELIVERY_ROLE"
}
}
}
}
EOF
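The same request can be assembled in Python. The sketch below only builds the request object and does not send it; the function name and the sample bucket and role ARN are hypothetical, and the `arn:` prefix check is a basic sanity test rather than full ARN validation.

```python
import json
import urllib.request

DUMP_URL = "https://api.brightdata.com/webarchive/dump"


def build_s3_dump_request(api_key, search_id, bucket, role_arn,
                          max_entries=None):
    """Build (but do not send) the POST request for an S3 delivery."""
    if not role_arn.startswith("arn:"):
        raise ValueError(f"role_arn does not look like an ARN: {role_arn!r}")
    body = {
        "search_id": search_id,
        "delivery": {
            "strategy": "s3",
            "settings": {
                "bucket": bucket,
                "assume_role": {"role_arn": role_arn},
            },
        },
    }
    if max_entries is not None:  # optional cap on purchased files
        body["max_entries"] = max_entries
    return urllib.request.Request(
        DUMP_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values for illustration only.
req = build_s3_dump_request(
    "API_KEY", "SEARCH_ID", "my-bucket",
    "arn:aws:iam::123456789012:role/my-delivery-role",
    max_entries=1000,
)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would submit the dump; keeping construction separate makes the payload easy to inspect before any files are purchased.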
Collect a snapshot via webhook
Collect a data snapshot from a specific search_id via webhook.
Endpoint: POST api.brightdata.com/webarchive/dump
{
search_id: <search_id>,
max_entries?: 1000000,
delivery: {
strategy: 'webhook',
settings: {
url: string(),
auth?: string(), // will be added as an Authorization header
},
}
}
curl -X POST https://api.brightdata.com/webarchive/dump \
-H "Authorization: Bearer $API_KEY" \
-H 'Content-Type: application/json' \
--data @- <<EOF
{
"search_id": "$SEARCH_ID",
"max_entries": 1000000,
"delivery": {
"strategy": "webhook",
"settings": {
"url": "$YOUR_WEBHOOK_URL"
}
}
}
EOF
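The webhook settings take a required `url` and an optional `auth` value. The following sketch builds the request body and leaves out the optional fields when they are not supplied. The helper name is hypothetical, and accepting both `http` and `https` URLs is an assumption, since this page does not state which schemes are allowed.

```python
from urllib.parse import urlparse


def build_webhook_dump_body(search_id, url, auth=None, max_entries=None):
    """Return the JSON body for a webhook delivery of `search_id`."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"not an absolute http(s) URL: {url!r}")
    settings = {"url": url}
    if auth is not None:
        # Per the schema above, this is added as an Authorization header.
        settings["auth"] = auth
    body = {
        "search_id": search_id,
        "delivery": {"strategy": "webhook", "settings": settings},
    }
    if max_entries is not None:
        body["max_entries"] = max_entries
    return body


# Placeholder values for illustration only.
webhook_body = build_webhook_dump_body(
    "SEARCH_ID", "https://example.com/hook", auth="Bearer my-secret"
)
print(webhook_body)
```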
Get the status of a data snapshot
Use the dump_id to check the status of a specific data snapshot (dump).
Endpoint: GET api.brightdata.com/webarchive/dump/<dump_id>
GET api.brightdata.com/webarchive/dump/<dump_id>
Every response returns status code 200 OK, whatever the state of the dump. A dump that is still in progress returns:
{
dump_id: <id>,
status: 'in_progress',
batches_total: 130,
batches_uploaded: 29,
files_total: 1241241251,
estimate_finish: ISODate
}
curl https://api.brightdata.com/webarchive/dump/$DUMP_ID \
-H "Authorization: Bearer $API_KEY"
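The `batches_uploaded` and `batches_total` fields are enough to derive a rough progress figure. A small sketch, assuming the response has been parsed into a dictionary and that a missing `batches_uploaded` means nothing has been uploaded yet (the function name is hypothetical):

```python
def dump_progress(status: dict) -> float:
    """Return the fraction of batches uploaded so far, between 0.0 and 1.0."""
    total = status.get("batches_total") or 0
    uploaded = status.get("batches_uploaded") or 0
    if total == 0:
        return 0.0  # avoid dividing by zero before batches are planned
    return min(uploaded / total, 1.0)


# Values copied from the example response above.
sample_status = {
    "status": "in_progress",
    "batches_total": 130,
    "batches_uploaded": 29,
}
print(f"{dump_progress(sample_status):.1%}")  # prints 22.3%
```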
Get the status of all data snapshots
Endpoint: GET api.brightdata.com/webarchive/dumps
[
{
dump_id: 'ID',
status: 'in_progress',
batches_total: 130,
batches_uploaded: 29,
files_total: 1241241251,
estimate_finish: Date
},
{
dump_id: 'ID',
status: 'done',
batches_total: 130,
files_total: 1241241251,
files_uploaded: 2412515,
completed_at: Date
}
// ... rest of the dumps
]
curl https://api.brightdata.com/webarchive/dumps \
-H "Authorization: Bearer $API_KEY"
High-level process flow diagram