This update provides a more granular and streamlined way to request and manage your data collections, enabling more effective dataset generation tailored to your specific needs.

Understanding When to Use Each API:

The three primary API endpoints serve distinct purposes in the data collection workflow, providing a structured and efficient process for obtaining tailored datasets.

Initial Collection Without a Customer-Defined View:

Requesting a Collection:

Endpoint: POST https://api.brightdata.com/datasets/request_collection

Parameters:

dataset_id
string
required

Dataset ID

type
string
required

discover_new OR url_collection

inputs
array

Array - json

file
multipart

multipart - csv

Example
curl "https://api.brightdata.com/datasets/request_collection?dataset_id=gd_l1viktl72bvl7bjuj0&type=discover_new" \
    -H "Authorization: Bearer API_TOKEN" \
    -H "Content-Type: application/json" \
    -k -d '[{"id":"user-id"}]' 

Processing may take several minutes, depending on the number of inputs. When you request a discovery (type 'discover_new'), finding all links (PDPs) may take time.
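As a sketch, the request above can be issued from Python using only the standard library. The helper names are ours, and API_TOKEN is a placeholder for your token:

```python
import json
import urllib.parse
import urllib.request

API_TOKEN = "YOUR_API_TOKEN"  # placeholder: your Bright Data API token

def build_request_collection_url(dataset_id: str, collection_type: str) -> str:
    """Build the request_collection URL; type must be one of the two documented values."""
    if collection_type not in ("discover_new", "url_collection"):
        raise ValueError("type must be 'discover_new' or 'url_collection'")
    query = urllib.parse.urlencode({"dataset_id": dataset_id, "type": collection_type})
    return f"https://api.brightdata.com/datasets/request_collection?{query}"

def request_collection(dataset_id: str, collection_type: str, inputs: list) -> dict:
    """POST the inputs as a JSON array and return the parsed JSON response."""
    req = urllib.request.Request(
        build_request_collection_url(dataset_id, collection_type),
        data=json.dumps(inputs).encode(),
        headers={"Authorization": f"Bearer {API_TOKEN}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

For example, `request_collection("gd_l1viktl72bvl7bjuj0", "discover_new", [{"id": "user-id"}])` mirrors the curl call above.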


Checking Status of the Collection Above:

Endpoint: GET https://api.brightdata.com/datasets/request_collection

Parameters:

request_id
string
required

The request ID returned by the previous API call.

freshness_ms
string
required

Sets data freshness.

If the data was collected within this period (e.g., you requested 1 week and it was collected 5 days ago), no new scrape occurs. If the data is not fresh, we scrape it now.

  • 1 week: 604,800,000 ms
  • 1 month: 2,592,000,000 ms
Example
curl -k "https://api.brightdata.com/datasets/request_collection?request_id=REQUEST_ID&freshness_ms=2592000000" \
    -H "Authorization: Bearer API_TOKEN" 
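The millisecond values above follow directly from days × 24 × 60 × 60 × 1000; a small Python helper (ours, not part of the API) makes the conversion explicit:

```python
def freshness_ms(days: int) -> int:
    """Convert a freshness window in days to the milliseconds expected by freshness_ms."""
    return days * 24 * 60 * 60 * 1000

# The documented values correspond to 7 and 30 days:
# freshness_ms(7)  -> 604800000   (1 week)
# freshness_ms(30) -> 2592000000  (1 month)
```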

Response Indicating Number of Records and Freshness Found:

{
    "dataset_id": "gd_l1viktl72bvl7bjuj0",
    "total_lines": 100,
    "fresh_count": 30,
    "name": "linkedin_companies custom input",
    "status": "done",
    "request_id": "XXXX"
}

The request is still running:

{
    "total_lines": 100,
    "status": "running"
}

Issue with one (or more) inputs: in this case, the url field was sent as URL (field names are case-sensitive).

{
    "request_id": "xxxx",  
    "error": "Validation failed",
    "error_code": "validation",
    "validation_errors": [
        {
            "line": "{\"URL\":\"https://www.tiktok.com/search?q=tjd\"}",
            "index": 1,
            "errors": [
                ["url", "Required field"]
            ]
        }
    ]
}
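A minimal sketch of interpreting the three response shapes above (the field names follow the examples; the function itself is ours, not part of the API):

```python
def summarize_status(response: dict) -> str:
    """Reduce a status-check response to a one-line human-readable summary."""
    if response.get("error_code") == "validation":
        # Validation failures list the offending inputs under validation_errors.
        bad = [err["index"] for err in response.get("validation_errors", [])]
        return f"validation failed for input index(es) {bad}"
    status = response.get("status", "unknown")
    if status == "done":
        return (f"done: {response.get('fresh_count', 0)} of "
                f"{response.get('total_lines', 0)} records are fresh")
    return f"status: {status}"
```

Applied to the "running" example above, this returns "status: running"; applied to the "done" example, "done: 30 of 100 records are fresh".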

Initiating a Collection:

Endpoint: POST https://api.brightdata.com/datasets/initiate_collection

Parameters:

request_id
string
required

The unique identifier of the collection request to initiate (returned by the request_collection API).

freshness_ms
string
required

The time in milliseconds indicating the desired data freshness.

Example
curl -X POST -k "https://api.brightdata.com/datasets/initiate_collection" \
    -d '{"request_id":"j_ln2x567b2961de0d1x","freshness_ms":2592000000}' \
    -H "Authorization: Bearer API_TOKEN" -H "content-type: application/json"
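In Python with only the standard library, the same call can be sketched as follows (the request ID below is the placeholder from the curl example; the helper names are ours):

```python
import json
import urllib.request

INITIATE_URL = "https://api.brightdata.com/datasets/initiate_collection"

def initiate_body(request_id: str, freshness_ms: int) -> bytes:
    """Serialize the JSON body expected by initiate_collection."""
    return json.dumps({"request_id": request_id, "freshness_ms": freshness_ms}).encode()

def initiate_collection(api_token: str, request_id: str, freshness_ms: int) -> dict:
    """POST the initiate_collection request and return the parsed response."""
    req = urllib.request.Request(
        INITIATE_URL,
        data=initiate_body(request_id, freshness_ms),
        headers={"Authorization": f"Bearer {api_token}",
                 "content-type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```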

Collection After Defining a View:


Initiating a Collection:

Endpoint: POST https://api.brightdata.com/datasets/initiate

Parameters:

dataset_id
string
required

Dataset ID

view
string
required

View ID

type
string
required

discover_new OR url_collection

inputs
array

Array - json

file
multipart

multipart - csv

Example
curl "https://api.brightdata.com/datasets/initiate?dataset_id=XXX_DATASET_ID&type=url_collection&view=XXX_VIEW_ID" \
    -H "Authorization: Bearer API_TOKEN" \
    -H "Content-Type: application/json" \
    -k -d '[{"id":"user-id"}]' 

The dataset will be delivered according to the delivery settings configured for this view.
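As a sketch, the view-based initiate call can also be issued from Python with only the standard library (the dataset and view IDs are the placeholders from the example, and the helper names are ours):

```python
import json
import urllib.parse
import urllib.request

def initiate_with_view_url(dataset_id: str, view: str, collection_type: str) -> str:
    """Build the /datasets/initiate URL with dataset_id, type, and view parameters."""
    query = urllib.parse.urlencode(
        {"dataset_id": dataset_id, "type": collection_type, "view": view})
    return f"https://api.brightdata.com/datasets/initiate?{query}"

def initiate_with_view(api_token: str, dataset_id: str, view: str,
                       collection_type: str, inputs: list) -> dict:
    """POST the inputs; delivery follows the settings configured for the view."""
    req = urllib.request.Request(
        initiate_with_view_url(dataset_id, view, collection_type),
        data=json.dumps(inputs).encode(),
        headers={"Authorization": f"Bearer {api_token}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```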

By leveraging these enhanced capabilities, users can now tailor their data collection processes more efficiently, ensuring that the datasets generated are aligned with their project requirements.


How to Retrieve the Results of a Snapshot That Was Already Collected:

curl "https://api.brightdata.com/datasets/snapshot/snapshot_id/download" \
    -H "Authorization: Bearer API_TOKEN"
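A Python equivalent of the download call, using only the standard library (the helper names are ours; the snapshot ID is whatever your collection produced):

```python
import urllib.request

def snapshot_download_url(snapshot_id: str) -> str:
    """Build the download URL for an already-collected snapshot."""
    return f"https://api.brightdata.com/datasets/snapshot/{snapshot_id}/download"

def download_snapshot(snapshot_id: str, api_token: str) -> bytes:
    """Fetch the snapshot contents as raw bytes."""
    req = urllib.request.Request(
        snapshot_download_url(snapshot_id),
        headers={"Authorization": f"Bearer {api_token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```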