Schedule a daily scrape of an Amazon SKU list with the Bright Data Amazon Scraper API, delivered directly to an S3 bucket, orchestrated by a GitHub Actions cron workflow.
You track prices and stock on a list of Amazon SKUs and you need fresh numbers every morning before your team starts work. You don’t want to run a server just for this. You don’t want to babysit a cron job on a laptop.

In this tutorial we’ll build exactly that pipeline. You’ll commit a SKU list to a GitHub repo, write a small Python script that triggers the Bright Data Amazon Scraper API against it, wrap the script in a GitHub Actions workflow that runs on a daily cron, and configure Bright Data to deliver the results directly to an S3 bucket. Each morning a fresh JSON file lands in S3, keyed by snapshot ID, ready for your BI pipeline to pick up.

No servers, no webhook handlers, no glue code. Just a workflow file, a script and a delivery config.
A skus.json file listing the Amazon product URLs to monitor
A Python script that POSTs the SKU list to the Bright Data Amazon Scraper API
A GitHub Actions workflow that runs the script on a daily schedule
Bright Data configured to deliver each snapshot to your S3 bucket
By the end, you’ll see a new JSON file in S3 every day, each containing fresh price, rating and availability data for every SKU on the list.

Estimated time: 30 minutes.
An S3 bucket with Bright Data delivery already configured. Follow Amazon to S3 delivery once, then come back. This tutorial assumes the delivery destination is already saved in your Amazon scraper’s settings.
These are three real product URLs (AirPods Pro 2, Sony WH-1000XM4 headphones and iPad Air M1). Swap them for your own SKUs later.
Keeping the SKU list in the repo means every edit is version-controlled and every change ships through normal pull-request review. Large SKU lists can live in a CSV loaded by the script instead — we’ll mention that in “Next steps.”
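If a list does outgrow JSON, the loader is only a few lines. Here is a sketch, assuming a hypothetical skus.csv with a single `url` header column (the file name and column name are illustrative, not part of this tutorial's setup):

```python
import csv
from typing import IO, Iterable


def read_urls(lines: Iterable[str]) -> list[str]:
    """Extract non-empty product URLs from CSV rows with a 'url' header."""
    return [row["url"] for row in csv.DictReader(lines) if row.get("url", "").strip()]


def load_urls(path: str) -> list[str]:
    """Read product URLs from a CSV file on disk."""
    with open(path, newline="") as f:
        return read_urls(f)
```

The trigger script would then build its payload the same way as before, just from a different source: `payload = [{"url": u} for u in load_urls("skus.csv")]`.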
trigger_scrape.py

```python
import json
import os
import sys

import requests

DATASET_ID = "gd_l7q7dkf244hwjntr0"  # Amazon products by URL
TRIGGER_URL = "https://api.brightdata.com/datasets/v3/trigger"


def main() -> int:
    api_key = os.environ.get("BRIGHT_DATA_API_KEY")
    if not api_key:
        print("BRIGHT_DATA_API_KEY environment variable is not set.")
        return 1

    with open("skus.json") as f:
        urls = json.load(f)

    payload = [{"url": url} for url in urls]

    response = requests.post(
        TRIGGER_URL,
        params={"dataset_id": DATASET_ID, "format": "json"},
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        json=payload,
        timeout=30,
    )
    response.raise_for_status()

    snapshot_id = response.json().get("snapshot_id")
    print(f"Triggered scrape for {len(urls)} SKUs. Snapshot: {snapshot_id}")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```
Then create requirements.txt:
```
requests==2.32.3
```
Two things worth noticing:
The script does not wait for results. It fires the trigger and exits. Bright Data runs the scrape asynchronously and delivers the results directly to S3 via the delivery config you saved in your scraper settings. That’s the whole point: the script is a cheap, stateless trigger.
The API key comes from an environment variable. Never commit keys to a repo. We’ll wire this to GitHub Actions Secrets in Part 4.
The price field is final_price, and it can be null for products that are out of stock or currency-ambiguous listings. Your BI pipeline should handle that case explicitly rather than crashing on a missing key.
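A small normalization helper keeps that decision in one place. This is a sketch; the example records below are illustrative (the `asin` field and values are assumptions), and only the `final_price` behavior reflects the point above:

```python
from typing import Optional


def parse_price(record: dict) -> Optional[float]:
    """Return final_price as a float, or None when it is missing or null."""
    price = record.get("final_price")
    if price is None:
        return None
    return float(price)


# Illustrative records: one priced, one out of stock (final_price null).
records = [
    {"asin": "B0EXAMPLE1", "final_price": 189.99},
    {"asin": "B0EXAMPLE2", "final_price": None},
]
prices = [parse_price(r) for r in records]
print(prices)  # [189.99, None]
```

Downstream code can then branch on `None` explicitly instead of raising `KeyError` mid-pipeline.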
Notice that the file is keyed by snapshot_id, not by date. That’s deliberate: each snapshot is immutable, and you can walk the bucket chronologically by listing creation timestamps or by enabling versioning. We’ll discuss naming conventions in “Next steps.”
Now let’s move the trigger off your laptop and onto a daily schedule. Create .github/workflows/daily-scrape.yml:
.github/workflows/daily-scrape.yml
```yaml
name: Daily Amazon price scrape

on:
  schedule:
    - cron: "0 6 * * *"  # 06:00 UTC every day
  workflow_dispatch:     # Allows manual runs from the Actions tab

jobs:
  trigger:
    runs-on: ubuntu-latest
    steps:
      - name: Check out the repo
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Trigger Bright Data scrape
        env:
          BRIGHT_DATA_API_KEY: ${{ secrets.BRIGHT_DATA_API_KEY }}
        run: python trigger_scrape.py
```
Two key elements:
schedule: cron: "0 6 * * *" runs the job at 06:00 UTC every day. Adjust the cron expression for your timezone. GitHub’s scheduled workflows have no guaranteed precision, but daily runs typically fire within a few minutes of the scheduled time.
workflow_dispatch adds a Run workflow button in the Actions tab so you can kick off the job on demand without waiting for the schedule.
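Remember that GitHub interprets the cron expression in UTC, so shift the hour for your local time. For example, 06:00 US Eastern in winter (EST is UTC-5) would look like this, though you'd need to revisit it when daylight saving time shifts the offset:

```yaml
on:
  schedule:
    - cron: "0 11 * * *"  # 06:00 EST == 11:00 UTC
```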
Now add your Bright Data key as a repo secret:
In your GitHub repo, go to Settings > Secrets and variables > Actions
Click New repository secret
Name it BRIGHT_DATA_API_KEY and paste your key
Click Add secret
Treat your Bright Data API key like a password. Never commit it to the repo, never paste it into workflow logs, and rotate it if you suspect it’s been exposed.
Open your repo on GitHub and go to the Actions tab. You should see the Daily Amazon price scrape workflow listed.

Click Run workflow > Run workflow to kick it off manually. Within a few seconds a new run appears. Click into it and watch the steps execute. The final step should log:
```
Triggered scrape for 3 SKUs. Snapshot: sd_mntfn4abcdefghij
```
Wait a minute, then check S3 again:
```
aws s3 ls s3://your-bucket-name/amazon/products/
```
You should now see two snapshot files — the one from Part 3 and the one from the workflow run.

From here, GitHub Actions will run the workflow every day at the scheduled time, and a fresh file will appear in S3 each morning. No server, no cron job, no babysitting.