Integrating Bright Data with Haystack gives your RAG pipelines and AI applications access to live web data through scalable extraction. The haystack-brightdata Python package is the official Haystack integration for Bright Data and includes three components:
  • Bright Data Web Scraper - Extract structured data from 45+ supported websites including Amazon, LinkedIn, Instagram, Facebook, TikTok, YouTube, and more using Bright Data’s Dataset API.
  • Bright Data SERP - Query search engines (Google, Bing, Yahoo) with geo-targeting and language customization for real-time search results.
  • Bright Data Unlocker - Access geo-restricted and bot-protected websites, bypassing CAPTCHAs and anti-bot measures, and extract content in multiple formats.

How to Integrate Bright Data With Haystack

1. Obtain Your Bright Data API Key

2. Install the Bright Data Integration

Install the Bright Data integration package for Haystack by running the following command:
pip install haystack-brightdata

3. Set the Environment Variable

Set your Bright Data API key as an environment variable:
import os
os.environ["BRIGHT_DATA_API_KEY"] = "your-api-key"
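
Hardcoding the key in source is convenient for a quick test, but in shared code it is usually safer to export it in the shell and fail early when it is missing. The following is a minimal sketch of that check; the helper name require_api_key is hypothetical and is not part of haystack-brightdata:

```python
import os


def require_api_key(name: str = "BRIGHT_DATA_API_KEY") -> str:
    """Return the API key stored in the environment variable `name`.

    Raises RuntimeError if the variable is unset or blank, so a
    misconfigured environment is caught before any request is made.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

Calling require_api_key() once at startup, before any component is created, surfaces a missing key as a clear error message.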

4. Select Your Preferred Bright Data Component

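The Bright Data + Haystack integration currently supports three components: BrightDataWebScraper, BrightDataSERP, and BrightDataUnlocker.
The example below uses BrightDataWebScraper, which extracts structured data from 45+ supported websites, including e-commerce, social media, and business intelligence platforms.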
from haystack_brightdata import BrightDataWebScraper
import os

# Set your API key
os.environ["BRIGHT_DATA_API_KEY"] = "your-api-key"

# Initialize the scraper
scraper = BrightDataWebScraper()

# Extract Amazon product data
result = scraper.run(
    dataset="amazon_product",
    url="https://www.amazon.com/dp/B08N5WRWNW"
)
print(result["data"])

RAG Pipeline Examples

Product Data RAG Pipeline

Build a Retrieval-Augmented Generation (RAG) pipeline using Bright Data to extract product data from Amazon and answer questions about products:
import os
from haystack import Pipeline, Document
from haystack.components.builders import ChatPromptBuilder
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.embedders import OpenAIDocumentEmbedder, OpenAITextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.dataclasses import ChatMessage
from haystack_brightdata import BrightDataWebScraper
import json

# Set API keys
os.environ["BRIGHT_DATA_API_KEY"] = "your-brightdata-api-key"
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"

# Initialize components
scraper = BrightDataWebScraper()
document_store = InMemoryDocumentStore()
docs_embedder = OpenAIDocumentEmbedder()
text_embedder = OpenAITextEmbedder()
retriever = InMemoryEmbeddingRetriever(document_store)
generator = OpenAIChatGenerator()

# Scrape product data from multiple Amazon products
product_urls = [
    "https://www.amazon.com/dp/B0DRWBJDLJ",
    "https://www.amazon.com/dp/B08B8M5JGN",
    "https://www.amazon.com/dp/B09WTTWH1R",
]

documents = []
for url in product_urls:
    result = scraper.run(dataset="amazon_product", url=url)

    # Parse the response
    if isinstance(result["data"], str):
        product_data = json.loads(result["data"])
    else:
        product_data = result["data"]

    if not isinstance(product_data, list):
        product_data = [product_data]

    for product in product_data:
        content_parts = [
            f"Product: {product.get('title', 'N/A')}",
            f"Brand: {product.get('brand', 'N/A')}",
            f"Price: ${product.get('final_price', 'N/A')} {product.get('currency', '')}",
            f"Rating: {product.get('rating', 0)}/5",
            f"Reviews Count: {product.get('reviews_count', 0)}",
        ]

        if product.get('description'):
            content_parts.append(f"Description: {product.get('description')}")

        if product.get('features'):
            features_text = '\n  - '.join(product.get('features', []))
            content_parts.append(f"Features:\n  - {features_text}")

        content = '\n'.join(content_parts)

        documents.append(Document(
            content=content,
            meta={
                "url": product.get('url', url),
                "title": product.get('title', ''),
                "price": product.get('final_price', 0),
                "rating": product.get('rating', 0),
            }
        ))

# Embed and store documents
embeddings = docs_embedder.run(documents)
document_store.write_documents(embeddings["documents"])

# Create RAG pipeline with ChatPromptBuilder
messages = [
    ChatMessage.from_system("You are a helpful shopping assistant."),
    ChatMessage.from_user("""
Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{question}}
""")
]

prompt_builder = ChatPromptBuilder(template=messages)

# Build and connect pipeline
pipe = Pipeline()
pipe.add_component("embedder", text_embedder)
pipe.add_component("retriever", retriever)
pipe.add_component("prompt_builder", prompt_builder)
pipe.add_component("llm", generator)

pipe.connect("embedder.embedding", "retriever.query_embedding")
pipe.connect("retriever", "prompt_builder.documents")
pipe.connect("prompt_builder.prompt", "llm.messages")

# Ask questions about the products
question = "Which product has the best rating?"
response = pipe.run({
    "embedder": {"text": question},
    "prompt_builder": {"question": question}
})

print(f"Answer: {response['llm']['replies'][0].text}")
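
The response-parsing step inside the scraping loop above (JSON string, single object, or list) can be factored into a small reusable helper. This is only a sketch: the name normalize_records is hypothetical, and it assumes the payload shapes handled in the loop are the only ones worth keeping. Unlike the loop, it drops anything that is not a dict instead of wrapping it in a list.

```python
import json
from typing import Any


def normalize_records(data: Any) -> list[dict]:
    """Coerce a scraper payload into a list of dict records.

    Accepts a JSON string, a single dict, or a list. List items
    that are not dicts are dropped, and any other input type
    yields an empty list.
    """
    if isinstance(data, str):
        data = json.loads(data)
    if isinstance(data, dict):
        data = [data]
    if not isinstance(data, list):
        return []
    return [item for item in data if isinstance(item, dict)]
```

With this helper, the body of the loop could start with `for product in normalize_records(result["data"]):`, so later `product.get(...)` calls never see a non-dict value.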

SERP + Web Content RAG Pipeline

Use the SERP component to find relevant web pages, then use the Unlocker component to fetch their content for a RAG pipeline:
import os
from haystack import Pipeline, Document
from haystack.components.builders import ChatPromptBuilder
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.embedders import OpenAIDocumentEmbedder, OpenAITextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.dataclasses import ChatMessage
from haystack_brightdata import BrightDataSERP, BrightDataUnlocker
import json

# Set API keys
os.environ["BRIGHT_DATA_API_KEY"] = "your-brightdata-api-key"
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"

# Initialize components
serp = BrightDataSERP()
unlocker = BrightDataUnlocker(default_output_format="markdown", zone="unblocker")
document_store = InMemoryDocumentStore()
docs_embedder = OpenAIDocumentEmbedder()
text_embedder = OpenAITextEmbedder()
retriever = InMemoryEmbeddingRetriever(document_store)
generator = OpenAIChatGenerator(model="gpt-4")

# Search for information
search_query = "best practices for machine learning in production"
search_result = serp.run(query=search_query, num_results=5)
search_data = json.loads(search_result["results"])

# Extract URLs from search results
urls = []
for result in search_data.get("organic", [])[:5]:
    url = result.get("url") or result.get("link")
    if url:
        urls.append(url)

# Fetch content from each URL
documents = []
for url in urls:
    try:
        result = unlocker.run(url=url, output_format="markdown")
        content = result["content"]
        documents.append(Document(
            content=content,
            meta={"url": url}
        ))
    except Exception as e:
        print(f"Failed to fetch {url}: {e}")

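# Embed and store documents; stop early if every fetch failed
if not documents:
    raise SystemExit("No pages could be fetched, so there is nothing to index.")
embeddings = docs_embedder.run(documents)
document_store.write_documents(embeddings["documents"])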

# Create RAG pipeline
messages = [
    ChatMessage.from_system("You are a knowledgeable AI assistant."),
    ChatMessage.from_user("""
Context from web sources:
{% for document in documents %}
    Source: {{ document.meta.url }}
    {{ document.content }}
{% endfor %}

Question: {{question}}
""")
]

prompt_builder = ChatPromptBuilder(template=messages)

pipe = Pipeline()
pipe.add_component("embedder", text_embedder)
pipe.add_component("retriever", retriever)
pipe.add_component("prompt_builder", prompt_builder)
pipe.add_component("llm", generator)

pipe.connect("embedder.embedding", "retriever.query_embedding")
pipe.connect("retriever", "prompt_builder.documents")
pipe.connect("prompt_builder.prompt", "llm.messages")

# Ask questions
question = "What are the main challenges of deploying ML models in production?"
response = pipe.run({
    "embedder": {"text": question},
    "prompt_builder": {"question": question}
})

print(f"Answer: {response['llm']['replies'][0].text}")
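
The URL-extraction step in the example above can be written as a standalone function that also removes duplicate links, which search results sometimes contain. This is a sketch under the same assumption the example makes: the parsed payload has an "organic" list whose items carry the address under "url" or "link". The function name extract_urls is hypothetical.

```python
def extract_urls(search_data: dict, limit: int = 5) -> list[str]:
    """Collect up to `limit` unique result URLs from parsed SERP data.

    Assumes an "organic" list whose items store the address under
    "url" or "link". Items without either key are skipped, and the
    original result order is preserved.
    """
    urls: list[str] = []
    seen: set[str] = set()
    for item in search_data.get("organic", []):
        if len(urls) >= limit:
            break
        url = item.get("url") or item.get("link")
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

Deduplicating before the fetch loop avoids requesting and embedding the same page twice.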

Supported Datasets

The BrightDataWebScraper component supports 45+ datasets across multiple categories:
| Category | Datasets |
| --- | --- |
| E-commerce | amazon_product, amazon_product_reviews, amazon_product_search, walmart_product, walmart_seller, ebay_product, homedepot_products, zara_products, etsy_products, bestbuy_products |
| LinkedIn | linkedin_person_profile, linkedin_company_profile, linkedin_job_listings, linkedin_posts, linkedin_people_search |
| Instagram | instagram_profiles, instagram_posts, instagram_reels, instagram_comments |
| Facebook | facebook_posts, facebook_marketplace_listings, facebook_company_reviews, facebook_events |
| TikTok | tiktok_profiles, tiktok_posts, tiktok_shop, tiktok_comments |
| YouTube | youtube_profiles, youtube_videos, youtube_comments |
| Search & Commerce | google_maps_reviews, google_shopping, google_play_store, apple_app_store, zillow_properties_listing, booking_hotel_listings |
| Business Intelligence | crunchbase_company, zoominfo_company_profile |
| Other | reuter_news, github_repository_file, yahoo_finance_business, x_posts, reddit_posts |
To see every dataset ID with its description, list them programmatically:
from haystack_brightdata import BrightDataWebScraper

# List all datasets
datasets = BrightDataWebScraper.get_supported_datasets()
for dataset in datasets:
    print(f"{dataset['id']}: {dataset['description']}")
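
With 45+ entries, it can help to narrow the listing to one platform. The sketch below filters by a keyword in the dataset ID. It assumes each entry is a dict with an "id" key, as the listing loop above implies; the function name find_datasets is hypothetical and not part of the package.

```python
def find_datasets(datasets: list[dict], keyword: str) -> list[str]:
    """Return the IDs of datasets whose ID contains `keyword`.

    Matching is case-insensitive. Entries without an "id" key
    are skipped.
    """
    needle = keyword.lower()
    return [
        entry["id"]
        for entry in datasets
        if "id" in entry and needle in str(entry["id"]).lower()
    ]
```

For example, `find_datasets(BrightDataWebScraper.get_supported_datasets(), "linkedin")` would return only the LinkedIn dataset IDs, provided the returned entries have the shape assumed here.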

Use Cases

Bright Data’s Haystack integration supports use cases such as:
  • E-commerce Intelligence: Price monitoring, product data extraction, and competitive analysis
  • Social Media Analytics: Content monitoring and engagement analysis across platforms
  • Business Intelligence: Company research and competitive landscape analysis
  • Search Analysis: SEO/SEM research with geo-targeted search results
  • Content Aggregation: Building RAG pipelines with real-time web data
  • Market Research: Accessing geo-restricted content for global research