

Build an AI agent that automatically researches topics and extracts structured data, using LangGraph and Bright Data.
This tutorial takes you from zero to a working AI enrichment agent in under 10 minutes.

What You'll Build

An AI agent that:
  1. Takes a research topic and a JSON schema as input (e.g. "CTOs of B2B companies in New York", "Stripe payments company", "history of renewable energy technology")
  2. Searches the web with the Bright Data SERP API (real search results, geo-targeting)
  3. Scrapes websites with Bright Data Web Unlocker (bypassing anti-bot measures)
  4. Extracts and structures the data with an LLM
  5. Validates and returns structured JSON matching your schema
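At a high level, the loop above can be sketched with plain-Python stubs. This is a toy illustration only: the function names are placeholders, and the real tools and LLM are wired up in Step 3.

```python
# Toy sketch of the enrichment flow; search/scrape/extract are
# placeholder stubs, not the real Bright Data or LLM calls.
def search(topic: str) -> list:
    """Stub: pretend to find one relevant URL for the topic."""
    return [f"https://example.com/{topic.replace(' ', '-')}"]

def scrape(url: str) -> str:
    """Stub: pretend to fetch page content."""
    return f"page content from {url}"

def extract(content: str, schema: dict) -> dict:
    """Stub: a real agent uses an LLM here; this just echoes keys."""
    return {key: content for key in schema.get("required", [])}

def enrich_sketch(topic: str, schema: dict) -> dict:
    pages = [scrape(url) for url in search(topic)]
    return extract(" ".join(pages), schema)
```

The real agent replaces each stub with a tool call and lets the LLM decide when it has gathered enough to submit.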

Prerequisites

  - Python 3.9 or later with pip
  - A Bright Data API key
  - An Anthropic API key
Step 1: Install Dependencies

pip install langgraph langchain-brightdata langchain-anthropic

Step 2: Set Environment Variables

Get your API keys:
export BRIGHT_DATA_API_KEY="your-bright-data-api-key"
export ANTHROPIC_API_KEY="your-anthropic-api-key"
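Before running the agent, a quick sanity check can confirm both keys are actually visible to Python. This is a minimal sketch; the variable names match the exports above.

```python
import os

REQUIRED_KEYS = ("BRIGHT_DATA_API_KEY", "ANTHROPIC_API_KEY")

def missing_keys(env=os.environ) -> list:
    """Return the required API-key names that are unset or empty."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]

if __name__ == "__main__":
    absent = missing_keys()
    if absent:
        print("Missing environment variables:", ", ".join(absent))
    else:
        print("All API keys are set.")
```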

Step 3: Create the Enrichment Agent

Create a file named enrichment_agent.py and build it in three parts:

3.1: Imports and State Definition

Define the agent's state to track the research topic, schema, messages, and extracted info.
"""Simple data enrichment agent using LangGraph + Bright Data."""

import json
from dataclasses import dataclass, field
from typing import Any, Annotated, List, Optional

from langchain_anthropic import ChatAnthropic
from langchain_brightdata import BrightDataSERP, BrightDataUnlocker
from langchain_core.messages import AIMessage, HumanMessage, BaseMessage
from langchain_core.tools import tool
from langgraph.graph import StateGraph
from langgraph.graph.message import add_messages
from langgraph.prebuilt import ToolNode


# Agent state tracks topic, schema, conversation history, and final output
@dataclass
class AgentState:
    """State for the enrichment agent."""
    topic: str                                                              # Research topic
    extraction_schema: dict[str, Any]                                       # JSON schema for output
    messages: Annotated[List[BaseMessage], add_messages] = field(default_factory=list)  # Chat history
    info: Optional[dict[str, Any]] = None                                   # Extracted result

3.2: Tool Definitions

Configure the Bright Data tools for web search and content scraping.
# --- Tools ---

# SERP tool: Searches Google and returns structured results
serp_tool = BrightDataSERP(
    search_engine="google",
    country="us",
    language="en",
    results_count=5,
    parse_results=True,
)

# Unlocker tool: Scrapes any URL, bypassing anti-bot protection
unlocker_tool = BrightDataUnlocker(
    data_format="markdown",
)


@tool
async def search(query: str) -> str:
    """Search the web for information about a topic."""
    results = await serp_tool.ainvoke(query)
    return json.dumps(results, indent=2)


@tool
async def scrape_website(url: str) -> str:
    """Scrape and extract content from a specific URL."""
    content = await unlocker_tool.ainvoke(url)
    return str(content)[:20000]  # Limit content to avoid token overflow


tools = [search, scrape_website]

3.3: Agent Graph and Execution

Build the LangGraph workflow that orchestrates search → scrape → extract.
# --- Agent ---

# System prompt instructs the LLM on its research task
SYSTEM_PROMPT = """You are a research agent. Your task is to gather information about a topic and extract structured data.

You have access to these tools:
- search: Search the web for information
- scrape_website: Get content from a specific URL
- submit_info: Call this when you have gathered all the required information

Research topic: {topic}

Required information schema:
{schema}

Search for relevant information, scrape important pages, then call submit_info with the extracted data."""


def create_agent():
    """Create the enrichment agent graph."""
    llm = ChatAnthropic(model="claude-sonnet-4-20250514")

    async def call_model(state: AgentState) -> dict:
        """Call the LLM to decide next action or submit results."""
        prompt = SYSTEM_PROMPT.format(
            topic=state.topic,
            schema=json.dumps(state.extraction_schema, indent=2)
        )
        messages = [HumanMessage(content=prompt)] + list(state.messages)

        # Dynamic tool for structured output submission
        info_tool = {
            "name": "submit_info",
            "description": "Submit the extracted information when done researching.",
            "parameters": state.extraction_schema,
        }

        model = llm.bind_tools(tools + [info_tool])
        response = await model.ainvoke(messages)

        # Check if agent is submitting final info
        info = None
        if hasattr(response, 'tool_calls') and response.tool_calls:
            for tc in response.tool_calls:
                if tc["name"] == "submit_info":
                    info = tc["args"]
                    break

        return {"messages": [response], "info": info}

    def route(state: AgentState) -> str:
        """Route: end if info submitted, else continue tool loop."""
        if state.info:
            return "__end__"
        if not state.messages:
            return "agent"

        last_msg = state.messages[-1]
        if isinstance(last_msg, AIMessage) and hasattr(last_msg, 'tool_calls') and last_msg.tool_calls:
            for tc in last_msg.tool_calls:
                if tc["name"] == "submit_info":
                    return "__end__"
            return "tools"
        return "agent"

    # Build the graph: agent ↔ tools loop until info is extracted
    graph = StateGraph(AgentState)
    graph.add_node("agent", call_model)
    graph.add_node("tools", ToolNode(tools))
    graph.add_edge("__start__", "agent")
    graph.add_conditional_edges("agent", route)
    graph.add_edge("tools", "agent")

    return graph.compile()


async def enrich(topic: str, schema: dict) -> dict:
    """Run the enrichment agent and return structured data."""
    agent = create_agent()
    result = await agent.ainvoke({
        "topic": topic,
        "extraction_schema": schema,
    })
    return result.get("info", {})


# --- Example Usage ---
if __name__ == "__main__":
    import asyncio

    schema = {
        "type": "object",
        "properties": {
            "company_name": {"type": "string"},
            "industry": {"type": "string"},
            "headquarters": {"type": "string"},
            "founded": {"type": "string"},
            "key_products": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["company_name", "industry"]
    }

    result = asyncio.run(enrich("Stripe payments company", schema))
    print(json.dumps(result, indent=2))

Step 4: Run the Agent

python enrichment_agent.py
Expected output:
{
  "company_name": "Stripe",
  "industry": "Financial Technology / Payments",
  "headquarters": "San Francisco, California",
  "founded": "2010",
  "key_products": [
    "Stripe Payments",
    "Stripe Billing",
    "Stripe Connect",
    "Stripe Atlas"
  ]
}

How It Works

  1. The agent receives a topic and schema, then decides what to search for
  2. Search uses the Bright Data SERP API to fetch real search results
  3. Scraping uses Bright Data Web Unlocker to fetch page content
  4. The agent analyzes the content, then either keeps researching or submits the info
  5. The output is structured JSON matching your schema
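Step 5 relies on the model honoring the schema passed to submit_info. If you want an extra guard on the agent's output, a minimal stdlib check of the required fields might look like this (a full validator such as the jsonschema package would also check types):

```python
def check_required(result: dict, schema: dict) -> list:
    """Return the required schema fields missing from the extracted result."""
    return [key for key in schema.get("required", []) if key not in result]

# Example with a schema like the one from Step 3:
schema = {"type": "object", "required": ["company_name", "industry"]}
print(check_required({"company_name": "Stripe"}, schema))  # → ['industry']
```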

Customization Examples

Different Extraction Schemas

# Extract competitor information
competitor_schema = {
    "type": "object",
    "properties": {
        "competitors": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "market_position": {"type": "string"},
                    "key_differentiator": {"type": "string"}
                }
            }
        }
    }
}

# Run inside an async context, e.g. via asyncio.run(...)
result = await enrich("Stripe competitors in payment processing", competitor_schema)

Geo-Targeted Search

serp_tool = BrightDataSERP(
    search_engine="google",
    country="de",      # Germany
    language="de",     # German
    results_count=10,
)

Using OpenAI Instead of Anthropic

from langchain_openai import ChatOpenAI

# Replace the LLM initialization
llm = ChatOpenAI(model="gpt-4o")

Next Steps

LinkedIn Scraping

Add LinkedIn profile enrichment

LangChain Integration

Full langchain-brightdata documentation

Source Code

GitHub repository

Full source code and additional examples

You now have an AI-powered data enrichment agent! Customize the schema to extract whatever structured data you need.