
Install

pip install langchain-crawlbase

Lightweight install — only langchain-core and requests are pulled in as dependencies; no other LangChain extras are required. Tested on Python 3.9+.

Document loader

Use CrawlbaseLoader anywhere LangChain expects a loader — RAG pipelines, vectorstore ingestion, agent context.

from langchain_crawlbase import CrawlbaseLoader

loader = CrawlbaseLoader(
    urls=["https://example.com/blog/post-1", "https://example.com/blog/post-2"],
    token="YOUR_TOKEN",
)
docs = loader.load()

from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

vs = Chroma.from_documents(docs, OpenAIEmbeddings())

Agent tool

Expose Crawlbase as an agent tool so the LLM can fetch URLs on demand.

from langchain_openai import ChatOpenAI
from langchain_crawlbase import CrawlbaseTool

tool = CrawlbaseTool(token="YOUR_TOKEN")

llm = ChatOpenAI(model="gpt-4o").bind_tools([tool])
llm.invoke("What's on the homepage of anthropic.com today?")
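Note that bind_tools only advertises the tool to the model; the invoke call above returns tool calls rather than fetched pages, and your code still has to execute them and feed the results back. The dispatch step can be sketched with plain dicts (in real LangChain code you would read response.tool_calls and reply with ToolMessage objects):

```python
def execute_tool_calls(tool_calls, tools_by_name):
    """Run each tool call the model requested and pair each result with its call ID."""
    results = []
    for call in tool_calls:
        tool_fn = tools_by_name[call["name"]]
        results.append({
            "tool_call_id": call["id"],
            "content": tool_fn(**call["args"]),
        })
    return results
```

In an agent loop you would append these results to the message history and call the model again so it can answer from the fetched content.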

Retriever

CrawlbaseRetriever fetches a fixed set of seed URLs and returns documents matching a query. Useful when you want live results without standing up a vector store.

from langchain_crawlbase import CrawlbaseRetriever

retriever = CrawlbaseRetriever(
    token="YOUR_TOKEN",
    urls=[
        "https://crawlbase.com/docs/crawling-api",
        "https://crawlbase.com/docs/crawling-api#parameters",
    ],
)
docs = retriever.invoke("how do I render JavaScript pages")

v0.1 uses case-insensitive substring matching against the fetched Markdown. For semantic retrieval, pair CrawlbaseLoader with the vector store of your choice.
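The v0.1 matching behavior is easy to reason about. A minimal sketch of the idea (not the library's actual code — the real retriever returns langchain_core Document objects, modeled here as plain dicts):

```python
def match_documents(docs, query):
    """Keep documents whose text contains the query, case-insensitively."""
    q = query.lower()
    return [doc for doc in docs if q in doc["page_content"].lower()]
```

Because the whole query string must appear verbatim, short keyword queries match more reliably than full questions.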

JavaScript-rendered pages

For SPAs and pages whose content loads via JavaScript, pass your JavaScript token in the same token parameter — Crawlbase routes the request based on which token you send. No extra flag needed.
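A configuration sketch, assuming a JavaScript token placeholder and Crawlbase's page_wait parameter to let the page finish rendering:

```python
from langchain_crawlbase import CrawlbaseLoader

# Pass the JavaScript token in the same token parameter; Crawlbase routes
# the request to its browser-rendering stack based on the token itself.
loader = CrawlbaseLoader(
    token="YOUR_JS_TOKEN",
    urls=["https://example.com/spa"],
    extra_params={"page_wait": 5000},  # wait 5s for client-side rendering
)
docs = loader.load()
```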

Extra Crawlbase parameters

Forward any Crawlbase API parameter (country, device, page_wait, scroll, css_click_selector, cookies, screenshots, etc.) via extra_params.

loader = CrawlbaseLoader(
    token="YOUR_TOKEN",
    urls=["https://example.com"],
    extra_params={"country": "US", "device": "mobile"},
)

Document metadata

Each Document returned by the loader / retriever carries response metadata from Crawlbase:

  • source — the URL you requested
  • resolved_url — the final URL after any redirects (when different from source)
  • pc_status — Crawlbase's final status code
  • original_status — HTTP status returned by the target site
  • content_type — response content-type header
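This metadata is handy for dropping failed fetches before ingestion. A minimal sketch, using plain dicts shaped like the metadata fields above (the real loader yields langchain_core Document objects with a .metadata attribute):

```python
def successful_docs(docs):
    """Keep only documents whose target site returned HTTP 200."""
    return [d for d in docs if d["metadata"].get("original_status") == 200]
```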

Common patterns

  • RAG over a fresh crawl: use CrawlbaseLoader to grab a few seed URLs, split into chunks, embed, query.
  • Live web research agent: register CrawlbaseTool alongside a search tool — the agent searches first, then crawls relevant results.
  • Site monitoring: schedule the loader to re-fetch the same URLs daily, diff the content against the previous crawl, and update your vector store.
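The "split into chunks" step in the first pattern boils down to fixed-size windows with overlap, which can be sketched without any dependencies (the chunk sizes here are arbitrary):

```python
def split_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping fixed-size chunks for embedding."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]
```

In practice you would use RecursiveCharacterTextSplitter from langchain-text-splitters instead, which also tries to break on paragraph and sentence boundaries.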