
Install

pip install langchain-crawlbase

Lightweight install — only langchain-core and requests are pulled in as dependencies; no other LangChain extras are required. Tested on Python 3.9+.

Document loader

Use CrawlbaseLoader anywhere LangChain expects a loader — RAG pipelines, vectorstore ingestion, agent context.

from langchain_crawlbase import CrawlbaseLoader

loader = CrawlbaseLoader(
    urls=["https://example.com/blog/post-1", "https://example.com/blog/post-2"],
    token="YOUR_TOKEN",
)
docs = loader.load()

from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

vs = Chroma.from_documents(docs, OpenAIEmbeddings())

Agent tool

Expose Crawlbase as an agent tool so the LLM can fetch URLs on demand.

from langchain_openai import ChatOpenAI
from langchain_crawlbase import CrawlbaseTool

tool = CrawlbaseTool(token="YOUR_TOKEN")

llm = ChatOpenAI(model="gpt-4o").bind_tools([tool])
llm.invoke("What's on the homepage of anthropic.com today?")
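Note that bind_tools only advertises the tool to the model; the invoke call above returns tool calls rather than fetched pages, and your code still has to execute them and feed the results back. The dispatch step can be sketched with plain dicts (in real LangChain code you would read response.tool_calls and reply with ToolMessage objects):

```python
def execute_tool_calls(tool_calls, tools_by_name):
    """Run each tool call the model requested and pair each result with its call ID."""
    results = []
    for call in tool_calls:
        tool_fn = tools_by_name[call["name"]]
        results.append({
            "tool_call_id": call["id"],
            "content": tool_fn(**call["args"]),
        })
    return results
```

In an agent loop you would append these results to the message history and call the model again so it can answer from the fetched content.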

Retriever

CrawlbaseRetriever fetches a fixed set of seed URLs and returns documents matching a query. Useful when you want live results without standing up a vector store.

from langchain_crawlbase import CrawlbaseRetriever

retriever = CrawlbaseRetriever(
    token="YOUR_TOKEN",
    urls=[
        "https://crawlbase.com/docs/crawling-api",
        "https://crawlbase.com/docs/crawling-api#parameters",
    ],
)
docs = retriever.invoke("how do I render JavaScript pages")

v0.1 uses case-insensitive substring matching against the fetched Markdown. For semantic retrieval, pair CrawlbaseLoader with the vector store of your choice.
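The v0.1 matching behavior is easy to reason about. A minimal sketch of the idea (not the library's actual code — the real retriever returns langchain_core Document objects, modeled here as plain dicts):

```python
def match_documents(docs, query):
    """Keep documents whose text contains the query, case-insensitively."""
    q = query.lower()
    return [doc for doc in docs if q in doc["page_content"].lower()]
```

Because the whole query string must appear verbatim, short keyword queries match more reliably than full questions.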

JavaScript-rendered pages

For SPAs and pages whose content loads via JavaScript, pass your JavaScript token in the same token parameter — Crawlbase routes the request based on which token you send. No extra flag needed.
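A configuration sketch, assuming a JavaScript token placeholder and Crawlbase's page_wait parameter to let the page finish rendering:

```python
from langchain_crawlbase import CrawlbaseLoader

# Pass the JavaScript token in the same token parameter; Crawlbase routes
# the request to its browser-rendering stack based on the token itself.
loader = CrawlbaseLoader(
    token="YOUR_JS_TOKEN",
    urls=["https://example.com/spa"],
    extra_params={"page_wait": 5000},  # wait 5s for client-side rendering
)
docs = loader.load()
```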

Extra Crawlbase parameters

Forward any Crawlbase API parameter (country, device, page_wait, scroll, css_click_selector, cookies, screenshots, etc.) via extra_params.

loader = CrawlbaseLoader(
    token="YOUR_TOKEN",
    urls=["https://example.com"],
    extra_params={"country": "US", "device": "mobile"},
)

Document metadata

Each Document returned by the loader / retriever carries response metadata from Crawlbase:

  • source — the URL you requested
  • resolved_url — the final URL after any redirects (when different from source)
  • pc_status — Crawlbase's final status code
  • original_status — HTTP status returned by the target site
  • content_type — response content-type header
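This metadata is handy for dropping failed fetches before ingestion. A minimal sketch, using plain dicts shaped like the metadata fields above (the real loader yields langchain_core Document objects with a .metadata attribute):

```python
def successful_docs(docs):
    """Keep only documents whose target site returned HTTP 200."""
    return [d for d in docs if d["metadata"].get("original_status") == 200]
```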

Common patterns

  • RAG over a fresh crawl: use CrawlbaseLoader to grab a few seed URLs, split into chunks, embed, query.
  • Live web research agent: register CrawlbaseTool alongside a search tool — the agent searches first, then crawls relevant results.
  • Site monitoring: schedule the loader to re-fetch the same URLs daily, diff the content against the previous crawl, and update your vector store.
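The "split into chunks" step in the first pattern boils down to fixed-size windows with overlap, which can be sketched without any dependencies (the chunk sizes here are arbitrary):

```python
def split_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping fixed-size chunks for embedding."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]
```

In practice you would use RecursiveCharacterTextSplitter from langchain-text-splitters instead, which also tries to break on paragraph and sentence boundaries.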