LangChain
Drop-in document loaders, retrievers, and agent tools for LangChain. Crawl any URL straight into your retrieval pipeline or expose Crawlbase as a tool the agent can call.
Install
```shell
pip install langchain-crawlbase
```
Lightweight install — only langchain-core and requests come along, no other LangChain extras required. Tested on Python 3.9+.
Document loader
Use CrawlbaseLoader anywhere LangChain expects a loader — RAG pipelines, vectorstore ingestion, agent context.
```python
from langchain_crawlbase import CrawlbaseLoader

loader = CrawlbaseLoader(
    urls=["https://example.com/blog/post-1", "https://example.com/blog/post-2"],
    token="YOUR_TOKEN",
)
docs = loader.load()
```
```python
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

vs = Chroma.from_documents(docs, OpenAIEmbeddings())
```
Agent tool
Expose Crawlbase as an agent tool so the LLM can fetch URLs on demand.
```python
from langchain_openai import ChatOpenAI
from langchain_crawlbase import CrawlbaseTool

tool = CrawlbaseTool(token="YOUR_TOKEN")
llm = ChatOpenAI(model="gpt-4o").bind_tools([tool])
llm.invoke("What's on the homepage of anthropic.com today?")
```
Retriever
CrawlbaseRetriever fetches a fixed set of seed URLs and returns documents matching a query. Useful when you want live results without standing up a vector store.
```python
from langchain_crawlbase import CrawlbaseRetriever

retriever = CrawlbaseRetriever(
    token="YOUR_TOKEN",
    urls=[
        "https://crawlbase.com/docs/crawling-api",
        "https://crawlbase.com/docs/crawling-api#parameters",
    ],
)
docs = retriever.invoke("how do I render JavaScript pages")
```
v0.1 uses case-insensitive substring matching against the fetched Markdown. For semantic retrieval, pair CrawlbaseLoader with the vector store of your choice.
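The v0.1 matching rule is worth seeing concretely. The sketch below is illustrative only (not the library's actual source) and shows why an exact phrase hits while a paraphrase misses:

```python
def matches(query: str, markdown: str) -> bool:
    # v0.1 behavior: case-insensitive substring match, no semantics
    return query.lower() in markdown.lower()

# An exact phrase hits regardless of case...
print(matches("render JavaScript", "How to render javascript pages"))   # True
# ...but a paraphrase misses, which is why semantic needs call for a vector store.
print(matches("dynamic page rendering", "How to render javascript pages"))  # False
```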
JavaScript-rendered pages
For SPAs and pages whose content loads via JavaScript, pass your JavaScript token in the same token parameter — Crawlbase routes the request based on which token you send. No extra flag needed.
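As a configuration sketch (the SPA URL is hypothetical; substitute your own JavaScript token), the only change from the earlier loader example is which token you pass, optionally with `page_wait` to let client-side scripts settle:

```python
from langchain_crawlbase import CrawlbaseLoader

# Hypothetical SPA URL; Crawlbase routes the request based on the token itself.
loader = CrawlbaseLoader(
    urls=["https://example.com/spa-dashboard"],
    token="YOUR_JAVASCRIPT_TOKEN",
    extra_params={"page_wait": 3000},  # wait 3s for JavaScript to render
)
docs = loader.load()
```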
Extra Crawlbase parameters
Forward any Crawlbase API parameter (country, device, page_wait, scroll, css_click_selector, cookies, screenshots, etc.) via extra_params.
```python
loader = CrawlbaseLoader(
    token="YOUR_TOKEN",
    urls=["https://example.com"],
    extra_params={"country": "US", "device": "mobile"},
)
```
Document metadata
Each Document returned by the loader / retriever carries response metadata from Crawlbase:
- source — the URL you requested
- resolved_url — the final URL after any redirects (when different from source)
- pc_status — Crawlbase's final status code
- original_status — HTTP status returned by the target site
- content_type — response content-type header
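A quick sketch of putting these fields to work: the dicts below stand in for real `doc.metadata` from a loader run, and the redirect check relies only on the keys listed above:

```python
# Stand-ins for doc.metadata from CrawlbaseLoader.load()
metas = [
    {"source": "https://example.com/a", "pc_status": 200,
     "original_status": 200, "content_type": "text/html"},
    {"source": "https://example.com/old",
     "resolved_url": "https://example.com/new",
     "pc_status": 200, "original_status": 301, "content_type": "text/html"},
]

# resolved_url is only present when a redirect happened
redirected = [m["source"] for m in metas if "resolved_url" in m]
print(redirected)  # ['https://example.com/old']
```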
Common patterns
- RAG over a fresh crawl: use CrawlbaseLoader to grab a few seed URLs, split into chunks, embed, query.
- Live web research agent: register CrawlbaseTool alongside a search tool — the agent searches first, then crawls relevant results.
- Site monitoring: schedule the loader to re-fetch the same URLs daily and diff into your vector store.

