An AI data pipeline is only as good as the text you feed it, and the open web is the richest source of fresh, domain-specific knowledge you can ground a model on. The problem is getting that text in a usable shape: most pages are a tangle of navigation, ads, and JavaScript-rendered content that a plain HTTP fetch never sees. This guide shows you how to build an AI data pipeline with LangChain and Crawlbase, using the Crawling API as your document source so the pages arrive as clean markdown, then splitting, embedding, and querying them with retrieval-augmented generation (RAG).

The shape of the pipeline is simple and runs end to end in Python: Crawlbase fetches and cleans the page, LangChain splits it into chunks and embeds them into a vector store, and at query time you retrieve the most relevant chunks and hand them to an LLM as context. Crawlbase handles proxy rotation, blocking, and rendering so your pipeline code stays focused on the data, not on fighting anti-bot defenses. Everything below is runnable; swap in your own URLs and tokens and you have a working RAG system over live web content.

Why Crawlbase as the LangChain document source

LangChain ships document loaders for files, databases, and a handful of web sources, but loading real web pages at scale is where most pipelines stall. A bare request to a modern site returns either a JavaScript shell with no content or a block page, and even when you get HTML back, it is full of boilerplate that pollutes your embeddings. Garbage chunks mean garbage retrieval, which means an LLM that confidently cites the cookie banner.

The Crawling API solves the acquisition layer cleanly. You send it a URL, it renders the page behind a trusted residential IP, and it can hand back the content as clean markdown instead of raw HTML. That markdown is exactly what you want as a LangChain document: readable prose with headings preserved and the navigation, scripts, and ad markup stripped out. Feeding pre-cleaned markdown into your splitter is the single biggest quality lever in a web-grounded RAG pipeline, and it is the same idea explored in LLM-ready markdown web scraping.

This separation of concerns is what keeps the pipeline maintainable. Crawlbase owns web access: rotating IPs, solving CAPTCHAs, rendering JavaScript, and returning structured output. LangChain owns orchestration: chunking, embedding, retrieval, and the prompt that frames the answer. The model owns reasoning. You can change how you chunk or which model you query without touching how data is fetched, and the reverse holds too.

Markdown over raw HTML

The Crawling API accepts a format=markdown parameter (and a get_markdown helper in the official client) that returns the page as clean markdown rather than HTML. For RAG this matters: markdown keeps headings and lists as structure your splitter can respect, while dropping the boilerplate that would otherwise become noisy, low-value chunks in your vector store.
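To make the parameter concrete, here is a minimal sketch of what such a request could look like on the wire if you called the HTTP endpoint directly instead of using the client. The endpoint constant and the build_request_url helper are assumptions for illustration; only the format=markdown parameter comes from the description above, so confirm the rest against your dashboard documentation.

```python
from urllib.parse import urlencode

# Assumption: the public Crawling API endpoint. Confirm it in your dashboard docs.
CRAWLBASE_ENDPOINT = "https://api.crawlbase.com/"


def build_request_url(token: str, page_url: str, fmt: str = "markdown") -> str:
    """Return the GET URL that asks the Crawling API for page_url in the given format.

    Hypothetical helper: it only builds the URL and performs no network call.
    """
    # urlencode percent-encodes the target URL so its own query string survives
    # as a single value of the "url" parameter.
    query = urlencode({"token": token, "url": page_url, "format": fmt})
    return f"{CRAWLBASE_ENDPOINT}?{query}"


request_url = build_request_url("YOUR_TOKEN", "https://example.com/docs?page=2")
print(request_url)
```

The point of the sketch is the encoding step: the target URL has to be percent-encoded, otherwise its own query string would be read as parameters of the API call.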

Architecture: from URL to grounded answer

The pipeline has four stages, each with one job:

  • Acquire: the Crawling API fetches each URL and returns clean markdown.
  • Split: LangChain's text splitter breaks each document into overlapping chunks small enough to embed and retrieve precisely.
  • Embed and store: each chunk is turned into a vector and written to a vector store (we use Chroma locally).
  • Retrieve and generate: at query time you embed the question, pull the closest chunks, and pass them to an LLM as grounding context.

The first three stages are an offline ingestion job you run when your sources change. The fourth runs every time a user asks a question. Keeping ingestion and querying separate is what lets the pipeline scale: you crawl and embed once, then answer many questions cheaply against the stored vectors. The broader pattern, including why cleaning matters before you ever embed, is covered in how to structure and clean web-scraped data for AI and ML.
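The split between ingestion and querying can be sketched with the standard library alone. Everything in this toy is a stand-in and an assumption: acquire returns canned markdown instead of calling the Crawling API, split breaks on blank lines, and embed is a bag-of-words count rather than a real embedding model. It only shows how the four stages hand data to each other.

```python
import math
import re
from collections import Counter

# Stand-in for the acquisition stage: canned markdown keyed by URL.
PAGES = {
    "https://example.com/docs/getting-started": (
        "# Getting started\n\nInstall the client with pip and export your token."
    ),
    "https://example.com/docs/pricing": (
        "# Pricing\n\nThe starter plan costs ten dollars per month.\n\n"
        "Enterprise plans are billed annually."
    ),
}


def acquire(url: str) -> str:
    """Stage 1 (toy): return markdown for a URL."""
    return PAGES[url]


def split(markdown: str) -> list[str]:
    """Stage 2 (toy): one chunk per blank-line-separated block."""
    return [block.strip() for block in markdown.split("\n\n") if block.strip()]


def embed(text: str) -> Counter:
    """Stage 3 (toy): bag-of-words counts instead of a learned embedding."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Ingestion: runs once, when the sources change.
store = [(chunk, embed(chunk)) for url in PAGES for chunk in split(acquire(url))]


def retrieve(question: str, k: int = 1) -> list[str]:
    """Stage 4 (toy): rank stored chunks by similarity to the question."""
    query_vector = embed(question)
    ranked = sorted(store, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


print(retrieve("How much does the starter plan cost?"))
```

In the real pipeline each toy function is replaced by the corresponding component, but the boundary stays the same: the store is built once and retrieve only reads from it.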

Set up the project

You need Python 3.10 or newer. Create a virtual environment and install the libraries: the official Crawlbase client for acquisition, the LangChain packages for orchestration, Chroma for the vector store, and the OpenAI integration for embeddings and the chat model.

bash
python -m venv .venv
source .venv/bin/activate

pip install crawlbase langchain langchain-community langchain-openai langchain-chroma

You also need two credentials: a Crawlbase token from your dashboard, and an embedding/LLM provider key (here, an OpenAI key). The crawlbase package gives you the CrawlingAPI client; langchain-chroma wraps the local Chroma store; langchain-openai supplies both the embeddings and the chat model. Export your keys as environment variables so nothing sensitive lives in the code.

bash
export CRAWLBASE_TOKEN="your_crawlbase_token"
export OPENAI_API_KEY="your_openai_key"
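Since every later step reads these variables, it can help to fail fast when one is missing. The following is a small sketch, not part of either library; missing_keys and require_keys are hypothetical helper names.

```python
import os

# The two variables exported above.
REQUIRED_KEYS = ("CRAWLBASE_TOKEN", "OPENAI_API_KEY")


def missing_keys(env=None, required=REQUIRED_KEYS) -> list[str]:
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


def require_keys(env=None) -> None:
    """Raise a readable error before any network call is attempted."""
    missing = missing_keys(env)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))


# Typical use at the top of the ingestion script:
#     require_keys()
print(missing_keys({"CRAWLBASE_TOKEN": "abc"}))
```

Calling require_keys() first turns a missing key into one clear message instead of a KeyError or an authentication failure several steps later.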

Step 1: Fetch clean markdown with the Crawling API

Start with acquisition. The official client exposes a get method that takes a URL and options; passing format=markdown returns the page as clean markdown in the response body. Wrap that in a small function that turns each fetched page into a LangChain Document, carrying the source URL in the metadata so you can cite it later.

python
import os
from crawlbase import CrawlingAPI
from langchain_core.documents import Document

api = CrawlingAPI({"token": os.environ["CRAWLBASE_TOKEN"]})

def load_page(url):
    # format=markdown returns clean markdown, not raw HTML
    response = api.get(url, {"format": "markdown"})
    if response["status_code"] != 200:
        raise RuntimeError(f"Fetch failed for {url}: {response['status_code']}")
    body = response["body"]
    text = body.decode("utf-8") if isinstance(body, bytes) else body
    return Document(page_content=text, metadata={"source": url})

urls = [
    "https://example.com/docs/getting-started",
    "https://example.com/docs/pricing",
]
docs = [load_page(u) for u in urls]
print(f"Loaded {len(docs)} documents")

For JavaScript-heavy pages, add "ajax_wait": "true" and a "page_wait" in milliseconds to the options dict, and use a JavaScript token. Because acquisition is isolated in load_page, swapping those options in does not touch any downstream stage. If a site responds with a non-200 status, the function raises with the code so a bad source surfaces loudly instead of poisoning your store with an error page.


Step 2: Split documents into chunks

Whole pages are too large to embed usefully: a single vector for a long document blurs distinct topics together and hurts retrieval precision. Split each document into overlapping chunks instead. LangChain's RecursiveCharacterTextSplitter tries to break on paragraph and sentence boundaries first, so chunks stay coherent, and because the markdown from Crawlbase preserves headings and lists, those splits land on natural seams.

python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=150,
)

chunks = splitter.split_documents(docs)
print(f"Split into {len(chunks)} chunks")

A chunk_size of about 1000 characters with a 150-character overlap is a sensible default for prose. The overlap carries a little context across boundaries so a fact split across two chunks is not lost. Tune both to your content: denser, more technical pages often retrieve better with smaller chunks, while long-form articles tolerate larger ones. The metadata from load_page is copied onto every chunk automatically, so each one still knows its source URL.
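To see what the two numbers do, here is a deliberately naive fixed-window splitter. It is not how RecursiveCharacterTextSplitter works (that one prefers paragraph and sentence boundaries); window_chunks is a hypothetical helper that only illustrates the arithmetic of size and overlap.

```python
def window_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 150) -> list[str]:
    """Cut text into fixed-size windows where each window repeats the tail of the last."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(2500))
pieces = window_chunks(sample)
print([len(piece) for piece in pieces])  # [1000, 1000, 800]
print(pieces[0][-150:] == pieces[1][:150])  # True
```

With a 1000-character window and 150 characters of overlap, each chunk starts 850 characters after the previous one, so a 2500-character page yields three chunks rather than two and a half.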

Step 3: Embed and store in a vector database

Now turn each chunk into a vector and persist it. An embedding model maps text to a point in high-dimensional space where semantically similar passages sit close together, which is what makes retrieval by meaning possible. Chroma stores those vectors locally and handles the similarity search; passing persist_directory writes the index to disk so you only pay the embedding cost once.

python
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vector_store = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
)
print(f"Stored {len(chunks)} vectors")

This block is the end of ingestion. Run it once when your sources change, not on every query. To reuse the store later, reopen it with Chroma(persist_directory="./chroma_db", embedding_function=embeddings) instead of rebuilding from documents. Chroma is convenient for local development; the same LangChain vector store interface fronts server-backed stores like Pinecone or pgvector when you outgrow a single machine, so the rest of your code changes very little.
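One practical consequence of running ingestion more than once is duplicate chunks in the store. A common mitigation, sketched here as an assumption rather than something the pipeline above does, is to derive a stable ID from each chunk's source and text. The helper names are hypothetical, and whether re-adding an existing ID overwrites or is rejected depends on your vector store version, so check that before relying on it.

```python
import hashlib
from types import SimpleNamespace


def chunk_id(source: str, text: str) -> str:
    """Deterministic ID: the same source and text always hash to the same value."""
    digest = hashlib.sha256(f"{source}\n{text}".encode("utf-8")).hexdigest()
    return digest[:32]


def with_stable_ids(chunks):
    """Return (ids, chunks) with exact duplicates collapsed to their first occurrence.

    Works on any objects exposing .page_content and .metadata, as LangChain
    Documents do.
    """
    unique = {}
    for chunk in chunks:
        cid = chunk_id(chunk.metadata.get("source", ""), chunk.page_content)
        unique.setdefault(cid, chunk)
    return list(unique.keys()), list(unique.values())


# Stand-in chunks so the sketch runs without LangChain installed.
demo = [
    SimpleNamespace(page_content="Install with pip.", metadata={"source": "https://example.com/a"}),
    SimpleNamespace(page_content="Install with pip.", metadata={"source": "https://example.com/a"}),
    SimpleNamespace(page_content="Install with pip.", metadata={"source": "https://example.com/b"}),
]
ids, unique_chunks = with_stable_ids(demo)
print(len(ids), len(unique_chunks))

# Assumed usage; verify that your langchain-chroma version accepts ids here:
#     Chroma.from_documents(documents=unique_chunks, embedding=embeddings,
#                           ids=ids, persist_directory="./chroma_db")
```

Hashing the source together with the text means the same paragraph on two different pages keeps two entries, while a re-crawl of an unchanged page produces the same IDs as before.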

Step 4: Retrieve and generate the answer

With vectors in place, the query path is short. Turn the store into a retriever, embed the user's question, pull the closest chunks, and pass them to a chat model with a prompt that tells it to answer only from the supplied context. LangChain's expression language wires these into one chain.

python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

retriever = vector_store.as_retriever(search_kwargs={"k": 4})
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

prompt = ChatPromptTemplate.from_template(
    "Answer using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

def format_docs(docs):
    return "\n\n".join(d.page_content for d in docs)

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

answer = chain.invoke("What does the getting-started guide say about setup?")
print(answer)

Setting k=4 retrieves the four most relevant chunks; raise it for broad questions, lower it for tight ones. A temperature of 0 makes the output close to deterministic and discourages improvisation, but it is the prompt, which restricts the answer to the supplied context, that keeps responses grounded in what you actually crawled. Since every chunk carries its source metadata, you can surface citations by inspecting the retrieved documents directly with retriever.invoke(question).
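Surfacing citations mostly comes down to carrying the source field into the context and listing it next to the answer. The helpers below are a sketch with hypothetical names; they accept any objects with page_content and metadata, which is what a LangChain retriever returns, and the demo uses stand-ins so it runs without a vector store.

```python
from types import SimpleNamespace


def format_docs_with_sources(docs) -> str:
    """Number each retrieved chunk and label it with its source URL."""
    blocks = []
    for number, doc in enumerate(docs, start=1):
        source = doc.metadata.get("source", "unknown")
        blocks.append(f"[{number}] (source: {source})\n{doc.page_content}")
    return "\n\n".join(blocks)


def unique_sources(docs) -> list[str]:
    """Return the distinct source URLs in retrieval order."""
    seen = []
    for doc in docs:
        source = doc.metadata.get("source")
        if source and source not in seen:
            seen.append(source)
    return seen


# Stand-ins for the documents a retriever would return.
retrieved = [
    SimpleNamespace(page_content="Install the client with pip.",
                    metadata={"source": "https://example.com/docs/getting-started"}),
    SimpleNamespace(page_content="Export your token first.",
                    metadata={"source": "https://example.com/docs/getting-started"}),
    SimpleNamespace(page_content="Plans are billed monthly.",
                    metadata={"source": "https://example.com/docs/pricing"}),
]
print(format_docs_with_sources(retrieved))
print(unique_sources(retrieved))
```

Swapping format_docs for format_docs_with_sources in the chain puts the numbered labels into the prompt, so you can also ask the model to cite chunk numbers in its answer.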

Running the whole pipeline

Put the four steps in order in a single script and you have a complete pipeline: load_page over your URLs, split, embed into Chroma, then build the chain and invoke it. The first run crawls and embeds, which takes a moment; subsequent runs that reopen the persisted store skip both, so the only per-question cost is embedding the query, a local similarity search, and the LLM call. Add more URLs to the list and re-run ingestion to widen what the system knows.
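One way to keep ingestion and querying apart in a single script is a small command-line entry point with two subcommands. This is a sketch with hypothetical names: ingest and ask are passed in as callables, so the wiring is shown without repeating the stage code, and in a real script they would wrap Steps 1 to 3 and Step 4 respectively.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Two subcommands: 'ingest' builds the store, 'ask' queries it."""
    parser = argparse.ArgumentParser(description="Web-grounded RAG pipeline")
    commands = parser.add_subparsers(dest="command", required=True)

    ingest_cmd = commands.add_parser("ingest", help="crawl, split, and embed the given URLs")
    ingest_cmd.add_argument("urls", nargs="+")

    ask_cmd = commands.add_parser("ask", help="answer a question from the persisted store")
    ask_cmd.add_argument("question")
    ask_cmd.add_argument("-k", type=int, default=4, help="number of chunks to retrieve")
    return parser


def main(argv, ingest, ask):
    """Dispatch to the ingestion or query callable based on the subcommand."""
    args = build_parser().parse_args(argv)
    if args.command == "ingest":
        return ingest(args.urls)
    return ask(args.question, args.k)


# Demonstration with stub callables instead of the real stages.
print(main(["ingest", "https://example.com/docs/pricing"],
           ingest=lambda urls: f"ingested {len(urls)} page(s)",
           ask=lambda question, k: None))
print(main(["ask", "What does setup involve?", "-k", "2"],
           ingest=lambda urls: None,
           ask=lambda question, k: f"top {k} chunks for: {question}"))
```

In a real script the last lines would become a call such as main(sys.argv[1:], ingest=run_ingestion, ask=run_query), where both function names are placeholders for your own wrappers.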

From here the same structure extends naturally. Schedule the ingestion job to refresh sources on a cadence, point it at sitemaps to crawl whole sections, or swap Chroma for a hosted vector store as your corpus grows. For high-volume crawling you can move acquisition onto the asynchronous Crawling API or drive it from an agent over the Web MCP, and route everything through the Smart AI Proxy when you need IP rotation in front of your own fetcher. The pipeline contract does not change: clean text in, grounded answers out. For more on the extraction side of this, see how AI data extraction works.
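Pointing ingestion at a sitemap mostly means turning the sitemap XML into the URL list that load_page already consumes. Below is a sketch using only the standard library; urls_from_sitemap is a hypothetical helper, it assumes a plain urlset document in the standard sitemap namespace, and it does not follow sitemap index files or fetch anything itself.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text: str, prefix: str = "") -> list[str]:
    """Extract <loc> URLs from sitemap XML, optionally keeping one site section."""
    root = ET.fromstring(xml_text)
    locations = [
        element.text.strip()
        for element in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if element.text
    ]
    return [url for url in locations if url.startswith(prefix)]


SAMPLE_SITEMAP = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/getting-started</loc></url>
  <url><loc>https://example.com/docs/pricing</loc></url>
  <url><loc>https://example.com/blog/launch</loc></url>
</urlset>
""".strip()

doc_urls = urls_from_sitemap(SAMPLE_SITEMAP, prefix="https://example.com/docs/")
print(doc_urls)
```

The prefix filter is what lets you crawl one section, such as the documentation, without ingesting the whole site.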

Recap

Key takeaways

  • Acquisition is the quality lever. Clean markdown from the Crawling API beats raw HTML because boilerplate becomes noisy chunks that wreck retrieval.
  • Four stages, clear boundaries. Acquire, split, embed, retrieve-and-generate, so you can change any one without touching the others.
  • Chunk with overlap. RecursiveCharacterTextSplitter at ~1000 chars with 150 overlap keeps chunks coherent and facts intact across boundaries.
  • Ingest once, query many. Persist the vector store so the expensive embedding work happens only when sources change.
  • Ground the model. Restrict the prompt to retrieved context and keep temperature low so answers stay anchored to what you crawled.
  • Carry source metadata. Tag every document with its URL so you can cite the exact pages behind each answer.

Frequently Asked Questions (FAQs)

Why use Crawlbase instead of a built-in LangChain web loader?

Built-in loaders assume a page returns usable HTML on a plain request, which modern sites rarely do: they render content in the browser and block automated traffic. The Crawling API renders the page behind a rotating residential IP and returns clean markdown, so your documents are real content rather than empty shells or block pages. That cleanliness directly improves chunk quality and retrieval accuracy.

Should I request HTML or markdown for a RAG pipeline?

Markdown. Pass format=markdown so the page comes back as readable prose with headings and lists preserved and the navigation, scripts, and ad markup stripped. Those structural cues help the splitter break on natural boundaries, and removing boilerplate keeps low-value text out of your vector store. Request HTML only when you need to parse specific elements with selectors instead of embedding the page.

How do I handle JavaScript-heavy pages?

Use a JavaScript token and add ajax_wait and page_wait to the options you pass to api.get. The Crawling API then renders the page in a real browser, waits for asynchronous content, and returns the finished markdown. Because acquisition is isolated in the load_page function, enabling rendering does not affect splitting, embedding, or retrieval downstream.

What chunk size and overlap should I use?

Start at roughly 1000 characters per chunk with 150 characters of overlap for general prose. Smaller chunks improve precision on dense technical content; larger chunks suit long-form articles where context spans paragraphs. The overlap carries a little context across boundaries so a fact split between two chunks is still retrievable. Treat these as defaults and tune against your own retrieval results.

Do I have to use OpenAI for embeddings and the LLM?

No. The pipeline is provider-agnostic by design. Swap OpenAIEmbeddings and ChatOpenAI for any LangChain-supported embedding model and chat model, including local ones, and the splitting, storage, and retrieval code stays the same. Crawlbase sits entirely on the acquisition side, so your choice of model never affects how data is fetched.

How do I keep the knowledge base fresh?

Re-run the ingestion stages (acquire, split, embed) on a schedule against the URLs that change, and reopen the persisted store for queries in between. For large or frequently updated corpora, point the crawl at sitemaps and move acquisition onto the asynchronous Crawling API so you can ingest many pages without blocking your application.
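To avoid re-embedding pages that have not changed, a scheduled job can compare a hash of each freshly fetched page with the hash recorded on the previous run. This is a sketch of that idea with hypothetical helper names. Where the hashes are persisted (a JSON file, a database table) and how stale chunks are removed from the vector store are left open.

```python
import hashlib


def content_hash(markdown: str) -> str:
    """Fingerprint of a page's markdown."""
    return hashlib.sha256(markdown.encode("utf-8")).hexdigest()


def changed_pages(previous: dict[str, str], fetched: dict[str, str]) -> list[str]:
    """Return URLs whose markdown is new or differs from the recorded hash.

    previous maps URL -> hash from the last run; fetched maps URL -> markdown.
    """
    return [
        url for url, markdown in fetched.items()
        if previous.get(url) != content_hash(markdown)
    ]


# Demonstration: one page unchanged, one edited, one newly added.
last_run = {
    "https://example.com/docs/pricing": content_hash("# Pricing\n\nTen dollars."),
    "https://example.com/docs/setup": content_hash("# Setup\n\nInstall with pip."),
}
this_run = {
    "https://example.com/docs/pricing": "# Pricing\n\nTwelve dollars.",
    "https://example.com/docs/setup": "# Setup\n\nInstall with pip.",
    "https://example.com/docs/faq": "# FAQ\n\nNew page.",
}
print(changed_pages(last_run, this_run))
```

Only the URLs this returns need to go through the split and embed stages again, which keeps embedding cost proportional to what actually changed. Pages whose markdown includes timestamps or other volatile text will register as changed on every run, so you may need to normalize those before hashing.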
