Direct Answer: Crawlbase now lets developers scrape web pages as clean Markdown instead of raw HTML or JSON. Add format=md to your Crawling API request to receive Markdown, then add md_readability=true to extract the main readable content before conversion. The result is cleaner web data that can move directly into LLM prompts, embeddings, AI agents, and RAG pipelines with far less preprocessing.
Crawlbase delivers LLM-ready Markdown for clean web AI data through the Crawling API. The format=md parameter returns web pages as Markdown instead of raw HTML, and md_readability=true extracts the main readable content before conversion, stripping menus, scripts, and page clutter. Together they remove the separate HTML cleanup step that usually sits between crawling and AI workflows.
To help developers test it quickly, Crawlbase also provides a ready demo project on GitHub:
ScraperHub/crawlbase-delivers-llm-ready-markdown-for-clean-web-ai-data
The demo uses a lightweight Python script that reads your Crawlbase API token, requests a page with Markdown output enabled, then saves the response as a local .md file.
A typical page contains menus, scripts, tracking tags, sidebars, and layout markup that browsers need but models do not. Crawlbase removes much of that at the crawl itself, through a practical Markdown output API built for modern AI pipelines.
Table of Contents
- Why Markdown Is Better Than HTML for LLM Pipelines
- How Crawlbase Markdown Output Works
- Which Mode Should You Use?
- Why This Matters for RAG Pipelines
- How Crawlbase Simplifies Your AI Scraping Stack
- Simple Python Demo: Run Crawlbase Markdown Output in Minutes
- What the Demo Script Outputs
- Real Use Cases for LLM-ready Web Scraping
- Why AI Agents Benefit Most
- Start LLM-Ready Web Scraping with Crawlbase
- Frequently Asked Questions
Why Markdown Is Better Than HTML for LLM Pipelines
HTML was built for rendering pages in a browser. Markdown is much closer to what AI systems actually need: readable text with useful structure.
When raw HTML enters an LLM workflow, the model often has to sort through markup, boilerplate, and repeated page elements before it reaches the real content. That means tokens get wasted, chunking becomes messier, embeddings can become less precise, and summaries often need extra cleanup. AI agents can also become less reliable when their web tools return inconsistent or cluttered outputs.
Markdown removes most of that friction while keeping the important structure. Headings stay organized, paragraphs remain readable, lists are preserved, tables are easier to interpret, and links stay useful without being buried in code.
That makes Markdown easier to chunk, embed into a vector database, summarize, inspect manually, and pass directly into prompts or agent workflows.
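As a small illustration of why that structure helps, Markdown can be split into retrieval-sized chunks at its headings with a few lines of code (the splitting rule here is a common convention, not a Crawlbase feature):

```python
import re

def chunk_markdown(md: str) -> list[str]:
    """Split a Markdown document into chunks, starting a new chunk at each heading."""
    chunks, current = [], []
    for line in md.splitlines():
        # An ATX heading (#, ##, ...) marks a natural chunk boundary.
        if re.match(r"^#{1,6} ", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks

doc = "# Intro\nSome text.\n\n## Details\n- a list item\n"
for piece in chunk_markdown(doc):
    print(piece, "\n---")
```

The same split is far harder to do reliably on raw HTML, where headings are buried among layout tags.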
For teams doing web scraping for AI, the output format is not a small detail. It directly affects downstream quality.
How Crawlbase Markdown Output Works
Crawlbase supports native Markdown responses through the Crawling API.
Simply add the format parameter to your API request:
```
format=md
```
That tells Crawlbase to return Markdown instead of HTML.
To focus on the main page content, add:
```
md_readability=true
```
That enables readability extraction before conversion, helping remove surrounding clutter like menus, sidebars, and footer noise.
Basic cURL request format:
```bash
curl "https://api.crawlbase.com/?token=USER_TOKEN&url=https%3A%2F%2Fexample.com&format=md&md_readability=true"
```
The result is cleaner LLM-ready web scraping output in one request.
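The same request can be made from Python. A minimal sketch using only the standard library (the endpoint and parameter names come from the cURL example above; the token is read from an environment variable):

```python
import os
from urllib.parse import urlencode
from urllib.request import urlopen

def build_params(token: str, url: str, readability: bool = True) -> dict:
    """Assemble Crawling API query parameters for Markdown output."""
    params = {
        "token": token,        # your Crawlbase API token
        "url": url,            # target page; urlencode() escapes it
        "format": "md",        # return Markdown instead of HTML
    }
    if readability:
        params["md_readability"] = "true"  # main-content extraction first
    return params

def fetch_markdown(url: str, readability: bool = True) -> str:
    """Fetch a page as Markdown from the Crawlbase Crawling API."""
    query = urlencode(build_params(os.environ["CRAWLBASE_TOKEN"], url, readability))
    with urlopen(f"https://api.crawlbase.com/?{query}", timeout=60) as resp:
        return resp.read().decode("utf-8")

# markdown = fetch_markdown("https://example.com/")
```

Any HTTP client works the same way; the only Crawlbase-specific pieces are the two query parameters.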
format=md vs md_readability=true: Which Mode Should You Use?
Both options are useful depending on your workflow.
| Request Mode | Best Use Case |
|---|---|
| `format=md` | Preserve broader page context such as menus, related links, navigation |
| `format=md&md_readability=true` | Main content extraction for LLMs, RAG, summarization |
If your goal is embeddings, search, or prompting, start with readability enabled.
If your goal is site structure analysis or broader content capture, plain Markdown may be better.
Why This Matters for RAG Pipelines
RAG, short for Retrieval-Augmented Generation, is a method that gives language models access to external knowledge before generating an answer. Instead of relying only on training data, the model retrieves relevant documents or text chunks first, then uses that context to respond.
A typical RAG workflow is simple: fetch content, split it into chunks, create embeddings, store them in a vector database, retrieve relevant passages later, then send that context to the model.
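That workflow can be sketched end to end with a toy retriever; word-overlap scoring stands in for real embeddings and a vector database, so treat this purely as an illustration of the chunk-then-retrieve shape:

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercased word set, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))

def chunk(text: str, size: int = 40) -> list[str]:
    """Naive fixed-size chunking by word count."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks sharing the most words with the query."""
    return sorted(chunks, key=lambda c: len(tokens(query) & tokens(c)), reverse=True)[:k]

page = ("Crawlbase returns Markdown output. Markdown keeps headings and lists. "
        "Raw HTML carries menus and scripts.")
chunks = chunk(page, size=6)
context = retrieve("markdown headings and lists", chunks, k=1)
```

In a real pipeline the retrieved `context` is what gets sent to the model, which is exactly why noise in the chunks matters.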
However, if the original page is filled with junk text, repeated menus, cookie banners, or irrelevant links, that noise gets chunked and indexed alongside the useful content. When that happens, retrieval quality drops and answers become weaker.
Cleaner Markdown gives your pipeline a better starting point. Each chunk is more likely to contain meaningful text instead of layout clutter, which improves retrieval and makes the final response more reliable.
That is why RAG pipeline web data quality matters long before you ever call the model.
How Crawlbase Simplifies Your AI Scraping Stack
Without native Markdown output, many teams build something like this:
```
fetch HTML
→ parse it with CSS selectors
→ strip menus, banners, and scripts
→ convert what remains to text or Markdown
```
In this case, a website redesign can break your selectors. A new cookie banner can pollute extracted text. A parser may work well on one page template and fail on another. Suddenly, engineers are spending time fixing cleanup logic instead of improving the AI product itself.
Crawlbase reduces that overhead by moving much of the formatting work closer to the crawl.
With Markdown output enabled, the workflow becomes much simpler:
```
fetch Markdown with Crawlbase
```
This means fewer failure points and more engineering time spent on retrieval quality, prompts, agents, and product features.
Simple Python Demo: Run Crawlbase Markdown Output in Minutes
Crawlbase has a ready demo project on GitHub that shows how to request Markdown output and save it locally.
Repository:
ScraperHub/crawlbase-delivers-llm-ready-markdown-for-clean-web-ai-data
This demo keeps the setup intentionally small so developers can test fast.
Step 1: Clone the Demo Repository
```bash
git clone https://github.com/ScraperHub/crawlbase-delivers-llm-ready-markdown-for-clean-web-ai-data.git
```
Step 2: Create a Virtual Environment
Windows PowerShell
```powershell
python -m venv .venv
```
macOS / Linux
```bash
python3 -m venv .venv
```
Step 3: Install Requirements
```bash
pip install -r requirements.txt
```
Step 4: Add Your Crawlbase API Token
Windows PowerShell
```powershell
$env:CRAWLBASE_TOKEN="YOUR_TOKEN"
```
macOS / Linux
```bash
export CRAWLBASE_TOKEN="YOUR_TOKEN"
```
Step 5: Run the Demo
Use the default sample URL:
```bash
python crawlbase_markdown_demo.py
```
Or crawl your own page:
```bash
python crawlbase_markdown_demo.py --url "https://example.com/"
```
Step 6: Compare With and Without Readability
To keep broader page content:
```bash
python crawlbase_markdown_demo.py --url "https://example.com/" --no-md-readability
```
Step 7: Open the Output File
The script saves Markdown locally, usually to:
```
output/page.md
```
Open that file in any editor and inspect the result.
What the Demo Script Outputs
Once the demo runs successfully, it does two things: it saves the Markdown response to a local file and prints a short crawl summary in the terminal.
A typical output looks like this:
```
Original status: 200
```
This gives you immediate confirmation that the request worked, what the target site returned, and where the Markdown file was saved.
If a page redirects, times out, or returns incomplete content, your pipeline should know before it stores bad data or indexes weak content. Small checks at the ingestion stage can prevent bigger issues later in retrieval and answer quality.
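A small validation gate of that kind might look like this. The status check and minimum-length threshold are illustrative assumptions, not Crawlbase requirements; tune them to your own sources:

```python
def should_index(status: int, markdown: str, min_chars: int = 200) -> bool:
    """Decide whether a crawled Markdown response is worth storing or indexing."""
    if status != 200:
        return False  # redirects, timeouts, blocks: do not index
    if len(markdown.strip()) < min_chars:
        return False  # likely an error page or an empty shell (threshold is a guess)
    return True
```

Running this check at ingestion is far cheaper than discovering weak chunks later during retrieval.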

The generated Markdown file can capture product titles, links, category text, navigation labels, and page structure in a readable format. Instead of raw HTML full of scripts and layout code, you get structured text that is easier to inspect and process.
That makes it far more practical for web scraping for AI, internal search tools, or cleaner RAG pipeline web data ingestion.
Real Use Cases for LLM-ready Web Scraping
Markdown output becomes useful anywhere web content needs to become model-ready context.
- Documentation Chatbots: Keep product docs or help centers current by turning documentation pages into clean Markdown chunks for search and retrieval.
- AI Research Agents: Fetch articles, reports, filings, or public resources in a format models can read quickly.
- Competitor Monitoring: Track pricing pages, feature pages, changelogs, and announcements without parsing raw HTML every time.
- Internal Search Systems: Build searchable knowledge indexes using cleaner source material from across the web.
- Summarization Pipelines: Convert long pages into concise summaries with less preprocessing work.
These are practical examples of LLM-ready web scraping where output quality directly affects results.
Why AI Agents Benefit Most
AI agents often perform better when their tools return predictable, readable outputs.
If an agent fetches raw HTML, the model has to work through tags, layout code, and clutter before it can understand the page. That wastes tokens and adds friction.
If the same tool returns readability-filtered Markdown, the model receives something much closer to a usable document from the start.
That makes it easier to summarize pages, extract fields, compare sources, decide next actions, and cite evidence. For teams building autonomous workflows, cleaner tool output often leads to a cleaner agent loop.
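As an illustration, a web-reading tool in an agent loop can wrap the Markdown fetch and hand the model a predictable, token-bounded payload. The function name, payload shape, and truncation limit below are hypothetical, not part of any Crawlbase API:

```python
from typing import Callable

def web_read_tool(url: str, fetch: Callable[[str], str], max_chars: int = 4000) -> dict:
    """Agent tool: fetch a page as Markdown and return a predictable payload.

    `fetch` is any callable returning Markdown for a URL, e.g. a Crawlbase
    Crawling API call with format=md&md_readability=true.
    """
    md = fetch(url)
    return {
        "url": url,
        "content": md[:max_chars],          # keep tool output token-bounded
        "truncated": len(md) > max_chars,   # let the agent know content was cut
    }

# Example with a stubbed fetcher standing in for a real crawl:
result = web_read_tool("https://example.com/", fetch=lambda u: "# Title\nBody text")
```

Because the payload shape never changes, the agent's downstream prompt logic does not have to handle surprise HTML.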
Start LLM-Ready Web Scraping with Crawlbase
The web has no shortage of valuable information. The real challenge is turning that information into something AI systems can use efficiently.
Raw HTML often creates unnecessary cleanup work, especially for teams building retrieval systems, AI agents, and search workflows. Crawlbase removes most of that friction by returning clean Markdown directly from the crawl itself.
That makes Crawlbase a practical Markdown-output API for teams focused on LLM-ready and modern web scraping for AI use cases. Instead of spending engineering time stripping HTML, you can move faster on chunking, embeddings, retrieval quality, and product features that matter.
For companies building search systems or retrieval workflows, cleaner source content also leads to stronger RAG pipeline web data from the start.
Start using Crawlbase Markdown output today. Use your 1,000 free requests to test cleaner AI-ready web data on your own URLs.
Frequently Asked Questions (FAQs)
1. What is LLM-ready web scraping?
LLM-ready web scraping means collecting web content in a format that language models can use immediately with minimal cleanup. Instead of raw HTML filled with scripts, styling, and navigation clutter, the output is cleaner, structured text such as Markdown that is easier to chunk, embed, summarize, and pass into prompts.
2. Why is Markdown better than HTML for RAG pipelines?
Markdown is usually better for RAG because it preserves useful structure like headings, lists, links, and tables without unnecessary markup. That creates cleaner chunks, better embeddings, and more relevant retrieval results compared with noisy raw HTML.
3. How do I get Markdown output from Crawlbase?
Use the Crawlbase Crawling API and add format=md to your request. If you also want main-content extraction before conversion, add md_readability=true. This returns cleaner Markdown that can be used directly in AI workflows, search systems, or agent pipelines.










