If you have ever piped a scraped web page straight into a language model, you already know the problem: the page is mostly noise. Navigation menus, cookie banners, inline scripts, tracking pixels, and layout wrappers all get fed to the model alongside the few paragraphs you actually care about. The model burns tokens reading markup it will never use, and the extra clutter makes retrieval and summaries less reliable.

This guide shows a cleaner path. The Crawling API can return a page as tidy Markdown instead of raw HTML, so you hand your model readable text instead of tag soup. We will cover how to request Markdown output, why Markdown beats HTML for LLM and RAG token budgets, and a small end-to-end pipeline: fetch Markdown, chunk it, then embed or prompt. The idea to keep in mind throughout is LLM-ready Markdown web scraping: the output format is what makes the rest of your stack easier.

Why Markdown beats raw HTML for LLMs

HTML was built to render pages in a browser. It carries everything a layout engine needs: nested divs, class names, inline styles, scripts, and ARIA attributes. A model needs almost none of that. When raw HTML enters an LLM workflow, the model has to wade through markup and boilerplate before it reaches the real content, and that has real costs.

Markdown keeps the structure that matters and drops the rest. Headings stay headings, lists stay lists, tables stay legible, and links stay useful without being buried in attributes. The practical wins line up cleanly:

  • Token budget. A typical article page can be several times larger in raw HTML than in its Markdown equivalent once menus, scripts, and wrappers are stripped. Fewer tokens means lower cost per call and more room for actual context inside the model's window.
  • Accuracy. A model reading clean prose is less likely to latch onto a stray nav label or a cookie-consent string than one parsing a wall of divs. Less noise in, fewer wrong conclusions out.
  • Chunking. Markdown headings give you natural split points. You can chunk on ## boundaries and keep semantically related text together instead of slicing through the middle of a sentence at an arbitrary character count.
  • Inspectability. When something goes wrong downstream, you can open a Markdown file and read it. Debugging a 200 KB HTML blob is a different kind of afternoon.
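
To put a rough number on the token-budget point for your own pages, a sketch like the following can help. It assumes a crude heuristic of about four characters per token, which is only an approximation for English text; a real tokenizer for your model will give different counts. The `estimate_tokens` and `token_savings` helpers and the sample strings are illustrative, not part of any API.

```python
def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate; assumes ~4 characters per token (a heuristic)."""
    if not text:
        return 0
    return max(1, len(text) // chars_per_token)


def token_savings(html, markdown):
    """Return (html_tokens, markdown_tokens, fraction_saved) using the rough estimate."""
    html_tokens = estimate_tokens(html)
    md_tokens = estimate_tokens(markdown)
    saved = 1 - md_tokens / html_tokens if html_tokens else 0.0
    return html_tokens, md_tokens, saved


# Illustrative inputs: the same heading and sentence, with and without markup.
sample_html = (
    '<div class="wrap"><nav><a href="/">Home</a></nav>'
    '<h1 class="title">Pricing</h1><p style="margin:0">Plans start at $10.</p></div>'
)
sample_md = "# Pricing\n\nPlans start at $10."

html_tokens, md_tokens, saved = token_savings(sample_html, sample_md)
print(f"HTML ~{html_tokens} tokens, Markdown ~{md_tokens} tokens, saved {saved:.0%}")
```

Running the same comparison on a handful of real pages from your target sites gives a more honest picture than any general ratio.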

For teams doing web scraping for AI, the output format is not a small detail. It sets the quality ceiling for everything that happens after the fetch. For a wider view of how cleaning shapes model results, see how to structure and clean web scraped data for AI and ML.

How to request Markdown from the Crawling API

Crawlbase returns Markdown natively. You do not bolt on a separate HTML-to-Markdown converter; you ask the Crawling API for Markdown and it does the conversion server-side as part of the crawl.

The control is a single parameter. Add format=md to your request and the API returns Markdown instead of HTML.

bash
curl "https://api.crawlbase.com/?token=YOUR_TOKEN&url=https%3A%2F%2Fexample.com&format=md"

If you want only the main readable content, add md_readability=true. That runs readability extraction before conversion, stripping menus, sidebars, and footer noise so the Markdown contains the article body and little else.

bash
curl "https://api.crawlbase.com/?token=YOUR_TOKEN&url=https%3A%2F%2Fexample.com&format=md&md_readability=true"

Both modes have a place. Plain format=md preserves broader page context such as navigation and related links, which is handy when you are mapping a site's structure. Adding md_readability=true gives you main-content extraction, which is what you want for embeddings, summarization, and RAG. If your goal is to feed a model, start with readability on.
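
The curl examples above percent-encode the target URL by hand. In code it is usually easier to let the standard library do that. The sketch below uses only the parameters named in this guide; `build_request_url` is a hypothetical helper name, and the token value is a placeholder.

```python
from urllib.parse import urlencode

API = "https://api.crawlbase.com/"


def build_request_url(token, target_url, readability=True):
    """Build a Crawling API URL that asks for Markdown output."""
    params = {"token": token, "url": target_url, "format": "md"}
    if readability:
        # Main-content extraction before conversion, as described above.
        params["md_readability"] = "true"
    # urlencode percent-encodes the target URL so its own ":" and "/" are safe.
    return API + "?" + urlencode(params)


print(build_request_url("YOUR_TOKEN", "https://example.com"))
print(build_request_url("YOUR_TOKEN", "https://example.com", readability=False))
```

Keeping readability as a flag makes it easy to run both modes against the same page and compare what each one keeps.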

Markdown still needs an unblocked fetch

Markdown output formats whatever the API manages to load. If the target site blocks datacenter traffic or renders its content with JavaScript, you still need the API to get past those defenses first. Pair format=md with a JavaScript token for client-rendered pages, and let the API rotate IPs for protected sites. Clean Markdown of an empty shell is still an empty shell.
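
One way to act on this, sketched under assumptions: try the normal token first and retry with the JavaScript token only when the body looks like an empty shell. The `fetch` argument is any callable you supply that takes a URL and a token and returns Markdown text; the function name, the 200-character threshold, and the token strings are all hypothetical. Note that the fallback costs a second request.

```python
def fetch_with_js_fallback(url, fetch, normal_token, js_token, min_chars=200):
    """Try the normal token first; retry with the JS token if the body looks empty.

    `fetch` is any callable taking (url, token) and returning Markdown text.
    Returns (markdown, which_token_was_used).
    """
    markdown = fetch(url, normal_token)
    if markdown and len(markdown.strip()) >= min_chars:
        return markdown, "normal"
    # A thin body often means the content is rendered client-side.
    return fetch(url, js_token), "js"


def fake_fetch_with_token(url, token):
    # Hypothetical stand-in so the example runs offline:
    # only the JS token yields real content for this page.
    if token == "JS_TOKEN":
        return "# Dashboard\n\n" + "Rendered content. " * 20
    return ""


markdown, used = fetch_with_js_fallback(
    "https://example.com/app", fake_fetch_with_token, "NORMAL_TOKEN", "JS_TOKEN"
)
print(used, len(markdown))
```

If you already know a site is client-rendered, skipping the first attempt and going straight to the JS token avoids the wasted call.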

Use the Markdown output in a small RAG pipeline

Retrieval-augmented generation, or RAG, gives a model access to outside knowledge before it answers. Instead of relying only on training data, the system retrieves relevant text first, then passes that context to the model. The usual shape is: fetch content, split it into chunks, embed those chunks into a vector store, retrieve the relevant ones at query time, then prompt the model with them.

The quality of that pipeline is decided long before you call the model. If the fetched page is full of repeated menus, cookie banners, and dead links, that noise gets chunked and indexed right alongside the useful text, and retrieval quality drops. Clean Markdown gives every chunk a better chance of holding meaningful content. Here is the fetch step, using readability so each document is mostly body text.

python
import requests

API = "https://api.crawlbase.com/"
TOKEN = "YOUR_TOKEN"

def fetch_markdown(url):
    params = {
        "token": TOKEN,
        "url": url,
        "format": "md",
        "md_readability": "true",
    }
    resp = requests.get(API, params=params, timeout=60)
    resp.raise_for_status()
    return resp.text

With the Markdown in hand, split it into chunks. Because Markdown keeps its headings, you can split on heading boundaries instead of cutting blindly at a character count, which keeps each chunk topically coherent.

python
import re

def chunk_by_heading(markdown, max_chars=1200):
    # Split just before each level 1-3 heading so a heading stays with its text.
    sections = re.split(r"(?=^#{1,3} )", markdown, flags=re.MULTILINE)
    chunks = []
    for section in sections:
        text = section.strip()
        if not text:
            continue
        if len(text) <= max_chars:
            chunks.append(text)
        else:
            # Oversized section: fall back to fixed-size slices, which can cut mid-sentence.
            for i in range(0, len(text), max_chars):
                chunks.append(text[i : i + max_chars])
    return chunks
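
When a single section is longer than `max_chars`, the function above falls back to fixed-size slices, which can still cut through a sentence. A gentler fallback, sketched here as one possible approach rather than the only one, packs whole paragraphs into each chunk and hard-slices only when a single paragraph exceeds the limit. `split_on_paragraphs` is a hypothetical helper and assumes paragraphs are separated by blank lines.

```python
def split_on_paragraphs(text, max_chars=1200):
    """Pack whole paragraphs into chunks of at most max_chars where possible."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
            continue
        # The paragraph does not fit: close the chunk built so far.
        if current:
            chunks.append(current)
        if len(para) <= max_chars:
            current = para
        else:
            # A single paragraph longer than the limit: hard-slice it.
            for i in range(0, len(para), max_chars):
                chunks.append(para[i : i + max_chars])
            current = ""
    if current:
        chunks.append(current)
    return chunks


demo = "a" * 50 + "\n\n" + "b" * 50 + "\n\n" + "c" * 50
print([len(c) for c in split_on_paragraphs(demo, max_chars=110)])
```

You could call this in place of the fixed-size loop for oversized sections, so most chunks end on a paragraph boundary.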

From here the last step is whatever your stack already does: embed each chunk into a vector database for retrieval, or, for a quick test, drop the chunks straight into a prompt. The point is that the input is now clean text, so the embedding and prompt steps inherit that cleanliness.

python
url = "https://example.com/some-article"
markdown = fetch_markdown(url)
chunks = chunk_by_heading(markdown)

# Quick test: use the first three chunks as context.
# A real pipeline would retrieve chunks by relevance to the question.
context = "\n\n".join(chunks[:3])
prompt = f"Answer using only this context:\n\n{context}\n\nQ: ..."
print(len(chunks), "chunks ready for embedding or prompting")

If you want to go deeper on the retrieval and modeling side of this, how AI data extraction works covers how clean inputs flow through to model output.

What the cleanup step used to cost you

Without native Markdown output, the common pattern is a brittle preprocessing chain: fetch the HTML, parse the DOM, strip scripts and styles, remove navigation, find the article body, normalize whitespace, then convert to Markdown, and only then chunk and embed. Every link in that chain is a place to fail.

A site redesign can break your body-extraction selectors overnight. A new cookie banner can leak into your extracted text. A parser tuned for one page template can quietly mangle another. The result is engineers spending their time maintaining cleanup logic instead of improving retrieval quality, prompts, or the product itself.

Returning Markdown closer to the crawl collapses that chain. The workflow becomes fetch Markdown, validate the response, chunk, embed. Fewer moving parts means fewer silent failures and more time on the parts of the system that actually move the needle. If you are running this across many sites, the same logic that simplifies one fetch compounds at scale, which is the focus of large-scale web scraping.

Validate the response before you index it

One habit pays for itself: check the response at ingestion time, before bad data reaches your vector store. A page that redirects, times out, or returns a thin body should be caught early, because a weak chunk indexed today becomes a wrong answer next week.

python
def is_usable(markdown, min_chars=200):
    if markdown is None:
        return False
    stripped = markdown.strip()
    # Reject empty shells and near-empty error pages
    return len(stripped) >= min_chars

md = fetch_markdown(url)
if not is_usable(md):
    print("Skipping: thin or empty response")
else:
    chunks = chunk_by_heading(md)
    # proceed to embed / index

This is a small guardrail, but it is the difference between a retrieval system that stays trustworthy and one that slowly fills with junk. Cleaner source content plus a basic sanity check keeps the web data in your RAG pipeline trustworthy from the first request.
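
The length check covers thin bodies, but timeouts and HTTP errors surface as exceptions rather than short strings. A minimal sketch of an ingestion loop that handles both and records why each URL was skipped might look like this. The `ingest` function is hypothetical, and `fetch` is any callable you pass in; catching `OSError` relies on the fact that the exceptions raised by `requests` derive from it.

```python
def ingest(urls, fetch, min_chars=200):
    """Fetch each URL and sort results into usable documents and skipped URLs.

    `fetch` is any callable that returns Markdown text or raises on failure.
    """
    usable, skipped = {}, {}
    for url in urls:
        try:
            markdown = fetch(url)
        except OSError as exc:
            # Covers network-level failures such as timeouts and, with
            # requests, HTTP errors raised by raise_for_status().
            skipped[url] = f"fetch failed: {exc}"
            continue
        if markdown is None or len(markdown.strip()) < min_chars:
            skipped[url] = "thin or empty response"
            continue
        usable[url] = markdown
    return usable, skipped


def fake_fetch(url):
    # Hypothetical stand-in for fetch_markdown so the example runs offline.
    if "timeout" in url:
        raise TimeoutError("timed out")
    if "thin" in url:
        return "Access denied"
    return "# Title\n\n" + "Body text. " * 30


usable, skipped = ingest(
    [
        "https://example.com/ok",
        "https://example.com/thin",
        "https://example.com/timeout",
    ],
    fake_fetch,
)
print(sorted(usable), skipped)
```

Logging the skip reason per URL makes it easier to tell a blocked site from a genuinely empty page when you review a crawl.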

Where LLM-ready Markdown fits best

Markdown output earns its keep anywhere web content has to become model-ready context:

  • Documentation chatbots. Turn help-center and product-doc pages into clean Markdown chunks for search and retrieval, and keep them current with a periodic re-crawl.
  • AI research agents. Fetch articles, reports, and public filings in a format a model can read quickly, without the agent burning its budget on markup.
  • Competitor and market monitoring. Track pricing pages, feature pages, and changelogs as readable text rather than re-parsing raw HTML on every run.
  • Internal search. Build a searchable knowledge index on cleaner source material drawn from across the web.
  • Summarization pipelines. Collapse long pages into concise summaries with far less preprocessing.

Agents in particular benefit. When a tool returns readability-filtered Markdown instead of raw HTML, the model gets something close to a usable document from the start. That makes it easier to summarize, extract fields, compare sources, and decide the next action, which tends to produce a cleaner agent loop. If you are routing agent traffic through rotating IPs, what an AI proxy is explains how that layer fits with tools like the Smart AI Proxy and the Web MCP server. And when the goal is structured fields rather than prose, the Crawling API returns parsed JSON instead.
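
For the periodic re-crawl mentioned in the documentation-chatbot case, one common pattern is to fingerprint each page's Markdown and re-embed only when the fingerprint changes. The sketch below is an assumption about how you might structure that, not a Crawlbase feature; `content_fingerprint`, `needs_reindex`, and the in-memory `seen` dictionary are all hypothetical, and a real system would persist the fingerprints.

```python
import hashlib


def content_fingerprint(markdown):
    """Hash whitespace-normalized Markdown so trivial spacing changes are ignored."""
    normalized = " ".join(markdown.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def needs_reindex(url, markdown, seen):
    """Return True and update `seen` when the page changed since the last crawl."""
    fingerprint = content_fingerprint(markdown)
    if seen.get(url) == fingerprint:
        return False
    seen[url] = fingerprint
    return True


seen = {}
page = "https://example.com/docs/install"
first = needs_reindex(page, "# Install\n\nRun the installer.", seen)
same = needs_reindex(page, "# Install\n\nRun   the installer.\n", seen)
changed = needs_reindex(page, "# Install\n\nRun the new installer.", seen)
print(first, same, changed)
```

Skipping unchanged pages saves embedding calls on every re-crawl, which matters more as the number of tracked pages grows.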

Keeping the crawl unblocked

Clean output only helps if you can fetch the page in the first place. The sites worth scraping for AI context are often the ones that defend against bots, so the fetch step has to handle blocks as well as formatting. Routing through the Crawling API means IP rotation and rendering are handled server-side, but the broader habits still apply: pace your requests, vary your targets, and read status codes as signal. The full playbook lives in how to scrape websites without getting blocked.
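
Reading status codes as signal can be as simple as retrying with a growing delay when a response suggests you should slow down. The following is a generic sketch, not Crawlbase-specific behavior: `fetch` is any callable you supply that returns a status code and a body, and the set of retryable codes is an assumption you should adjust to what you actually observe.

```python
import time

# Assumed set of "try again later" statuses; tune for your targets.
RETRYABLE = {429, 500, 502, 503, 504}


def fetch_with_backoff(url, fetch, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Retry on retryable status codes with exponential backoff.

    `fetch` is any callable returning (status_code, body).
    Raises RuntimeError on a non-retryable status or when attempts run out.
    """
    for attempt in range(max_attempts):
        status, body = fetch(url)
        if status == 200:
            return body
        if status not in RETRYABLE or attempt == max_attempts - 1:
            raise RuntimeError(f"giving up on {url}: status {status}")
        # Wait 1x, 2x, 4x ... the base delay before the next attempt.
        sleep(base_delay * 2 ** attempt)


# Offline demo: a fake fetch that is rate-limited twice, then succeeds.
responses = iter([(429, ""), (503, ""), (200, "ok")])
delays = []
body = fetch_with_backoff(
    "https://example.com", lambda u: next(responses), sleep=delays.append
)
print(body, delays)
```

Passing `sleep` as an argument keeps the function testable without real waiting; in production you would leave the default in place.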

Key takeaways

  • Markdown is the right shape for models. It keeps headings, lists, and tables while dropping the markup that wastes tokens and confuses retrieval.
  • One parameter switches the format. Add format=md to the Crawling API request; add md_readability=true to extract just the main content.
  • Cleaner input lifts the whole pipeline. Better chunks lead to better embeddings and more relevant retrieval, all decided before you call the model.
  • Server-side conversion removes a brittle chain. Fetch Markdown, validate, chunk, embed, instead of maintaining DOM-stripping and HTML-to-Markdown logic yourself.
  • Validate at ingestion. A quick length check catches empty shells and error pages before they poison your index.
  • Markdown still needs an unblocked fetch. Pair the format with a JS token and IP rotation so the API loads the real content first.

Frequently Asked Questions (FAQs)

What is LLM-ready Markdown web scraping?

It means collecting web content in a format a language model can use immediately, with little or no cleanup. Instead of raw HTML full of scripts, styles, and navigation, the output is clean structured Markdown that is easy to chunk, embed, summarize, and drop into prompts. With Crawlbase you get it by adding format=md to a Crawling API request.

How do I get Markdown output from the Crawlbase Crawling API?

Add format=md to your request and the API returns Markdown instead of HTML. If you also want main-content extraction before conversion, add md_readability=true, which removes menus, sidebars, and footer noise so the Markdown is mostly article body. Both parameters are part of the standard request, so no extra setup is needed.

Why is Markdown better than HTML for RAG pipelines?

Markdown preserves useful structure such as headings, lists, links, and tables without the surrounding markup. That produces cleaner chunks, more precise embeddings, and more relevant retrieval than noisy raw HTML, where boilerplate gets indexed alongside the real content and drags answer quality down.

Does Markdown output reduce token usage with LLMs?

Yes. Stripping scripts, styles, and layout wrappers makes the same page much smaller in token terms, especially with readability enabled. That lowers cost per call and leaves more of the model's context window for the content that matters rather than for markup it would otherwise have to read and discard.

Can I still get the full page context, not just the main article?

Yes. Use format=md on its own, without md_readability=true. Plain Markdown keeps broader page context like navigation and related links, which is useful for site-structure analysis. Turn readability on only when you want the main content isolated for embeddings, summarization, or prompting.

Do I need a JavaScript token to get Markdown from dynamic pages?

If the target page renders its content client-side, yes. Markdown formatting runs on whatever the API loads, so for a JavaScript-rendered page you pass a JS token so the page renders in a real browser first, then request format=md. For static pages a normal token is enough.
