Summarize Web Data with Crawlbase and AI

Reading one web page and writing a quick summary is no trouble. Doing it for a few hundred pages, every morning, is a different job, and it is exactly the kind of work a language model is good at. The hard part was never the summarizing. It is getting clean, readable text off pages that fight back, then feeding a model more text than its context window can hold without losing the thread.

This guide walks through both halves end to end. You will collect pages with the Crawlbase Crawling API as clean markdown, then summarize them with an LLM, and when a page is too long for one call you will chunk it and run a map-reduce summary so nothing falls off the edge. Everything here is runnable Python scoped to public web content. By the end you have a small pipeline that turns a list of URLs into short, consistent summaries you can store, search, or drop into a report.

Why summarize web data with Crawlbase and AI

A single page view tells you what one document says right now. The value shows up when you do it at volume: tracking what a set of competitor pages say over time, condensing a feed of articles into a daily digest, or turning a pile of product and review pages into a few lines a human will actually read. An LLM is fast and consistent, so it applies the same criteria to every document and does not get tired on page two hundred.

The model is only as good as the text you hand it, though, and that is where most "AI summarizer" projects quietly break. Modern pages are JavaScript-heavy, wrapped in nav, ads, cookie banners, and boilerplate, and many of them block automated traffic outright. Pipe raw HTML into a model and you waste tokens summarizing markup and menus instead of content. The fix is to separate collection from summarization: let Crawlbase handle rendering, unblocking, and clean extraction, and let the model do what it is good at. For a deeper look at that extraction step, see how AI data extraction works.

How the pipeline fits together

There are two stages, and keeping them separate is what makes the whole thing maintainable.

Collect. The Crawling API fetches each URL behind a trusted IP, renders JavaScript when needed, and returns clean markdown instead of raw HTML. That means the text you summarize is already stripped of nav, scripts, and styling.
Summarize. An LLM reads the markdown and returns a short summary. For pages that fit in the model's context window, that is one call. For long pages, you split the text into chunks, summarize each, then summarize the summaries. That last pattern is map-reduce.

Asking Crawlbase for markdown rather than HTML matters more than it sounds. Markdown keeps headings, lists, and structure while dropping the noise, so the model spends its context budget on meaning. More on that choice in LLM-ready markdown web scraping.

Markdown over raw HTML

The Crawling API can return a page as markdown when you pass format=markdown (or the scraper's markdown option). Always prefer that over raw HTML for summarization. Raw HTML burns tokens on tags and inline styles the model does not need, and the extra noise tends to hurt summary quality. Markdown keeps the structure that helps the model and drops the rest.

Set up the project

You need Python 3 and two accounts: a free Crawlbase account for the token, and an OpenAI account for the model. Create the Crawlbase account first; you get up to 20,000 free API requests: 1,000 on signup, and more as you complete onboarding steps, which is plenty to follow this guide. Copy your Normal request token from the Account Documentation page, and grab an API key from OpenAI as well.

Then create a project folder and install the libraries.

bash

python --version

mkdir web-summarizer && cd web-summarizer
python -m venv .venv && source .venv/bin/activate
pip install requests openai tiktoken

Three dependencies do the work: requests calls the Crawling API, openai is the model client, and tiktoken counts tokens so you know when a page is too big for a single call. Set your two secrets as environment variables so they stay out of the code.

bash

export CRAWLBASE_TOKEN="your_crawlbase_normal_token"
export OPENAI_API_KEY="your_openai_api_key"

Step 1: Fetch a page as clean markdown

Start with collection. You send the Crawling API the target URL and a format=markdown option, and it returns the page already converted to markdown. The function below wraps that call, raises if the API answers with an HTTP error status, and hands back just the markdown body so the rest of the pipeline never sees raw HTML.

python

import os
import requests

CRAWLBASE_TOKEN = os.environ["CRAWLBASE_TOKEN"]
API_ENDPOINT = "https://api.crawlbase.com/"

def fetch_markdown(url: str) -> str:
    params = {
        "token": CRAWLBASE_TOKEN,
        "url": url,
        "format": "markdown",
    }
    response = requests.get(API_ENDPOINT, params=params, timeout=90)
    response.raise_for_status()
    return response.text

if __name__ == "__main__":
    markdown = fetch_markdown("https://www.crawlbase.com/blog/")
    print(markdown[:800])

Run it and you get the page text as markdown, headings and lists intact, with the page chrome already gone. If your target renders content with JavaScript, the same call works with the JavaScript token instead of the Normal one, so the page is rendered in a real browser before it is converted. Swap the token and you are summarizing single-page-app content with no other code changes.
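The token swap can be kept out of the call site with a small helper. This is a minimal sketch, not Crawlbase's own client: build_params and build_request_url are illustrative names, and CRAWLBASE_JS_TOKEN is a hypothetical environment variable chosen here to hold the JavaScript token.

```python
import os
from urllib.parse import urlencode

API_ENDPOINT = "https://api.crawlbase.com/"


def build_params(url: str, render_js: bool = False, env=os.environ) -> dict:
    """Build Crawling API query parameters for one URL.

    Assumption: CRAWLBASE_JS_TOKEN is a hypothetical variable name picked
    for this sketch; CRAWLBASE_TOKEN is the Normal token set up earlier.
    """
    token_var = "CRAWLBASE_JS_TOKEN" if render_js else "CRAWLBASE_TOKEN"
    token = env.get(token_var)
    if not token:
        raise RuntimeError(f"Set {token_var} before fetching {url}")
    return {"token": token, "url": url, "format": "markdown"}


def build_request_url(url: str, render_js: bool = False, env=os.environ) -> str:
    """Return the full GET URL; urlencode percent-encodes the target URL."""
    return API_ENDPOINT + "?" + urlencode(build_params(url, render_js, env))
```

Passing the result of build_params as the params argument of requests.get keeps the fetch function unchanged; only the render_js flag differs between static and JavaScript-heavy targets.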

Step 2: Summarize a short page in one call

When a page comfortably fits inside the model's context window, summarizing it is a single request. The function below takes markdown text and a short instruction, sends it to the model with a low temperature for consistency, and returns the summary string. Keeping temperature low matters here: you want the same input to produce stable output across runs, not creative variety.

python

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
MODEL = "gpt-4o-mini"

def summarize(text: str, instruction: str) -> str:
    prompt = f"{instruction}\n\n---\n\n{text}"
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    return response.choices[0].message.content.strip()

SUMMARY_PROMPT = (
    "Summarize the following web page in 4-6 sentences. "
    "Lead with the main point, then the key supporting facts. "
    "Ignore navigation, ads, and boilerplate."
)

if __name__ == "__main__":
    page = fetch_markdown("https://www.crawlbase.com/blog/")
    print(summarize(page, SUMMARY_PROMPT))

That is the entire happy path for a normal article. Fetch markdown, send it with an instruction, print the result. The model handles the language work; Crawlbase handled the data work. The one case this does not cover is a page too long for a single call, which is the next step.

Step 3: Handle long pages with chunking

Every model has a context window, a hard limit on how much text it can read in one call. Long-form articles, documentation pages, and forum threads can blow past it, and when they do the API rejects the request. The fix is to split the text into chunks that each fit, with a small overlap so a sentence cut in half at a boundary still appears whole in one of the chunks.

Use tiktoken to count tokens, not characters, since the limit is measured in tokens. The function below walks the token list and slices it into windows of a fixed size.

python

import tiktoken

encoder = tiktoken.encoding_for_model("gpt-4o-mini")

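def chunk_text(text: str, max_tokens: int = 2000, overlap: int = 150):
    # overlap must stay below max_tokens, or the window never advances
    tokens = encoder.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = start + max_tokens
        chunks.append(encoder.decode(tokens[start:end]))
        if end >= len(tokens):
            break  # this window reached the end; skip a redundant tail chunk
        start = end - overlap  # step back so neighbouring chunks overlap
    return chunks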

Each chunk is now a self-contained piece of text small enough to summarize on its own. A max_tokens of 2,000 leaves comfortable room for the prompt and the response inside a modern context window; lower it if you are on a smaller model. The overlap stops you from losing the boundary sentence between two chunks. With clean markdown from Crawlbase as the input, these chunks are pure content, which keeps the chunk count low and the summaries on topic.
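To see where the boundaries fall, the same sliding-window arithmetic can be run on plain indices, with no tokenizer involved. This is a small sketch: window_bounds is an illustrative helper, not part of tiktoken. It stops as soon as a window reaches the end of the input, so a page shorter than max_tokens yields exactly one window.

```python
def window_bounds(n_tokens: int, max_tokens: int = 2000, overlap: int = 150):
    """Return (start, end) index pairs for overlapping token windows."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be non-negative and smaller than max_tokens")
    bounds = []
    start = 0
    while start < n_tokens:
        end = min(start + max_tokens, n_tokens)
        bounds.append((start, end))
        if end == n_tokens:
            break  # the last window already reaches the end
        start = end - overlap  # step back so neighbours share `overlap` tokens
    return bounds


print(window_bounds(5000))  # [(0, 2000), (1850, 3850), (3700, 5000)]
print(window_bounds(1900))  # [(0, 1900)]
```

Each window after the first starts 150 tokens before the previous one ended, which is the overlap that keeps a boundary sentence whole in at least one chunk.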

Step 4: Combine chunk summaries with map-reduce

Chunking gives you several pieces; map-reduce turns them back into one answer. The pattern has two phases. In the map phase you summarize each chunk independently, producing a list of partial summaries. In the reduce phase you concatenate those partials and summarize them together into a single final summary. If the combined partials are themselves too long, you reduce again, repeating until one summary remains.

python

MAP_PROMPT = (
    "Summarize this section of a longer document in 3-4 sentences. "
    "Keep concrete facts, names, and numbers."
)

REDUCE_PROMPT = (
    "The following are summaries of consecutive sections of one document. "
    "Combine them into a single coherent summary of 5-7 sentences, "
    "removing repetition and keeping the overall narrative."
)

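def summarize_long(text: str) -> str:
    chunks = chunk_text(text)

    # Short (or empty) input: one direct call, no map-reduce needed.
    if len(chunks) <= 1:
        return summarize(text, SUMMARY_PROMPT)

    # Map: summarize every chunk independently.
    partials = [summarize(c, MAP_PROMPT) for c in chunks]
    combined = "\n\n".join(partials)

    # If the joined partials are still too long, map over them again.
    while len(encoder.encode(combined)) > 2000:
        partials = [summarize(c, MAP_PROMPT) for c in chunk_text(combined)]
        combined = "\n\n".join(partials)

    # Reduce: merge the partials into one final summary.
    return summarize(combined, REDUCE_PROMPT)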

This single function now handles any length. A short page takes one call and returns immediately. A long page is mapped, reduced, and reduced again if needed, with the loop repeating until the combined partials fit in a single call. That relies on each round shrinking the text, which holds as long as the model returns summaries shorter than its input. The separate map and reduce prompts matter: the map prompt asks for fact-dense partials so detail survives the first pass, and the reduce prompt asks for a clean narrative so the final summary reads like one piece rather than a stitched list.
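The control flow can be exercised without any API calls by passing the summarizer in as a function. This is a minimal sketch under stated assumptions: map_reduce, split_words, and the max_rounds cap are illustrative names rather than anything from a library, words stand in for tokens to keep the example dependency-free, and a single summarizer function replaces the separate map and reduce prompts.

```python
from typing import Callable, List


def split_words(text: str, max_words: int) -> List[str]:
    """Split text into consecutive pieces of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def map_reduce(text: str, summarize_fn: Callable[[str], str],
               max_words: int = 50, max_rounds: int = 5) -> str:
    """Summarize pieces, then re-summarize the joined partials until they fit.

    max_rounds caps the loop in case the summaries fail to shrink the text.
    """
    combined = text
    for _ in range(max_rounds):
        pieces = split_words(combined, max_words)
        if len(pieces) <= 1:
            break  # everything fits in one call now
        combined = " ".join(summarize_fn(p) for p in pieces)  # map step
    return summarize_fn(combined)  # reduce step


# Stand-in "model": keeps the first five words of whatever it is given.
def first_five(piece: str) -> str:
    return " ".join(piece.split()[:5])


long_text = " ".join(f"w{i}" for i in range(1000))
print(map_reduce(long_text, first_five))
```

The round cap is a design choice for this sketch: if a summarizer ever returns text as long as its input, the loop still ends and the final call receives whatever the last round produced.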

Step 5: Run it across many URLs

The two stages now compose into a small pipeline. Give it a list of URLs, fetch each as markdown, summarize each with the length-aware function, and collect the results. Wrap each URL in a try/except so one bad page does not sink the whole batch, and you have something you can point at a feed.

python

import json

URLS = [
    "https://www.crawlbase.com/blog/",
    "https://www.crawlbase.com/blog/ai-data-extraction-how-it-works/",
]

def run_pipeline(urls):
    results = []
    for url in urls:
        try:
            markdown = fetch_markdown(url)
            summary = summarize_long(markdown)
            results.append({"url": url, "summary": summary})
        except Exception as error:
            print(f"Skipped {url}: {error}")
    return results

if __name__ == "__main__":
    output = run_pipeline(URLS)
    print(json.dumps(output, indent=2))

The output is a JSON array of url and summary pairs, ready to write to a file, push to a database, or render into a digest. A trimmed example of what comes back:

json

[
  {
    "url": "https://www.crawlbase.com/blog/",
    "summary": "The Crawlbase blog covers web scraping, proxies, and data extraction, with hands-on tutorials for engineers. Recent posts focus on rendering JavaScript sites, avoiding blocks, and turning pages into clean structured data."
  },
  {
    "url": "https://www.crawlbase.com/blog/ai-data-extraction-how-it-works/",
    "summary": "The article explains how AI models extract structured fields from messy web pages, contrasting rule-based scrapers with model-driven extraction that adapts to layout changes."
  }
]
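Writing the results to a file takes a few lines. The sketch below uses JSON Lines (one object per line), which appends cleanly as a batch grows; write_jsonl and read_jsonl are illustrative helper names, and the file layout is an assumption rather than something the pipeline requires.

```python
import json
from pathlib import Path
from typing import Iterable


def write_jsonl(results: Iterable[dict], path: Path) -> int:
    """Write one JSON object per line and return how many were written."""
    count = 0
    with path.open("w", encoding="utf-8") as handle:
        for row in results:
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")
            count += 1
    return count


def read_jsonl(path: Path) -> list:
    """Load the records back, skipping blank lines."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Calling write_jsonl(output, Path("summaries.jsonl")) after run_pipeline would persist the batch; the filename is only an example.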

Practical tips for production

Cache fetched markdown

Collection and summarization fail for different reasons, so do not couple them. Save each page's markdown to disk keyed by URL the moment you fetch it. When you want to re-run with a different prompt or model, you summarize from the cache instead of re-crawling, which is faster and spends no API credits on pages you already have.
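A disk cache keyed by URL can be sketched as a thin wrapper around whatever fetch function you already have. The names cached_fetch and cache_path, the .cache directory, and the choice of a SHA-256 digest as the filename are all assumptions made for this example.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cache_path(cache_dir: Path, url: str) -> Path:
    """Map a URL to a stable filename using its SHA-256 hex digest."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return cache_dir / f"{digest}.md"


def cached_fetch(url: str, fetch_fn: Callable[[str], str],
                 cache_dir: Path = Path(".cache")) -> str:
    """Return cached markdown for url, calling fetch_fn only on a miss."""
    path = cache_path(cache_dir, url)
    if path.exists():
        return path.read_text(encoding="utf-8")
    markdown = fetch_fn(url)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(markdown, encoding="utf-8")
    return markdown
```

In the pipeline, cached_fetch(url, fetch_markdown) would stand in for the direct fetch call. Hashing the URL avoids filesystem problems with slashes and query strings, at the cost of filenames that are not human-readable.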

Pace and retry your requests

Both APIs can rate-limit a tight loop. Add a short sleep between URLs and wrap the model call in a retry with backoff so a transient error does not lose a page. The Crawling API handles IP rotation and unblocking for you, so the pacing you need here is light, but it is still worth being a polite client.
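Retry with backoff fits in one small helper. This is a sketch, assuming exponential delays of 1, 2, then 4 seconds; with_retry is an illustrative name, and catching every exception is a simplification you would likely narrow to rate-limit and timeout errors in real use.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retry(call: Callable[[], T], attempts: int = 4, base_delay: float = 1.0,
               sleep: Callable[[float], None] = time.sleep) -> T:
    """Run call(), retrying on any exception with exponential backoff.

    Waits base_delay, then 2x, then 4x, ... between attempts and re-raises
    the last error once the attempts are used up.
    """
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

A model call would be wrapped as with_retry(lambda: summarize(page, SUMMARY_PROMPT)). The sleep parameter exists so the delays can be replaced in tests without actually waiting.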

Pin your prompts and model

Reusable, version-controlled prompts are what make summaries consistent across runs. Keep the map and reduce prompts in one place, pin the model name, and hold temperature low. When you change any of them, treat it as a change to your output, because it is.
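One way to make such changes visible is to store a fingerprint of the settings next to each summary. This is a sketch only: SummaryConfig and its fingerprint method are hypothetical, and the prompt strings are shortened stand-ins for the real ones.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SummaryConfig:
    model: str
    temperature: float
    map_prompt: str
    reduce_prompt: str

    def fingerprint(self) -> str:
        """Short, stable hash of every setting that can change the output."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


config = SummaryConfig(
    model="gpt-4o-mini",
    temperature=0.2,
    map_prompt="Summarize this section of a longer document in 3-4 sentences.",
    reduce_prompt="Combine them into a single coherent summary of 5-7 sentences.",
)
# Stored with each result so summaries from different settings are never mixed up.
record = {"url": "https://www.crawlbase.com/blog/", "config": config.fingerprint()}
```

Two summaries with the same fingerprint came from the same model, temperature, and prompts; a different fingerprint signals that the output is not directly comparable.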

Match the tool to the page

Use the Normal token for static pages and the JavaScript token for single-page apps that render content in the browser. If you want fielded data such as price, title, and rating rather than prose, reach for the Crawling API to get structured JSON, then summarize that. And if you are wiring this into an agent or an MCP-based workflow, the Web MCP exposes the same crawling and extraction to your model as tools. For a full agentic build, building an AI data pipeline with LangChain and Crawlbase takes this further.

Recap

Key takeaways

Separate collection from summarization. Crawlbase gets clean text; the model does the language work. Keeping the two stages apart is what makes the pipeline maintainable.
Fetch markdown, not HTML. Pass format=markdown so the model spends its context on content, not nav bars and scripts.
Count tokens, then chunk. Use tiktoken to split long pages into overlapping windows that each fit the context window.
Map-reduce scales to any length. Summarize each chunk, then summarize the summaries, reducing again until one remains.
Cache and pin for production. Save fetched markdown, version your prompts, pin the model, and keep temperature low for consistent output.

Frequently Asked Questions (FAQs)

Why fetch markdown instead of raw HTML for summarization?

Raw HTML is full of tags, scripts, inline styles, nav, and ads that carry no meaning for a summary but still cost tokens. The Crawling API can return a page as markdown, which keeps the headings, lists, and body text while dropping the noise. That means the model spends its limited context on actual content, summaries come out cleaner, and you pay for fewer tokens per page.

What is map-reduce summarization and when do I need it?

Map-reduce is a two-phase pattern for text that is too long for a single model call. In the map phase you summarize each chunk of the document on its own; in the reduce phase you summarize those partial summaries together into one final answer. You need it whenever a page is too long for a single call. The example draws that line at the 2,000-token chunk size, which sits well inside the model's context window. Short pages skip straight to a single call, which is why the example checks the chunk count first.

How do I pick a chunk size?

Size chunks in tokens, not characters, and leave headroom for the prompt and the response. A 2,000-token chunk works well on modern models with large context windows; drop it for smaller models. Add a small overlap, around 100 to 200 tokens, so a sentence split at a chunk boundary still appears whole in one of the chunks. Count tokens with tiktoken using the same encoding as your model.
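The headroom calculation is simple subtraction. The sketch below makes it explicit; chunk_budget is an illustrative helper, and the numbers in the example are made up for the arithmetic, not any specific model's limits.

```python
def chunk_budget(context_window: int, prompt_tokens: int, response_tokens: int,
                 safety_margin: int = 200) -> int:
    """Largest chunk, in tokens, that still leaves room for prompt and reply."""
    budget = context_window - prompt_tokens - response_tokens - safety_margin
    if budget <= 0:
        raise ValueError("no room left for content; shrink the prompt or reply")
    return budget


# Illustrative numbers only: an 8,000-token window, a 60-token instruction,
# and up to 400 tokens reserved for the summary.
print(chunk_budget(8000, 60, 400))  # 7340
```

The result is an upper bound. Choosing a chunk size well below it, as the 2,000-token default does on large-context models, trades a few extra calls for more focused partial summaries.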

Can I summarize JavaScript-rendered pages?

Yes. Swap the Normal token for the JavaScript token in the Crawling API call. It renders the page in a real browser before converting it to markdown, so single-page-app content is present when the model reads it. The rest of the pipeline, the chunking and the map-reduce, does not change at all.

Do I need a paid Crawlbase or OpenAI account to follow this?

No. Crawlbase gives you up to 20,000 free API requests: 1,000 on signup, and more as you complete onboarding steps, which is enough to test this end to end. OpenAI usage is billed per token, and a small model like gpt-4o-mini keeps summarization cheap. Both are fine to start on their free or low-cost tiers before scaling up.

Can I use a different model or provider?

Yes. The pipeline only depends on two things from the model: a chat-style call that takes a prompt and returns text, and a token counter for chunking. Swap the client in the summarize function for any provider you prefer and update the tiktoken encoding to match the model. The collection stage and the map-reduce logic stay exactly the same.

Ian Kalvin

Technical Support Engineer · Crawlbase

Technical support engineer at Crawlbase, writing from the front line of what actually breaks in production scraping and proxy setups.

Neil Zamora

Senior Architect · Crawlbase

Senior architect at Crawlbase, focused on the systems behind large-scale crawling: proxy rotation, anti-bot resilience, and the APIs that hide that complexity.
