Build an AI Research Dataset with Web MCP

Most AI workflows are built for retrieval, not research. An agent fetches a page, pulls out what it needs, answers the question, and moves on. Ask a related question tomorrow and it fetches the same page again. That is fine for one-off lookups. It falls apart the moment you are doing ongoing research against the same set of sources.

Research is cumulative. You revisit sources, compare them over time, and ask new questions of old data. If every question triggers another crawl, your assistant is behaving like a search engine, not a research system. The bottleneck is not crawling. It is the lack of memory.

This guide builds the missing memory: a persistent research dataset with the Crawlbase Web MCP Server. You crawl each page once, store it in Crawlbase Cloud Storage as a reusable Markdown snapshot, and run every later analysis against the stored dataset instead of the live web. A companion repository ships the prompts, MCP config, sample URLs, and an ingestion script used throughout.

Why AI research workflows keep starting over

If you have built AI research workflows before, you know the pattern. You ask an agent to analyze a competitor's pricing page. It crawls the page, extracts the details, answers, and forgets. A few days later you ask a different question about the same company, and it crawls the page again. Next week you compare AI features across ten competitors, and every page gets crawled a third time.

Nothing is technically broken. It is how most AI scraping systems work today. The problem is that every question starts from scratch, because the system was designed around retrieval: fetch, answer, discard.

Continuous collection is sometimes the point. An AI product monitoring tool revisits pages on a schedule precisely to catch new prices, stock changes, or rating shifts. Research is different. You are not watching for what changed in the last hour; you are building knowledge you can revisit, compare, and re-interrogate for weeks. So treat the pages as reusable assets, not disposable inputs: crawl once, store, and let analysis run against the dataset.

Architecture: from web pages to research assets

Once you treat web content as a dataset instead of a search result, collection and analysis become two separate problems. The Web MCP Server handles both; Cloud Storage preserves the snapshots after the conversation ends; a small manifest is the catalog that ties them together.

Pages are collected once and analyzed many times. A single crawl lands a Markdown snapshot in Cloud Storage; the manifest indexes it; every later question runs against the stored dataset instead of the live site.

Instead of revisiting a site whenever a new question appears, the assistant works against snapshots that already exist. The manifest indexes what has been collected (URLs, crawl timestamps, company names, storage IDs) without forcing every document into memory.

Metadata is cheaper than documents

When you are working across dozens or hundreds of pages, loading every one is wasteful. Explore metadata first, narrow the set, and pull full documents only when they earn it. That keeps analysis fast now and matters more as the dataset grows.
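As a rough illustration of metadata-first exploration, the sketch below filters a catalog by company and crawl date without loading any document body. The `ManifestEntry` fields and the `select_entries` helper are assumptions made for this example, not the companion repo's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ManifestEntry:
    """One catalog record. Field names are illustrative, not the repo's schema."""
    url: str
    company: str
    crawled_at: str  # ISO 8601 timestamp
    rid: str         # storage ID used to fetch the full document later


def select_entries(entries, company=None, crawled_since=None):
    """Narrow the catalog using metadata only; no document bodies are loaded."""
    selected = []
    for entry in entries:
        if company and entry.company.lower() != company.lower():
            continue
        if crawled_since and datetime.fromisoformat(entry.crawled_at) < crawled_since:
            continue
        selected.append(entry)
    return selected


# Made-up sample records for illustration.
entries = [
    ManifestEntry("https://example.com/pricing", "Example", "2025-01-10T09:00:00", "rid-001"),
    ManifestEntry("https://example.org/pricing", "Sample", "2025-03-02T09:00:00", "rid-002"),
]

recent = select_entries(entries, crawled_since=datetime(2025, 2, 1))
print([entry.rid for entry in recent])
```

Only the records that survive this filter would need a full-document retrieval, which is where the savings come from as the dataset grows.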

Connect the Web MCP Server

Point your MCP client at the Crawlbase Web MCP Server before building anything. If you want a fuller tour of what the server exposes first, see our introduction to the Crawlbase Web MCP Server.

json

{
  "mcpServers": {
    "crawlbase": {
      "type": "stdio",
      "command": "npx",
      "args": ["@crawlbase/mcp@latest"],
      "env": {
        "CRAWLBASE_TOKEN": "YOUR_TOKEN",
        "CRAWLBASE_JS_TOKEN": "YOUR_JS_TOKEN"
      }
    }
  }
}

The companion repo includes a ready-made mcp-config.sample.json. Drop it into Cursor, Codex, or any MCP-compatible client, replace the token placeholders with your Crawlbase credentials, and restart. You should then see tools such as crawl_markdown, storage_count, storage_list, storage_get, and storage_bulk_get. From here the assistant can crawl, store, retrieve, and manage the dataset with no custom code.
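A forgotten placeholder is an easy mistake at this step. The following is a minimal sketch of a pre-flight check, assuming the config has the shape shown above; `unfilled_tokens` is a hypothetical helper and is not part of the companion repo or the MCP server.

```python
import json

# Placeholder strings taken from the sample config above.
PLACEHOLDERS = {"YOUR_TOKEN", "YOUR_JS_TOKEN"}


def unfilled_tokens(config_text, server="crawlbase"):
    """Return env variable names that are empty or still hold a placeholder."""
    config = json.loads(config_text)
    env = config["mcpServers"][server].get("env", {})
    return sorted(name for name, value in env.items() if not value or value in PLACEHOLDERS)


sample = """
{
  "mcpServers": {
    "crawlbase": {
      "type": "stdio",
      "command": "npx",
      "args": ["@crawlbase/mcp@latest"],
      "env": {"CRAWLBASE_TOKEN": "abc123", "CRAWLBASE_JS_TOKEN": "YOUR_JS_TOKEN"}
    }
  }
}
"""
print(unfilled_tokens(sample))
```

An empty list means every token has been replaced; anything else names the variables that still need real credentials.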

Build the dataset once

The sample URL list holds twenty public SaaS pricing pages. The build prompt crawls each one, stores a Markdown snapshot, and records the metadata in output/dataset-manifest.json.

The one setting that matters is store=true. Without it, a page exists only inside the current conversation; when the session ends, the content is gone and the next question needs another crawl. With it, Crawlbase keeps the snapshot in Cloud Storage and returns an RID you can use to pull the document back later. That one flag is what turns a stream of temporary responses into a reusable dataset.

Work against the dataset, not the web

Once the pages are stored, the workflow changes: you are querying a dataset, not browsing sites. The analysis prompt starts with metadata, not documents.

mcp tools

storage_count
storage_list
storage_bulk_get(as=metadata_only)

Use the metadata to see what exists and decide which records deserve a closer look, then retrieve full Markdown only where you need it. From there the same prompt builds a comparison across competitors: it classifies billing models, pulls plan names and headline prices, and flags whether a free tier exists. By the end you can answer questions like which billing model is most common, who uses usage-based pricing, and how many vendors publish a free plan, all without touching the live pages again.
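Once the assistant has classified each stored page, the summary questions reduce to simple tallies. The sketch below assumes the classification is already available as a list of rows; the field names and the sample values are invented for illustration.

```python
from collections import Counter

# Hypothetical rows an assistant might extract from stored snapshots.
records = [
    {"company": "A", "billing_model": "per-seat", "has_free_tier": True},
    {"company": "B", "billing_model": "usage-based", "has_free_tier": False},
    {"company": "C", "billing_model": "per-seat", "has_free_tier": True},
    {"company": "D", "billing_model": "hybrid", "has_free_tier": False},
]

# Which billing model is most common?
model_counts = Counter(row["billing_model"] for row in records)
most_common_model, _ = model_counts.most_common(1)[0]

# Who uses usage-based pricing?
usage_based = [row["company"] for row in records if row["billing_model"] == "usage-based"]

# How many vendors publish a free plan?
free_tier_count = sum(row["has_free_tier"] for row in records)

print(most_common_model, usage_based, free_tier_count)
```

None of this touches the live pages; the inputs are whatever was extracted from the stored Markdown.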

Detect change over time

"Which competitors changed their pricing model in the last three months?" is a common competitive-intelligence question, and it only works if you kept history. The change-detection prompt compares snapshots over time.

With a single snapshot per competitor, it classifies the current model and explains that time comparisons are not yet possible. With multiple snapshots, it diffs versions and surfaces real shifts: per-seat moving to usage-based, flat-rate turning into hybrid, or a packaging overhaul. Each crawl adds a layer. The first gives you visibility, the second gives you comparison, the third starts to show a trend.

History turns snapshots into trends. One version is a reading; two make a comparison; a third stacks into a trend line you can reason about, which is what change detection needs.

Over time the dataset stops being a pile of pages and becomes a record of how those pages change.
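The change-detection prompt leaves the comparison to the assistant, but the underlying idea can be sketched with a plain line diff. This example uses Python's standard `difflib` on two Markdown snapshots of the same page; the snapshot text is made up, and a line diff is a stand-in for the prompt's richer comparison, not a description of how it works.

```python
import difflib


def snapshot_diff(old_markdown, new_markdown):
    """Return only the added or removed lines between two Markdown snapshots."""
    diff = difflib.unified_diff(
        old_markdown.splitlines(),
        new_markdown.splitlines(),
        fromfile="older",
        tofile="newer",
        lineterm="",
    )
    # The first two lines are the file headers; the rest are hunks.
    hunk_lines = list(diff)[2:]
    return [line for line in hunk_lines if line.startswith(("+", "-"))]


older = "## Pro\n$20 per seat / month\n"
newer = "## Pro\n$0.002 per request\n"
changes = snapshot_diff(older, newer)
print(changes)
```

A diff like this only says that the text changed. Deciding that per-seat became usage-based still takes the classification step described above.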

Reuse and clean up

The payoff of stored snapshots shows up after collection: new questions no longer mean new crawls. The reuse prompt runs entirely different analyses against the same twenty pages, including who offers a free tier, who shows annual and monthly pricing side by side, who leads with usage-based pricing, and who pushes AI features on the pricing page. The source material is already collected; the assistant just asks new questions of it. If you want the agent to act on that data in a live loop rather than analyze a stored set, see Build AI Agent Workflows with Web MCP.

When a project wraps, clear out snapshots you no longer need so they do not muddy future sessions. The cleanup prompt lists stored records, asks for confirmation, and deletes in batches. Because deletion is irreversible, it always confirms before removing anything.
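The confirm-then-delete-in-batches pattern can be sketched as follows. Both `delete_batch` and `confirm` are injected stand-ins so the example runs offline; in a real session they would be the storage delete tool and a prompt to the user, and the batch size of 10 is an arbitrary assumption.

```python
def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def cleanup(rids, delete_batch, confirm, batch_size=10):
    """Delete stored records in batches, but only after an explicit yes."""
    if not rids or not confirm(f"Delete {len(rids)} stored records? This cannot be undone."):
        return 0
    deleted = 0
    for batch in batched(rids, batch_size):
        delete_batch(batch)
        deleted += len(batch)
    return deleted


# Stand-ins so the sketch runs offline: a fake delete call and an automatic "yes".
removed = []
count = cleanup(
    [f"rid-{n}" for n in range(25)],
    delete_batch=removed.append,
    confirm=lambda question: True,
)
print(count, [len(batch) for batch in removed])
```

Asking once, before the first batch, keeps the irreversible step behind a single explicit decision.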

Automate collection

Running prompts by hand is ideal while you are exploring. Once the workflow is routine (the same sources, crawled on a schedule, feeding a growing dataset), automate the collection stage. The repo's ingest_dataset.py does exactly that through the Crawling API.

bash

pip install -r requirements.txt
export CRAWLBASE_TOKEN="YOUR_CRAWLBASE_TOKEN"
python ingest_dataset.py --urls urls.saas-pricing.txt

The script reads the URL list, requests each page as Markdown, stores the snapshot, and writes a manifest. The request itself is deliberately plain:

python

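import requests

# token and url come from the surrounding script:
# the Crawlbase token and the page to crawl.
response = requests.get(
    "https://api.crawlbase.com/",
    params={
        "token": token,
        "url": url,
        "format": "md",            # ask for Markdown output
        "md_readability": "true",  # turn on readability
        "store": "true",           # keep the snapshot in Cloud Storage
    },
)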

It asks for Markdown output with format=md, turns on readability with md_readability=true, and stores the result with store=true. Rather than saving document bodies locally, it captures what it needs to retrieve them later, the most important being the RID that Cloud Storage returns for each page. Those records land in output/dataset-manifest.json:

json

{
  "generated_at": "...",
  "entry_count": 20,
  "stored_count": 20,
  "entries": [...]
}

Think of the manifest as the catalog: the documents live in Cloud Storage, and the manifest records how to find them. The script does the same work as the MCP workflow, only in a form you can rerun on a schedule.
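For a sense of how such a catalog might be assembled, here is a minimal sketch that produces the four top-level keys shown above. The per-entry fields are assumptions, as is the reading of `stored_count` as the number of entries that came back with a storage ID; the real ingest_dataset.py may differ on both.

```python
import json
from datetime import datetime, timezone


def build_manifest(entries):
    """Assemble a manifest with the top-level keys shown above."""
    return {
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "entry_count": len(entries),
        # Assumption: an entry counts as stored when it carries a storage ID.
        "stored_count": sum(1 for entry in entries if entry.get("rid")),
        "entries": entries,
    }


# Entry fields are assumptions for illustration.
entries = [
    {"url": "https://example.com/pricing", "rid": "rid-001"},
    {"url": "https://example.org/pricing", "rid": None},  # nothing was stored
]
manifest = build_manifest(entries)
print(json.dumps(manifest, indent=2))
```

A gap between `entry_count` and `stored_count` is then a quick signal that some pages need another attempt.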

Infrastructure instead of re-crawling

Building a research dataset normally means stitching together a crawler, a storage layer, a retrieval mechanism, and an analysis workflow. The Crawlbase Web MCP Server collapses most of that into tools that live inside Cursor, Codex, and other MCP clients, and Cloud Storage keeps the snapshots reachable long after the crawl.

That changes the economics. Collect content once and reuse it across many analyses, and every page becomes a research asset instead of a throwaway response. The value of the dataset grows while the cost of collection stays roughly fixed. The same idea underpins machine-learning pipelines, where collected data is reused across training and evaluation; see Web Scraping for Machine Learning for that angle. For ongoing market research and competitive intelligence, that shift is often worth more than the crawl itself.
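The arithmetic behind that claim is simple enough to sketch. The example assumes, for simplicity, that every question touches every page and that stored pages are never refreshed; the page and question counts are illustrative.

```python
def crawl_requests(pages, questions, stored):
    """Requests spent on collection for a given number of questions."""
    return pages if stored else pages * questions


pages, questions = 20, 12
recrawl = crawl_requests(pages, questions, stored=False)
reuse = crawl_requests(pages, questions, stored=True)
print(recrawl, reuse)
```

Under these assumptions, twelve questions cost 240 requests when every question re-crawls and 20 when the snapshots are stored, and the gap widens with each new question.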


Key takeaways

Research systems and retrieval systems solve different problems; most AI workflows are built for retrieval.
Re-crawling the same pages for every question pays the collection cost over and over.
Persistent storage separates acquisition from analysis, so one crawl serves many future questions.
Metadata-first exploration scales better than loading every document.
Historical snapshots are what make trend analysis and change detection possible.
A research dataset gets more valuable over time because collection cost is amortized across every later question.
The Crawlbase Web MCP Server combines crawling, storage, retrieval, and analysis into a single workflow, and the companion repo is a working implementation of it.

Frequently asked questions

What is the difference between an AI research dataset and a RAG knowledge base?

A RAG knowledge base is optimized for retrieving relevant context at query time: documents are chunked, embedded, and searched so a model can answer with the right context. An AI research dataset is optimized for accumulation: the goal is to collect and preserve information over time so it can support many future analyses, including RAG, competitive intelligence, market research, and trend detection. You can build a RAG system from a research dataset, but the dataset is broader than any single retrieval pipeline.

Why store web pages instead of crawling them every time?

Repeated crawling is fine for one-off questions but inefficient for ongoing research. Say you collect twenty competitor pricing pages today; tomorrow you compare AI features, next week you analyze annual discounts, a month later you review enterprise packaging. The pages may not have changed, yet repeated crawling makes you pay the collection cost every time. Storing snapshots separates acquisition from analysis, so the same dataset answers many future questions without touching the original sites again.

Why use Markdown instead of raw HTML?

Markdown keeps the information that matters and drops most of the presentation noise. Headings stay headings, lists stay lists, tables stay readable. Raw HTML carries navigation menus, scripts, and styling that add little to research, and Markdown snapshots are easier to read, analyze, chunk, embed, and compare across versions.
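To illustrate why preserved headings make chunking easier, the sketch below splits a Markdown snapshot into sections at each heading. It is a deliberately simple approach using only the standard library; it does not account for `#` lines inside code blocks, and the sample snapshot is invented.

```python
import re

# ATX-style headings: one to six '#' characters followed by whitespace.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_heading(markdown):
    """Split a Markdown document into (heading, body) sections."""
    sections = []
    title, body = "", []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # Close the previous section before starting a new one.
            if title or any(part.strip() for part in body):
                sections.append((title, "\n".join(body).strip()))
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    if title or any(part.strip() for part in body):
        sections.append((title, "\n".join(body).strip()))
    return sections


snapshot = "# Pricing\n\n## Free\n$0, 1 project\n\n## Pro\n$20 per seat\n"
sections = split_by_heading(snapshot)
print(sections)
```

Doing the same against raw HTML would first require stripping navigation, scripts, and styling to find where each plan begins.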

Can I use this approach for data other than SaaS pricing pages?

Yes. The repo uses pricing pages because they are easy to reason about and demonstrate competitive-intelligence workflows, but the same architecture fits product documentation, industry reports, public filings, news articles, knowledge-base content, academic resources, and market research sources. The acquisition and storage workflow stays the same regardless of what you are collecting.

Does the Crawlbase Web MCP Server replace vector databases and embeddings?

No. The Web MCP Server handles acquisition, storage, and retrieval of source documents. Vector databases and embedding models come in when you want semantic search, RAG pipelines, or similarity-based retrieval. Many teams use the Web MCP Server as the acquisition layer and later feed stored documents into embedding pipelines, vector stores, or agents, so the dataset becomes the foundation other AI systems build on.

Ian Kalvin

Technical Support Engineer · Crawlbase

Technical support engineer at Crawlbase, writing from the front line of what actually breaks in production scraping and proxy setups.

Neil Zamora

Senior Architect · Crawlbase

Senior architect at Crawlbase, focused on the systems behind large-scale crawling: proxy rotation, anti-bot resilience, and the APIs that hide that complexity.
