Large language models are good at reasoning over text and bad at knowing what happened five minutes ago. Their knowledge is frozen at training time, they run in sandboxed environments with no outbound web access, and they are not browsers. The moment your question depends on a live page, a current price, or a story that broke this morning, the model is guessing, and a confident guess about stale data is just a well-phrased hallucination.

The Crawlbase Web MCP server closes that gap. It is an MCP server for AI scraping that gives an LLM client a set of tools to crawl and read live web pages, returning clean structured data the model can reason over in the same turn. This guide explains what the Model Context Protocol is, what the Crawlbase Web MCP exposes, how to connect it to an MCP-capable client, and why real-time web access changes what your agents can actually do.

The problem: LLMs are disconnected from the live web

Every general-purpose model, Claude included, is built on a large static training set. That training lets the model reason, summarize, and predict, but it does not let the model observe. A few constraints make this concrete:

  • The knowledge is frozen. Anything that changed after the training cutoff is invisible until the next retrain.
  • The runtime is sandboxed. Models execute in environments that restrict outbound network access by design, so they cannot just go fetch a page.
  • Models are not browsers. Even with a URL in hand, a raw model has no engine to render JavaScript, follow redirects, or get past anti-bot defenses.

The workarounds developers fall back on are all bad in their own way: copy-pasting crawled results into the prompt by hand, accepting hallucinations when context is missing, or building agents that break the moment the underlying data updates. None of that scales, and all of it stems from the same root cause: the model has no live connection to the web.

What the Model Context Protocol (MCP) is

The Model Context Protocol is an open standard that defines a consistent way for AI models to talk to external tools and data sources. Instead of every integration being a bespoke one-off, MCP gives the model a uniform interface: it can list the tools a server offers, call one with arguments, and get a structured result back into its context window.
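To make "list the tools, call one with arguments" concrete, here is a minimal sketch of the two request messages involved. MCP messages are JSON-RPC 2.0, and the method names `tools/list` and `tools/call` come from the MCP specification. The helper names and the `example_tool` tool with its `query` argument are made up for illustration, and a real client also performs an initialization handshake that is omitted here.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC request ids should be unique within a session


def jsonrpc_request(method, params=None):
    """Build a JSON-RPC 2.0 request envelope, the message format MCP uses."""
    message = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        message["params"] = params
    return message


def list_tools_request():
    """Ask the server which tools it offers."""
    return jsonrpc_request("tools/list")


def call_tool_request(name, arguments):
    """Ask the server to run one tool with the given arguments."""
    return jsonrpc_request("tools/call", {"name": name, "arguments": arguments})


if __name__ == "__main__":
    # "example_tool" and its "query" argument are hypothetical placeholders.
    print(json.dumps(list_tools_request(), indent=2))
    print(json.dumps(call_tool_request("example_tool", {"query": "hello"}), indent=2))
```

In practice the client builds and sends these messages for you; the sketch only shows what "a uniform interface" means at the message level.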

Think of it as USB for AI. USB made it so any device plugs into any computer through one standard port; MCP makes it so any tool or data source plugs into any MCP-capable client through one standard protocol. An MCP client (Claude Desktop, Cursor, Windsurf, and a growing list of others) speaks the protocol; an MCP server exposes capabilities through it. The Crawlbase Web MCP is one such server, and the capability it exposes is the live web.

Client vs. server

In MCP terms, the client is the AI app you already use (a desktop assistant or an IDE), and the server is the thing it connects to for extra powers. You do not write a client. You run an MCP server, point your existing client at it, and the model gains the server's tools automatically.

What the Crawlbase Web MCP server exposes

The Crawlbase MCP server is the connective tissue between an LLM client and the real-time web. It is built on the same crawling infrastructure that already serves a large base of developers, so an agent reaching through it gets JavaScript rendering, server-side proxy rotation, and anti-bot handling without knowing any of that is happening. To the model, it is just a few tools that turn a URL into data.

The tools the server exposes do the fetching and the cleanup so the model receives content it can actually use:

  • crawl fetches a URL and returns the page HTML, rendered if the page needs JavaScript to populate.
  • crawl_markdown fetches a URL and returns clean Markdown, stripped of navigation chrome and boilerplate, which is the format models read most reliably.
  • crawl_screenshot captures a visual screenshot of a page for cases where layout or an image matters more than text.

Under the hood each of those is the same hardened crawl the Crawling API performs: a real browser renders the page behind a trusted residential IP, so client-side-rendered sites come back fully populated and the request reads as a genuine visitor rather than a flagged bot. The model never sees that machinery. It asks for a URL and gets back finished, structured content.

Why Markdown for models

The crawl_markdown tool exists because raw HTML wastes tokens on tags and layout the model does not need, and Markdown keeps the structure (headings, lists, links) the model does need. For more on why clean Markdown is the better input shape, see LLM-ready Markdown web scraping.
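A toy comparison makes the token argument easier to see. The sketch below is not how crawl_markdown works internally; it is a deliberately tiny converter built on Python's standard `html.parser`, with a made-up sample page, that drops navigation and styling while keeping headings, list items, and links.

```python
import re
from html.parser import HTMLParser


class MarkdownSketch(HTMLParser):
    """Tiny HTML-to-Markdown converter: headings, paragraphs, list items, links."""

    SKIP = {"script", "style", "nav", "footer"}  # chrome the model does not need

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif self.skip_depth:
            return  # ignore everything nested inside skipped elements
        elif tag in ("h1", "h2", "h3"):
            self.parts.append("\n" + "#" * int(tag[1]) + " ")
        elif tag == "li":
            self.parts.append("\n- ")
        elif tag == "p":
            self.parts.append("\n")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.parts.append("[")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth:
            return
        elif tag == "a":
            self.parts.append(f"]({self.href})")
            self.href = None
        elif tag in ("h1", "h2", "h3", "p"):
            self.parts.append("\n")

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.parts.append(re.sub(r"\s+", " ", data))


def to_markdown(html):
    parser = MarkdownSketch()
    parser.feed(html)
    lines = (line.strip() for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


# Made-up sample page for illustration only.
SAMPLE_HTML = """
<html><head><style>.nav{color:red}</style></head>
<body>
  <nav><a href="/">Home</a><a href="/pricing">Pricing</a></nav>
  <div class="content wrapper"><h1 class="title">Release notes</h1>
  <p>See the <a href="https://example.com/docs">docs</a> for details.</p>
  <ul><li>Faster rendering</li><li>New screenshot tool</li></ul></div>
  <footer>Copyright</footer>
</body></html>
"""

if __name__ == "__main__":
    markdown = to_markdown(SAMPLE_HTML)
    print(markdown)
    print(f"{len(SAMPLE_HTML)} characters of HTML -> {len(markdown)} characters of Markdown")
```

Even on this small page the Markdown is a fraction of the size, and the heading, link, and list structure survive intact.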

How to connect the Crawlbase Web MCP to a client

Connecting the server takes the same three moves in any MCP client: get your tokens, drop a small JSON block into the client's config, and restart. A fourth step confirms the wiring with a first prompt. Here is the full path.

Step 1: Get your Crawlbase tokens

Create a Crawlbase account, which starts with 1,000 free requests and adds more when you add a card. In the dashboard, open your account documentation and copy two tokens: the normal token for static pages and the JavaScript token for pages that render client-side. The MCP server uses both, picking the right one per request.

Step 2: Add the server to your client config

MCP clients read a JSON config that lists the servers they should launch. The Crawlbase entry tells the client to run the server over stdio with npx and hands it your tokens as environment variables. The same block works across Claude Desktop, Cursor, and Windsurf; only the file it lives in differs per client.

json
{
  "mcpServers": {
    "crawlbase": {
      "type": "stdio",
      "command": "npx",
      "args": ["@crawlbase/mcp@latest"],
      "env": {
        "CRAWLBASE_TOKEN": "your_token_here",
        "CRAWLBASE_JS_TOKEN": "your_js_token_here"
      }
    }
  }
}

Replace your_token_here and your_js_token_here with your actual normal and JavaScript tokens. Where this block goes depends on the client:

  • Claude Desktop: File, then Settings, then Developer, then Edit Config, which opens claude_desktop_config.json.
  • Cursor: Cursor Settings, then Tools and Integrations, then Add Custom MCP, which edits mcp.json.
  • Windsurf: Windsurf Settings, then MCP Servers, then Manage MCPs, then View raw config, which edits mcp_config.json.
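If your config file already lists other MCP servers, the Crawlbase entry has to be merged in rather than pasted over them. Here is a minimal sketch of that merge, assuming you have already loaded the file into a dictionary. The function names and the `other-server` entry are illustrative; the entry fields mirror the JSON block above.

```python
import json


def crawlbase_entry(token, js_token):
    """The Crawlbase server entry, with the same fields as the JSON block above."""
    return {
        "type": "stdio",
        "command": "npx",
        "args": ["@crawlbase/mcp@latest"],
        "env": {"CRAWLBASE_TOKEN": token, "CRAWLBASE_JS_TOKEN": js_token},
    }


def add_crawlbase_server(config, token, js_token):
    """Return a copy of config with the crawlbase entry added.

    Other entries under "mcpServers" are preserved, and the input is not mutated.
    """
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["crawlbase"] = crawlbase_entry(token, js_token)
    merged["mcpServers"] = servers
    return merged


if __name__ == "__main__":
    # "other-server" is a hypothetical server that was already configured.
    existing = {"mcpServers": {"other-server": {"command": "node", "args": ["server.js"]}}}
    updated = add_crawlbase_server(existing, "your_token_here", "your_js_token_here")
    print(json.dumps(updated, indent=2))
```

Reading the client's config file with `json.load`, passing it through `add_crawlbase_server`, and writing it back with `json.dump` would complete the round trip. Keep a backup of the original file before overwriting it.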

Step 3: Restart and verify

Save the config and restart (or refresh) the client. Crawlbase should now show up under the client's list of connected MCP servers, with its tools available. If it does not appear, restart once more, since some clients only pick up server changes on a clean start.
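When the server still does not show up, the usual culprits are a JSON syntax slip (a trailing comma, for example) or tokens that were never replaced. A small pre-flight check along these lines can catch both before you restart again. It is a sketch that assumes the config shape used in this guide; the function name and messages are illustrative, not part of any Crawlbase tooling.

```python
import json

PLACEHOLDERS = {"", "your_token_here", "your_js_token_here"}
REQUIRED_ENV = ("CRAWLBASE_TOKEN", "CRAWLBASE_JS_TOKEN")


def check_crawlbase_config(config_text):
    """Return a list of problems found in the config text; an empty list means it looks fine."""
    try:
        config = json.loads(config_text)
    except json.JSONDecodeError as exc:
        return [f"config is not valid JSON (line {exc.lineno}): {exc.msg}"]

    servers = config.get("mcpServers") if isinstance(config, dict) else None
    entry = servers.get("crawlbase") if isinstance(servers, dict) else None
    if not isinstance(entry, dict):
        return ['no "crawlbase" entry under "mcpServers"']

    problems = []
    if entry.get("command") != "npx":
        problems.append('"command" should be "npx"')
    env = entry.get("env") or {}
    for name in REQUIRED_ENV:
        if env.get(name, "") in PLACEHOLDERS:
            problems.append(f"{name} is missing or still a placeholder")
    return problems


if __name__ == "__main__":
    # The template from this guide, pasted without replacing the placeholder tokens.
    template = """
    {"mcpServers": {"crawlbase": {"type": "stdio", "command": "npx",
      "args": ["@crawlbase/mcp@latest"],
      "env": {"CRAWLBASE_TOKEN": "your_token_here",
              "CRAWLBASE_JS_TOKEN": "your_js_token_here"}}}}
    """
    for problem in check_crawlbase_config(template):
        print("problem:", problem)
```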

Step 4: Use it from a prompt

You drive the tools in plain language. The model decides which tool to call and with what URL. A first prompt to confirm the wiring works looks like the example below; the client will usually ask you to approve the tool call the first time, so grant permission when prompted.

text
Crawl https://www.nytimes.com and return the page as markdown

Behind that sentence the client invokes the crawl_markdown tool with the URL as its argument. Conceptually the call looks like this (on the wire, MCP wraps it in a JSON-RPC tools/call request):

json
{
  "tool": "crawl_markdown",
  "arguments": {
    "url": "https://www.nytimes.com"
  }
}

The server renders the page, cleans it, and returns Markdown into the model's context, and the model answers from that live content instead of from memory. In an IDE like Cursor or Windsurf the same flow can write the result straight to a file, so a prompt to crawl a page and save it as Markdown produces a Markdown file on disk with the live content in it.
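For a sense of what "returns Markdown into the model's context" looks like at the protocol level, here is a sketch of unpacking a tool result. The envelope follows the MCP specification, where a tools/call result carries a `content` list of typed parts and an `isError` flag. The sample response body is invented, and the exact content the Crawlbase server returns is an assumption.

```python
def extract_text(response):
    """Join the text parts of a JSON-RPC tools/call response into one string.

    Raises RuntimeError for protocol-level errors and for tool-reported failures.
    """
    if "error" in response:
        err = response["error"]
        raise RuntimeError(f"JSON-RPC error {err.get('code')}: {err.get('message')}")

    result = response["result"]
    texts = [
        part["text"]
        for part in result.get("content", [])
        if part.get("type") == "text"
    ]
    body = "\n".join(texts)
    if result.get("isError"):
        raise RuntimeError(f"tool reported an error: {body}")
    return body


# Invented response for illustration; a real one would hold the crawled page.
SAMPLE_RESPONSE = {
    "jsonrpc": "2.0",
    "id": 2,
    "result": {
        "content": [{"type": "text", "text": "# Example headline\n\nFirst paragraph."}],
        "isError": False,
    },
}

if __name__ == "__main__":
    print(extract_text(SAMPLE_RESPONSE))
```

Your MCP client does this unpacking for you. The sketch is only useful if you are building your own client or debugging what a server sends back.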

Crawlbase Web MCP

Give your AI client live web access with one small config block. The Web MCP server exposes crawl, Markdown, and screenshot tools backed by real-browser rendering, residential IP rotation, and anti-bot handling, so the model gets clean data instead of a blocked request. Start on the free tier and point it at any public page.

Why real-time web access matters for agents

An agent that can read the live web is a different category of tool from one that cannot. The difference shows up the moment a task depends on something the model could not have memorized:

  • Research that is actually current. The model can pull today's article, pricing page, or release note and reason over it, instead of approximating from training data that may be a year stale.
  • Coding assistants with runtime awareness. An IDE agent can read the current docs for a library version rather than suggesting an API that was removed two releases ago.
  • Agents that do not break on updates. Because the data is fetched fresh each run, a workflow keeps working when the source page changes, instead of silently serving a cached snapshot.
  • Structured input, not screen-scraping. Clean Markdown and HTML mean the model spends its context on content, not on parsing layout noise.

This is the same shift that makes a managed proxy more useful to an agent than a raw IP list. If you want the wider picture of how AI tooling consumes the web, what is an AI proxy and AI proxy use cases cover the access layer, and how AI data extraction works covers what happens to the data once it arrives.

How the MCP fits with the rest of Crawlbase

The Web MCP is not a separate scraping engine; it is an MCP-shaped front door onto infrastructure you can also reach directly. The same rendering and unblocking that the MCP tools use is available through the Crawling API for code-driven crawls, including when you want fields parsed out of common page types automatically, and through the Smart AI Proxy when you want an AI proxy endpoint you route normal requests through.

The practical takeaway: use the Web MCP when the consumer is an LLM client and you want the model to fetch live data conversationally, and reach for the API or proxy products when the consumer is your own code. They share the same backend, so behavior is consistent across all of them.
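When the consumer is your own code, the equivalent of a tool call is a plain HTTP request. The sketch below only builds the request URL. It assumes the Crawling API accepts `token` and `url` query parameters at `https://api.crawlbase.com/`, so confirm the endpoint and parameter names against the Crawling API documentation before relying on it. The function name and the `needs_js` flag are illustrative.

```python
from urllib.parse import urlencode

# Assumed endpoint; verify against the Crawling API documentation.
API_BASE = "https://api.crawlbase.com/"


def crawling_api_url(page_url, normal_token, js_token, needs_js=False):
    """Build a Crawling API request URL for page_url.

    Uses the JavaScript token for pages that render client-side and the
    normal token for static pages, matching the token roles in this guide.
    """
    token = js_token if needs_js else normal_token
    # urlencode percent-encodes the target URL so it survives as one parameter.
    return API_BASE + "?" + urlencode({"token": token, "url": page_url})


if __name__ == "__main__":
    print(crawling_api_url("https://example.com/a?b=1", "NORMAL_TOKEN", "JS_TOKEN"))
    print(crawling_api_url("https://example.com/app", "NORMAL_TOKEN", "JS_TOKEN", needs_js=True))
```

You would then fetch the resulting URL with any HTTP client. The point of the sketch is that your code picks the token explicitly, whereas the MCP server makes that choice per request on the model's behalf.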

Recap

Key takeaways

  • LLMs cannot see the live web. Their knowledge is frozen, their runtime is sandboxed, and they are not browsers, so anything current is a guess without an outside tool.
  • MCP is USB for AI. The Model Context Protocol is a standard interface that lets any MCP client call any MCP server's tools and get structured results into the model's context.
  • The Crawlbase Web MCP exposes crawl tools. crawl, crawl_markdown, and crawl_screenshot turn a URL into rendered HTML, clean Markdown, or an image, with rendering and anti-bot handling done server-side.
  • Setup is three steps. Get your tokens, paste one JSON block into your client config, restart, and the model gains live web tools.
  • Real-time access changes what agents can do. Current research, runtime-aware coding help, and workflows that do not break on source updates all depend on fresh data.

Frequently Asked Questions (FAQs)

What is the Crawlbase Web MCP server?

It is an MCP server for AI scraping that gives an LLM client tools to crawl and read live web pages. It exposes crawl, crawl_markdown, and crawl_screenshot over the Model Context Protocol, so a model can fetch a URL and receive rendered HTML, clean Markdown, or a screenshot directly in its context. The crawling, rendering, and unblocking happen on Crawlbase's infrastructure, so the model just sees finished data.

What is the Model Context Protocol (MCP)?

MCP is an open standard that defines a consistent way for AI models to talk to external tools and data sources. An MCP client (such as Claude Desktop, Cursor, or Windsurf) connects to MCP servers, lists their tools, calls them with arguments, and gets structured results back. It is often described as USB for AI because one protocol lets any compatible tool plug into any compatible client.

Which clients can connect to the Crawlbase Web MCP?

Any MCP-capable client. The setup in this guide covers Claude Desktop, Cursor, and Windsurf, which read a JSON config that launches the server over stdio. The same config block works across them; only the file it lives in differs by client. As more tools adopt MCP, the same server works with them too.

Do I need a normal token or a JavaScript token?

You provide both in the config. The server uses the normal token for static pages and the JavaScript token for pages that render client-side and need a real browser to populate. Supplying both lets the server pick the right one per request, so client-side-rendered pages come back fully loaded instead of as an empty shell.

How is the Web MCP different from the Crawling API?

They share the same backend; the difference is who calls them. The Web MCP is for LLM clients, letting a model fetch live data conversationally through MCP tools. The Crawling API is for your own code, called directly over HTTP. Use the MCP when an AI client is the consumer and the API when your application is.

Why do AI agents need real-time web access?

Because a model's training data is frozen and its runtime cannot reach the web on its own, any task that depends on current information (today's news, live pricing, the latest docs) is a guess without a tool. Real-time access lets the agent fetch fresh, structured content and reason over it in the same turn, which is what keeps research current and stops workflows from breaking when source pages change.
