Build AI Agent Workflows with Web MCP

If you have ever wired an AI agent up to the live web, you already know where it falls apart. The agent reasons fine, but the moment it needs real page content it hits a wall: the site renders client-side, the HTML is a tangle, or the request gets challenged before any data comes back. The fix is not a smarter prompt. It is giving the agent a tool that returns clean, structured web data on demand, and letting the agent decide when to call it.

That is exactly what the Crawlbase Web MCP server provides. This guide shows you how to build AI agent workflows around the Crawlbase Web MCP: the planning loop the agent runs, the MCP tool calls it makes to scrape and crawl, and a concrete end-to-end example that takes a URL, fetches the rendered page, and returns a structured answer. No custom scraping code, no proxy pool to babysit, no parsing rules baked into the agent.

What the Crawlbase Web MCP adds to an agent

MCP, the Model Context Protocol, is the open standard that lets a language model call external tools through a consistent interface. An MCP server publishes a set of tools, and any MCP-aware client (Claude Desktop, Cursor, n8n, or your own agent) can discover and invoke them. The Crawlbase MCP server publishes web-access tools, so the agent gains the ability to read any public URL the way a real browser would.

Under the hood, those tools are backed by the same Crawling API that powers the rest of Crawlbase. That means the agent inherits JavaScript rendering, residential IP rotation, anti-bot handling, retries, and clean output without knowing any of it exists. From the agent's point of view it just called a tool and got back readable content. For a fuller tour of what the server exposes, see our introduction to the Crawlbase MCP.

The Web MCP typically exposes two tools your agent will reach for:

crawl fetches a single URL and returns the rendered page as clean markdown or HTML, ready for the model to read.
crawl_markdown (or a screenshot/structured variant, depending on your server build) returns the same content trimmed to readable text, which keeps token usage down on long pages.

Why this beats a raw HTTP tool

You could hand the agent a plain HTTP-request tool instead. On modern sites it rarely holds up: most pages render client-side and challenge automated traffic, so raw fetches return empty shells or blocks. The MCP tool routes through the Crawling API, which renders the page behind a trusted IP and returns finished content, so the agent gets real data on the first call rather than a retry loop.

The agent loop, step by step

An agent workflow is a loop, not a straight line. The model plans, picks a tool, reads the result, and decides whether it has enough to answer or needs another call. With the Web MCP wired in, that loop looks like this:

1. Receive the task. The agent gets an instruction that usually contains a URL or a topic to research.
2. Plan. It reasons about whether it can answer from what it knows or whether it needs live web data.
3. Call the MCP tool. When it needs the page, it invokes crawl with the target URL.
4. Read the result. Crawlbase returns clean, rendered content, which the model ingests as tool output.
5. Decide. Enough to answer? It writes the structured response. Not yet? It loops back, crawling another URL or refining the query.
6. Return. It hands back a clean, structured result in whatever shape you asked for.

The important shift is that the decision to scrape is made by the agent, not hardcoded by you. You describe the goal; the agent figures out which pages it needs and when to fetch them.
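The loop above can be sketched as a dry run with stand-ins for both the model and the crawl tool. Everything here is hypothetical scaffolding: `fake_crawl` returns canned markdown instead of calling Crawlbase, and `scripted_model` is a hard-coded planner that asks for one crawl and then answers. The point is only to make the control flow visible.

```python
# Dry run of the agent loop with stand-ins; no network, no real model.
# fake_crawl and scripted_model are hypothetical placeholders.

def fake_crawl(url):
    """Stand-in for the MCP crawl tool: returns canned markdown."""
    return f"# Widget\n\nPrice: $19.99\n\nSource: {url}"

def scripted_model(messages):
    """Stand-in planner: crawl once, then answer from the tool output."""
    tool_outputs = [m["content"] for m in messages if m["role"] == "tool"]
    if not tool_outputs:
        # Plan: no page content yet, so request a crawl of the URL in the task.
        url = messages[0]["content"].split()[-1]
        return {"tool": "crawl", "url": url}
    # Decide: the page is available, so build the structured answer.
    price_line = next(
        line for line in tool_outputs[-1].splitlines() if line.startswith("Price:")
    )
    return {"answer": {"price": price_line.split(": ", 1)[1]}}

def run_loop(task, max_turns=5):
    messages = [{"role": "user", "content": task}]           # receive the task
    for _ in range(max_turns):
        decision = scripted_model(messages)                  # plan / decide
        if "answer" in decision:
            return decision["answer"], len(messages)         # return
        page = fake_crawl(decision["url"])                   # call the tool
        messages.append({"role": "tool", "content": page})   # read the result
    raise RuntimeError("no answer within max_turns")

answer, message_count = run_loop("Get the price from https://example.com/p/widget")
print(answer)
```

In a real agent the planner is the language model and the crawl goes through the MCP server; the shape of the loop stays the same.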

Step 1: Run the Crawlbase Web MCP server

Any MCP client connects to the server through a small config block. You point the client at the Crawlbase MCP package and pass your token through the environment. Here is a typical configuration for a desktop MCP client.

json

{
  "mcpServers": {
    "crawlbase": {
      "command": "npx",
      "args": ["-y", "@crawlbase/mcp"],
      "env": {
        "CRAWLBASE_TOKEN": "YOUR_CRAWLBASE_JS_TOKEN"
      }
    }
  }
}

Use your JavaScript (JS) token here. Crawlbase issues two token types: the normal token fetches static HTML, while the JS token renders the page in a real browser first. Because most sites worth crawling are client-side rendered, the JS token is the safe default for agent work. You get both tokens from the dashboard after signing up.
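If you would rather not paste the token into a config file by hand, one option is to generate the block from an environment variable. The sketch below assumes the token is exported as `CRAWLBASE_TOKEN`, matching the config above; `build_mcp_config` is a hypothetical helper written for this guide, not part of any Crawlbase package.

```python
import json
import os

def build_mcp_config(env=None):
    """Build the mcpServers block, reading the token from the environment."""
    env = os.environ if env is None else env
    token = env.get("CRAWLBASE_TOKEN", "").strip()
    if not token:
        # Fail early: a missing token otherwise surfaces later as a failed crawl.
        raise RuntimeError("CRAWLBASE_TOKEN is not set; export your JS token first")
    return {
        "mcpServers": {
            "crawlbase": {
                "command": "npx",
                "args": ["-y", "@crawlbase/mcp"],
                "env": {"CRAWLBASE_TOKEN": token},
            }
        }
    }

# An explicit mapping is passed here so the snippet runs without a real token.
config = build_mcp_config({"CRAWLBASE_TOKEN": "demo-token"})
print(json.dumps(config, indent=2))
```

Calling `build_mcp_config()` with no argument reads from the real process environment and raises if the variable is missing.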

If you are running an agent platform like n8n instead of a desktop client, you connect to a hosted MCP endpoint over HTTP rather than spawning the process locally. The full n8n setup is covered in connecting n8n with the Crawlbase Web MCP; the rest of this guide builds the agent in code so you can see the loop directly.

Step 2: Build the agent that calls the MCP tools

Now wire a real agent to the server. The pattern below uses Python with an MCP client library and a tool-calling model. The agent connects to the Crawlbase MCP server, discovers the available tools, and hands them to the model so it can decide when to crawl.

python

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Launch the Crawlbase MCP server as a child process and talk to it over stdio.
server = StdioServerParameters(
    command="npx",
    args=["-y", "@crawlbase/mcp"],
    env={"CRAWLBASE_TOKEN": "YOUR_CRAWLBASE_JS_TOKEN"},
)

async def list_tool_names():
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()  # MCP handshake
            tools = await session.list_tools()
            # Return plain names only: the session closes when these blocks
            # exit, so it must not be returned or used afterwards.
            return [t.name for t in tools.tools]

print(asyncio.run(list_tool_names()))

Running this prints the tool names the server exposes, confirming the agent can see crawl and its siblings before you ask the model to use them. That discovery step is what makes the workflow portable: swap in a different MCP server later and the agent adapts to whatever tools it finds.
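Most tool-calling models expect tools in their own schema rather than as raw MCP objects. As a rough sketch, an MCP tool's `name`, `description`, and `inputSchema` map onto an OpenAI-style function entry as shown below. The choice of target format is an assumption; adapt it to whichever model API you use. The example tool is a stand-in with an invented description and schema, not the server's actual definition.

```python
from types import SimpleNamespace

def to_function_schema(tool):
    """Map an MCP tool (name, description, inputSchema) to an
    OpenAI-style function-calling entry."""
    return {
        "type": "function",
        "function": {
            "name": tool.name,
            "description": tool.description or "",
            # Fall back to an empty object schema if the tool declares none.
            "parameters": tool.inputSchema or {"type": "object", "properties": {}},
        },
    }

# Stand-in for one entry of (await session.list_tools()).tools.
# The description and schema below are invented for illustration.
example_tool = SimpleNamespace(
    name="crawl",
    description="Fetch a URL and return the rendered page.",
    inputSchema={
        "type": "object",
        "properties": {"url": {"type": "string"}},
        "required": ["url"],
    },
)

schema = to_function_schema(example_tool)
print(schema["function"]["name"], schema["function"]["parameters"]["required"])
```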

Step 3: Give the model a tool-calling loop

With the session live, the loop is straightforward. You give the model the task and the tool list, let it emit a tool call, run that call against the MCP server, feed the result back, and repeat until the model stops calling tools and writes its answer.

python

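async def run_agent(session, model, task, system=None, max_turns=8):
    # `model` stands for your tool-calling LLM client; its chat() method and
    # reply shape are illustrative, not a specific SDK.
    messages = [{"role": "system", "content": system}] if system else []
    messages.append({"role": "user", "content": task})
    tools = (await session.list_tools()).tools

    turn = 0
    while turn < max_turns:  # cap the loop so a confused model cannot spin forever
        turn += 1
        reply = await model.chat(messages, tools=tools)

        if not reply.tool_calls:
            return reply.content

        # Record the assistant turn so each tool result has a call to answer.
        messages.append({"role": "assistant", "content": reply.content,
                         "tool_calls": reply.tool_calls})

        for call in reply.tool_calls:
            result = await session.call_tool(call.name, call.arguments)
            # MCP returns a list of content blocks; keep the text ones.
            text = "\n".join(b.text for b in result.content if hasattr(b, "text"))
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": text,
            })

    raise RuntimeError(f"no final answer after {max_turns} turns")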

That while loop is the entire agent. The model plans, calls crawl when it wants the page, reads the markdown Crawlbase returns, and either answers or crawls again. You never tell it which URL to fetch or when to fetch it; you describe the outcome and it routes itself there.

Step 4: Steer the agent with a system prompt

The one place ambiguity creeps in is whether the model decides to use the tool at all. A short, explicit system message removes the doubt and locks in a consistent output shape.

python

SYSTEM = """You are a web research assistant with crawl tools.

Always use the crawl tool to read a URL before answering about it.
Never guess page contents from memory. After crawling, extract only
the fields requested and return them as structured JSON."""

task = (
    "Crawl https://www.example-store.com/product/123 and return "
    "the product name, price, rating, and a one-line summary."
)

With SYSTEM sent as the first message in the conversation, ahead of the user task, a single run produces a clean object: the agent crawls the page, reads the rendered content Crawlbase returns, and emits exactly the fields you asked for. This is the same idea behind structured AI data extraction, except the model decides for itself when to reach for the page.
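Models sometimes wrap their JSON in a sentence or two, so it helps to validate the reply before trusting it. The sketch below pulls the object out of the reply text and checks that the requested fields are present. The field names in `REQUIRED` and the sample reply are assumptions made up for this example.

```python
import json

REQUIRED = ("name", "price", "rating", "summary")  # hypothetical field names

def parse_reply(text, required=REQUIRED):
    """Pull the JSON object out of a model reply (first '{' to last '}')
    and check that every required field is present."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("reply contains no JSON object")
    data = json.loads(text[start:end + 1])  # raises ValueError on bad JSON
    missing = [key for key in required if key not in data]
    if missing:
        raise ValueError(f"reply is missing fields: {missing}")
    return data

# Made-up reply text standing in for the agent's final message.
reply = (
    'Here is the result: {"name": "Widget", "price": "19.99", '
    '"rating": 4.6, "summary": "A small widget."} Let me know if you need more.'
)
product = parse_reply(reply)
print(product["name"], product["price"])
```

A `ValueError` here is a useful signal to retry the run or flag the URL rather than store a partial row.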

A concrete workflow: competitor price watch

Tie the pieces together with a workflow you would actually run. Say you want a daily check on a handful of competitor product pages: current price, availability, and any promo banner. You give the agent the list and let it work through it.

python

urls = [
    "https://competitor-a.com/p/widget",
    "https://competitor-b.com/p/widget",
]

async def price_watch(session, model):
    rows = []
    for url in urls:
        task = f"Crawl {url}. Return price, in_stock, promo as JSON."
        rows.append(await run_agent(session, model, task))
    return rows

Each iteration runs the full agent loop: the model crawls the URL through the MCP tool, Crawlbase renders it and rotates the IP, and the agent returns a structured row. The output is a tidy array you can diff against yesterday's run, push to a sheet, or alert on when a price moves.
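Diffing against yesterday's run can be as simple as comparing two dictionaries. This sketch assumes each row has already been parsed into a dict and stored under its URL; `diff_rows` is a hypothetical helper and the prices are invented sample data.

```python
def diff_rows(previous, current, fields=("price", "in_stock", "promo")):
    """Return per-URL changes between two runs, keyed by URL."""
    changes = {}
    for url, row in current.items():
        old = previous.get(url)
        if old is None:
            changes[url] = {"status": "new"}  # URL was not in the earlier run
            continue
        # Keep (old, new) pairs only for fields whose value moved.
        changed = {
            f: (old.get(f), row.get(f)) for f in fields if old.get(f) != row.get(f)
        }
        if changed:
            changes[url] = changed
    return changes

# Invented sample data for two consecutive runs.
yesterday = {
    "https://competitor-a.com/p/widget": {"price": 19.99, "in_stock": True, "promo": None},
    "https://competitor-b.com/p/widget": {"price": 21.50, "in_stock": True, "promo": None},
}
today = {
    "https://competitor-a.com/p/widget": {"price": 17.99, "in_stock": True, "promo": "Spring sale"},
    "https://competitor-b.com/p/widget": {"price": 21.50, "in_stock": True, "promo": None},
}

changes = diff_rows(yesterday, today)
print(changes)
```

Only URLs with at least one changed field appear in the result, which makes it a natural input for an alert.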

The same skeleton flexes to other jobs without rewrites. Swap the task string and you have a news monitor, a research assistant that gathers notes across several sources, or a lead-enrichment step over public company pages. Because the agent talks to Crawlbase through one stable tool, pointing it at a new site needs no new API wiring. For more on where this fits, the AI proxy use cases roundup walks through adjacent patterns.

Tuning crawls for tough pages

Most pages crawl cleanly with defaults, but heavy single-page apps sometimes need a hint. The MCP tools accept the same waiting options the Crawling API uses, so you can pass them in the tool arguments when a page renders late. Two matter most: an ajax-wait flag that holds for asynchronous content, and a page-wait value in milliseconds for a fixed pause after load.

json

{
  "url": "https://www.example-store.com/product/123",
  "ajax_wait": true,
  "page_wait": 5000
}

If results come back thin, raise page_wait before reaching for anything else. You can let the agent set these itself by describing the page in the system prompt ("for slow single-page apps, wait for ajax content"), or hardcode them in a wrapper when you know the target is heavy. Either way the rendering, rotation, and retry behavior stays on the Crawlbase side; the agent just reads the result.
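A hardcoded wrapper for known-heavy targets might look like the sketch below. `HEAVY_HOSTS` and `crawl_arguments` are hypothetical names; the `ajax_wait` and `page_wait` keys follow the JSON shown above.

```python
from urllib.parse import urlparse

# Hypothetical list of hosts known to render late; maintain your own.
HEAVY_HOSTS = {"www.example-store.com"}

def crawl_arguments(url, heavy_hosts=HEAVY_HOSTS, page_wait_ms=5000):
    """Build crawl tool arguments, adding wait options for heavy hosts."""
    args = {"url": url}
    if urlparse(url).hostname in heavy_hosts:
        args["ajax_wait"] = True          # hold for asynchronous content
        args["page_wait"] = page_wait_ms  # fixed pause after load, in ms
    return args

print(crawl_arguments("https://www.example-store.com/product/123"))
print(crawl_arguments("https://example.org/about"))
```

Keeping the host list in one place means you can raise `page_wait_ms` for a single stubborn site without touching the prompt.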

If a site is so hostile that even rendered crawls struggle, the Smart AI Proxy gives you a single rotating endpoint to route requests through, and the Crawling API returns pre-parsed JSON for popular sites when you would rather skip the model parsing the page at all. Both share the same infrastructure the MCP tools sit on.

Keeping the workflow reliable

A few habits keep an agent workflow healthy in production. Add a check after each run so a failed crawl is visible instead of silently producing an empty row. Pace your requests when looping many URLs rather than firing them all at once. Persist the structured output somewhere, a database or even a spreadsheet, so you can look back and diff over time. And tune the prompt per target when needed: one generic instruction across very different sites usually gives weaker results than a few site-specific lines.
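The first two habits, checking each run and pacing requests, fit in one small helper. This is a minimal sketch with a fixed delay between calls; `run_paced` and `fake_fetch` are hypothetical, and in practice `fetch` would wrap a call into your agent.

```python
import asyncio

async def run_paced(urls, fetch, delay_s=1.0):
    """Run fetch(url) one URL at a time, pausing between calls and
    recording failures instead of dropping them."""
    results = []
    for i, url in enumerate(urls):
        if i:
            await asyncio.sleep(delay_s)  # simple fixed pacing between requests
        try:
            row = await fetch(url)
            ok = bool(row)  # treat an empty result as a failed crawl
            results.append({"url": url, "ok": ok, "row": row if ok else None,
                            "error": None if ok else "empty result"})
        except Exception as exc:  # keep going; surface the error in the output
            results.append({"url": url, "ok": False, "row": None, "error": str(exc)})
    return results

# Stand-in for a call into the agent; the second URL fails on purpose.
async def fake_fetch(url):
    if "broken" in url:
        raise TimeoutError("crawl timed out")
    return {"price": "19.99"}

report = asyncio.run(run_paced(
    ["https://example.com/a", "https://example.com/broken"], fake_fetch, delay_s=0,
))
failed = [r["url"] for r in report if not r["ok"]]
print(failed)
```

Every URL gets a row in the report, so a failed crawl shows up as `ok: False` with a reason rather than as a silent gap.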

When an agent reports that "no tools were used," it almost always means the model was not confident it should crawl. Tightening the system message and making sure the URL is clearly in the task resolves it. For connection problems, check that the MCP server is running, confirm the token is set in the environment, and list the tools first to prove the handshake works before debugging the model.

Recap

Key takeaways

The MCP server is the agent's web access. It publishes crawl tools any MCP-aware client can discover and call, backed by the Crawling API.
The agent owns the decision to scrape. You describe the goal; the model plans, calls the tool when it needs a page, reads the result, and loops or answers.
Use the JS token. It renders client-side pages in a real browser, which is what most modern sites require to return real content.
The loop is portable. Discover tools at runtime and the same agent adapts to new sites with no new API wiring.
Tune with wait options. Pass ajax_wait and page_wait for heavy single-page apps; raise page_wait first when results come back thin.
Add guardrails. Check for failed crawls, pace requests, persist output, and tune prompts per target.

Frequently Asked Questions (FAQs)

What is the Crawlbase Web MCP and how does an agent use it?

The Crawlbase Web MCP is a Model Context Protocol server that publishes web-access tools, chiefly a crawl tool, to any MCP-aware AI agent. The agent connects to the server, discovers the tools, and calls them when it needs live page content. Each call is backed by the Crawling API, so the agent receives rendered, clean content without writing any scraping code.

Do I need the normal token or the JS token for agent workflows?

Use the JS token for agent work. The normal token fetches static HTML, which on modern client-side sites is an empty shell. The JS token renders the page in a real browser before returning it, so the content the agent reads actually contains the data. You get both tokens from the Crawlbase dashboard after signing up.

Which AI agents and platforms work with the Crawlbase Web MCP?

Any MCP-compatible client works, including Claude Desktop, Cursor, Windsurf, and agent platforms like n8n, as well as custom agents you build with an MCP client library. As long as the client can connect to the server and call tools, it can use the Crawlbase crawl tools.

Can the agent scrape JavaScript-heavy sites without extra setup?

Yes. The crawl tool renders JavaScript automatically through the Crawling API, so the agent receives fully rendered content without you running Puppeteer or Selenium. For pages that load late, pass ajax_wait and a larger page_wait in the tool arguments and the API holds until the content appears.

How does this avoid getting blocked?

The MCP tools route through the Crawling API, which rotates residential IPs, manages browser fingerprinting, and handles anti-bot challenges and retries server-side. The agent never sees that machinery; it just gets clean content back. Keep your request rate reasonable when looping many URLs and the workflow stays healthy.

How is this different from giving the agent a plain HTTP request tool?

A raw HTTP tool returns whatever the server sends, which on most modern sites is an unrendered shell or a block. The Crawlbase MCP tool renders the page behind a trusted IP and returns finished content on the first call, so the agent spends its turns reasoning about real data instead of retrying failed fetches.

Ian Kalvin

Technical Support Engineer · Crawlbase

Technical support engineer at Crawlbase, writing from the front line of what actually breaks in production scraping and proxy setups.

Neil Zamora

Senior Architect · Crawlbase

Senior architect at Crawlbase, focused on the systems behind large-scale crawling: proxy rotation, anti-bot resilience, and the APIs that hide that complexity.
