Finding quality leads is slow work. Anyone who has done sales or marketing knows the grind: open a browser, hunt for companies that fit your profile, copy their details into a spreadsheet, repeat. This guide shows you how to skip most of that by building an AI sales bot with the Crawlbase Web MCP, an assistant that gathers public company and contact-page data on its own and drafts your outreach for you.

To keep this honest and defensible, the whole build is scoped to public business data: company names, websites, public contact pages, listed industries, and the generic email addresses a business publishes for inbound enquiries. It does not scrape login-walled content, buy or harvest personal data, or try to enrich individual people beyond what a company chooses to publish. The ethics and ToS section near the end is not boilerplate, so read it before you point this at real volume.

What an AI sales bot built on the Crawlbase Web MCP actually does

The pattern is simple once you see the two halves. Claude (or any MCP-capable client) supplies the reasoning: it reads your request, decides which pages to look at, and structures the result. The Crawlbase Web MCP supplies the hands: it fetches live pages, renders JavaScript-heavy sites, gets past common blocks, and hands back clean content for the model to read.

On its own a language model cannot browse the open web reliably. The MCP server is what closes that gap. It exposes a small set of tools the model can call to crawl and read a URL, so instead of you pasting page contents in by hand, the model requests them itself and works with the structured output. If you want the broader picture of why this matters, feeding real-time web data to LLMs covers the motivation in depth.

A realistic run looks like this: you describe an ideal customer profile, the bot finds companies that match, visits each company's public site and contact page, pulls the fields you asked for, and returns a tidy table plus draft outreach copy tailored to each one. No scraping scripts, no proxy pool to babysit, no browser automation to maintain.

Public data only

Everything below targets data a company publishes for the public: its site, its listed industry, and the generic contact addresses (sales@, hello@, info@) it asks people to use. The bot is told explicitly not to collect personal data about named individuals. That boundary is what keeps the work defensible, and it is enforced in the prompt, not left to chance.

How the pieces fit together

Three components do the work, and it helps to name them before wiring anything up.

Claude is the reasoning layer

The model breaks your request into steps, decides which pages are worth visiting, picks the fields to extract, and formats the result. You describe the outcome you want in plain language and it plans the path. That planning is the part you would otherwise do by hand.

The Web MCP is the crawling layer

The MCP server is what lets the model reach the live web. It fetches the page behind a trusted IP, renders client-side content, handles common anti-bot friction, and returns clean output. Under the hood it leans on the same infrastructure as the Crawling API and Smart AI Proxy, so you get real-time data without assembling a headless fleet and a rotating proxy pool yourself.

Your output is structured leads plus drafts

Once the pages come back, the model assembles each company into a row (name, website, public contact email, industry, short description) and then drafts a short outreach message keyed to what it read on that company's site. You can export the table to CSV for a CRM, or keep iterating in the chat.

Step 1: Install and configure the Web MCP

The Web MCP runs as a local server that your MCP client launches. For Claude Desktop, open Settings, Developer, Edit Config, and add the Crawlbase server to the mcpServers block. The config below points the client at the published package and passes your tokens through the environment.

json
{
  "mcpServers": {
    "crawlbase": {
      "command": "npx",
      "args": ["-y", "crawlbase-mcp"],
      "env": {
        "CRAWLBASE_TOKEN": "YOUR_NORMAL_TOKEN",
        "CRAWLBASE_JS_TOKEN": "YOUR_JS_TOKEN"
      }
    }
  }
}

Two tokens, both from your Crawlbase dashboard. The normal token fetches static HTML; the JavaScript (JS) token renders the page in a real browser first. Company sites that build their content client-side need the JS token, so keep both available and let the bot choose. Save the config and restart the client so it picks up the new server.
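Before restarting the client, it can help to confirm the config parses and that the placeholder tokens were actually replaced. The following is a minimal sketch, not part of any Crawlbase tooling: check_mcp_config is a hypothetical helper, and it assumes the config has the same shape as the example above.

```python
import json

# Placeholder values from the example config; real tokens should differ.
PLACEHOLDERS = {"", "YOUR_NORMAL_TOKEN", "YOUR_JS_TOKEN"}
REQUIRED_ENV = ("CRAWLBASE_TOKEN", "CRAWLBASE_JS_TOKEN")


def check_mcp_config(config_text: str, server: str = "crawlbase") -> list[str]:
    """Return a list of problems found in an MCP client config (empty if none)."""
    try:
        config = json.loads(config_text)
    except json.JSONDecodeError as exc:
        return [f"config is not valid JSON: {exc}"]
    if not isinstance(config, dict):
        return ["config root is not a JSON object"]

    entry = config.get("mcpServers", {}).get(server)
    if entry is None:
        return [f"no '{server}' entry under mcpServers"]

    problems = []
    if not entry.get("command"):
        problems.append("missing 'command'")
    env = entry.get("env", {})
    for name in REQUIRED_ENV:
        # A missing key and an unreplaced placeholder are reported the same way.
        if env.get(name, "") in PLACEHOLDERS:
            problems.append(f"{name} is missing or still a placeholder")
    return problems
```

Read your client's config file into a string, pass it in, and an empty list means the entry at least looks complete. It does not prove the tokens are valid; only a real fetch does that.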

Normal token vs JS token

The normal token is faster and cheaper and works for plain server-rendered pages. The JS token spins up a real browser and is what you need for sites that load their content with JavaScript. Many marketing sites are static enough for the normal token; reach for the JS token when a page comes back thin or empty.
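If you ever post-process fetched HTML yourself, the "thin or empty" check can be approximated in a few lines. This is a rough sketch using only the Python standard library; looks_thin and its 200-character threshold are illustrative assumptions, not something the Web MCP exposes.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text that sits outside <script>, <style>, and <noscript>."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def visible_text(html: str) -> str:
    """Return the page's visible text with whitespace collapsed."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def looks_thin(html: str, min_chars: int = 200) -> bool:
    """Heuristic: very little visible text suggests a client-rendered shell."""
    return len(visible_text(html)) < min_chars
```

A page that is mostly a script tag and an empty root element will come back thin, which is the cue to retry that fetch with the JS token. The threshold is a guess; tune it against pages you know.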

Step 2: Confirm the tools are live

After the restart, your client should list the Crawlbase tools (typically a crawl tool and a read tool). A quick sanity check is to ask the model to fetch one known public page and summarize it. If it returns real content rather than an apology about not having web access, the server is wired correctly.

text
Using the Crawlbase Web MCP, fetch https://example.com and give me
a one-line summary of what the company does.

If that comes back with a real summary, the model is calling the MCP tools and reading live pages. You are ready to give it the actual job.

Step 3: Give the bot its sales brief

Now you tell the bot what to collect and, just as important, what not to. The prompt below sets the role, scopes the search to public business data, names the exact fields, and bans personal data outright. Treat it as a system instruction and paste it into a fresh conversation.

text
You are an AI sales bot. You use the Crawlbase Web MCP to gather
PUBLIC business data only and to draft outreach.

Rules:
- Collect only data a company publishes publicly: company name,
  website, listed industry, public description, and the generic
  contact email on the site (e.g. sales@, hello@, info@).
- Do NOT collect personal data about named individuals.
- Do NOT guess or pattern-build emails. Use only what is published.
- If a field is not public, leave it blank rather than inferring it.

For each company, return a row with: Company Name, Website,
Contact Email, Industry, Description, and a 2-sentence draft
outreach note referencing something specific from their site.

The two "do NOT" lines are doing real work. They keep the bot on published, company-level data and stop it from fabricating addresses, which is both an accuracy problem and an ethics one. Once the role is set, send the actual target in a follow-up message.

text
Find 10 B2B SaaS companies in Singapore that offer CRM or
marketing automation. For each, visit the company site and its
public contact page, then fill in the row format above.

Step 4: Let the bot crawl and draft

With the brief and the target in place, the bot plans its run. It identifies candidate companies, calls the MCP crawl tool on each company's public site and contact page, reads what comes back, and fills in the row format you defined. Because the model can see the actual page text, the draft outreach note it writes can reference something concrete (a product line, a recent launch, a stated focus) rather than generic filler.

A trimmed example of what comes back:

json
[
  {
    "companyName": "Acme CRM",
    "website": "https://acmecrm.example",
    "contactEmail": "sales@acmecrm.example",
    "industry": "B2B SaaS / CRM",
    "description": "Pipeline and contact management for SMB teams.",
    "draftNote": "Saw your pipeline-automation focus for SMB teams..."
  }
]

Ask the bot to render the same data as a CSV and you can drop it straight into a spreadsheet or CRM import. The draft note stays a draft: a starting point you edit, not a message to fire off unread.
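If you would rather convert the JSON yourself than ask the bot for CSV, the standard library covers it. The sketch below assumes the rows use the same keys as the trimmed example above; rows_to_csv and the column headers are illustrative choices.

```python
import csv
import io
import json

# JSON key -> CSV header, in the order the brief lists the fields.
COLUMNS = {
    "companyName": "Company Name",
    "website": "Website",
    "contactEmail": "Contact Email",
    "industry": "Industry",
    "description": "Description",
    "draftNote": "Draft Note",
}


def rows_to_csv(rows_json: str) -> str:
    """Convert the bot's JSON array of rows into CSV text."""
    rows = json.loads(rows_json)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(COLUMNS.values())
    for row in rows:
        # Missing or null fields become empty cells rather than guesses.
        writer.writerow([row.get(key) or "" for key in COLUMNS])
    return buffer.getvalue()
```

Blank cells stay blank on purpose, which matches the rule in the brief about leaving a field empty rather than inferring it.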

Step 5: Tighten the loop

The first run is rarely the final one. A few small additions make the bot far more useful in practice, and each is a one-line follow-up in the chat rather than new code.

Validate before you trust

Not every published address is monitored, and generic inboxes vary in quality. Ask the bot to flag rows where the contact email looks like a catch-all or where the contact page was missing, so you can prioritize the clean ones. This is a quality gate, not personal-data enrichment.
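The same gate can be run locally on an exported table. This is a minimal sketch under stated assumptions: flag_row is a hypothetical helper, the row keys follow the earlier example, and the list of generic mailboxes extends the three named in this guide with a few common ones. It cannot tell whether an inbox is a catch-all or monitored; it only checks what is visible in the row.

```python
from urllib.parse import urlparse

# sales@, hello@, info@ come from the brief; the rest are assumed additions.
GENERIC_MAILBOXES = {"sales", "hello", "info", "contact", "enquiries", "support"}


def site_domain(url: str) -> str:
    """Lower-cased hostname of a URL, without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def flag_row(row: dict) -> list[str]:
    """Return review flags for one lead row (empty list means clean)."""
    email = (row.get("contactEmail") or "").strip().lower()
    if not email:
        return ["no published contact email"]

    local, _, domain = email.rpartition("@")
    if not local or not domain:
        return ["contact email is malformed"]

    flags = []
    if local not in GENERIC_MAILBOXES:
        flags.append("mailbox is not a generic inbox; check it is not a named individual")
    if domain != site_domain(row.get("website") or ""):
        flags.append("email domain does not match the website domain")
    return flags
```

Rows with an empty flag list are the ones to prioritize; anything flagged goes to a human for a look before it reaches a CRM.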

Deduplicate across runs

Run the bot a few times with overlapping criteria and the same company shows up twice, sometimes under slightly different names. Have it compare domains and merge duplicates before returning the table. It keeps your list tidy and stops you from contacting the same company twice.
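If you are combining exports from several runs outside the chat, domain-based merging is easy to sketch. The helper names below are hypothetical, and the merge rule (keep the first row seen, fill its blanks from later duplicates) is one reasonable choice rather than the only one.

```python
from urllib.parse import urlparse


def domain_key(url: str) -> str:
    """Normalize a website value to a bare lower-cased domain."""
    candidate = url.strip()
    if not candidate:
        return ""
    if "://" not in candidate:
        # Rows sometimes carry 'acme.example/contact' without a scheme.
        candidate = "https://" + candidate
    host = (urlparse(candidate).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def merge_by_domain(rows: list[dict]) -> list[dict]:
    """Merge rows that share a domain; the first row wins, blanks get filled."""
    merged: dict[str, dict] = {}
    no_domain: list[dict] = []
    for row in rows:
        key = domain_key(row.get("website") or "")
        if not key:
            no_domain.append(dict(row))  # nothing to compare on; keep as-is
            continue
        if key not in merged:
            merged[key] = dict(row)
            continue
        kept = merged[key]
        for field, value in row.items():
            if not kept.get(field) and value:
                kept[field] = value
    return list(merged.values()) + no_domain
```

Comparing on the domain rather than the company name is what catches "Acme CRM" and "Acme CRM Pte Ltd" as the same lead.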

Tune the targeting

The whole system is just a prompt, so retargeting is a sentence away. Swap in a new industry, region, or niche and the bot adapts. The more specific the brief, the more relevant the results, so lean toward precise asks like "fintech companies in Canada offering lending or payments" over broad ones. For more advanced multi-step setups, building AI agent workflows with the Crawlbase Web MCP goes further than a single prompt.
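If you retarget often, templating the follow-up message keeps the wording consistent between runs. A small sketch, assuming the phrasing of the Step 3 target message; build_target is a hypothetical helper, not part of any SDK.

```python
def build_target(count: int, segment: str, region: str, offerings: list[str]) -> str:
    """Build the follow-up target message sent after the sales brief."""
    if count < 1:
        raise ValueError("count must be at least 1")
    if not offerings:
        raise ValueError("name at least one offering to keep the brief specific")
    offer_text = " or ".join(offerings)
    return (
        f"Find {count} {segment} companies in {region} that offer {offer_text}. "
        "For each, visit the company site and its public contact page, "
        "then fill in the row format above."
    )
```

For example, build_target(10, "fintech", "Canada", ["lending", "payments"]) produces the precise ask described above, ready to paste into the conversation.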

Where this fits in a real sales stack

A bot like this is not a replacement for a CRM or a sequencing tool; it is the research front-end that feeds them. It turns "build me a list" from an afternoon of tab-juggling into a conversation, and it keeps the data fresh because every run pulls live pages rather than reusing a stale export. If you are weighing managed web access against rolling your own proxy layer, AI proxy use cases lays out where each approach earns its keep.

One framing worth holding onto: published industry figures on lead-research time vary widely and are easy to cherry-pick, so treat any "saves X hours" claim, including ones you might be tempted to make internally, as directional rather than precise. The real win is consistency: the bot collects the same fields the same way every time, which is harder to do by hand than it sounds.

The honest part: ToS and personal data

Gathering business data from public sites sits in a gray area, and whether a given run is allowed depends on each site's terms of service, your jurisdiction, and what you do with the data. Many sites restrict automated access in their terms, so crawling can run against those terms regardless of how careful your tooling is. The MCP makes the technical part work; it does not change the rules you are operating under.

A few lines worth holding to. Collect only public, company-level data: names, sites, listed industries, and the generic contact addresses a business publishes for inbound enquiries. Do not harvest personal data about named individuals, do not pattern-build or guess email addresses, and do not pull anything behind a login. Respect each site's robots.txt and stated rate expectations, and keep request volume low enough that you are not straining anyone's servers. When you actually send outreach, follow the anti-spam laws that apply to you, give people a clear way to opt out, and honor it.
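For the robots.txt and rate part, Python's standard library already has a parser. The sketch below assumes you have obtained a site's robots.txt text separately; plan_fetch, the "example-sales-bot" user agent, and the 5-second default delay are illustrative assumptions, and passing this check says nothing about a site's terms of service.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser


def make_policy(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that was fetched separately."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def plan_fetch(
    parser: RobotFileParser,
    url: str,
    user_agent: str = "example-sales-bot",
    default_delay: float = 5.0,
) -> Optional[float]:
    """Return seconds to wait before fetching url, or None if disallowed."""
    if not parser.can_fetch(user_agent, url):
        return None
    delay = parser.crawl_delay(user_agent)
    # Fall back to a conservative pause when the site states no Crawl-delay.
    return float(delay) if delay is not None else default_delay
```

A None result means skip the URL entirely; a number is the minimum pause to leave between requests to that site.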

This guide is deliberately scoped to public business data because that is the line that keeps the work defensible. If your project needs richer contact data, the right move is a consented data source or an official provider, not a cleverer crawler. The same managed access that powers the Web MCP, the Crawling API, and the AI Proxy is built for public-data collection done responsibly, not for slipping past consent.

Recap

Key takeaways

  • Two halves, one bot. Claude reasons and plans; the Crawlbase Web MCP fetches and renders live pages. Together they research and draft outreach with no scraping code.
  • Install is config, not code. Add the Crawlbase server to your MCP client's config with your normal and JS tokens, restart, and confirm the tools are live.
  • The prompt is the product. Set the role, scope to public data, name the fields, and ban personal data. Retargeting is a one-sentence follow-up.
  • JS token for rendered sites. Use the normal token for static pages and the JS token when content loads client-side.
  • Stay on public, company-level data. No login-walled content, no guessed emails, no personal data about individuals. Respect ToS, robots.txt, and anti-spam law.

Frequently Asked Questions (FAQs)

What is an AI sales bot built on the Crawlbase Web MCP?

It is an assistant that pairs a language model's reasoning with the Crawlbase Web MCP's live web access. You describe an ideal customer profile, the model plans the search and calls the MCP tools to crawl public company sites, and it returns structured leads plus draft outreach, without you writing a scraper or running a proxy pool.

Do I need to write any code to build this?

No. The build is a config entry plus prompts. You add the Crawlbase server to your MCP client's configuration with your tokens, restart, then give the bot a role prompt and a target. The model handles the crawling and formatting through the MCP tools.

Do I use the normal token or the JS token?

Both. Add both to the config and let the bot choose per page. The normal token fetches static HTML and is faster; the JS token renders the page in a real browser first and is what you need for sites that build their content with JavaScript. If a page comes back thin or empty, switch that fetch to the JS token.

Can the bot find personal email addresses for specific people?

It should not, and this guide tells it not to. The prompt restricts collection to public, company-level data such as generic contact addresses (sales@, hello@, info@) and bans guessing or pattern-building personal emails. Harvesting or inferring data about named individuals is out of scope on purpose, both for accuracy and for compliance.

How accurate is the data the bot returns?

It is as accurate as the pages it reads, which is why the prompt tells it to leave a field blank rather than infer it. Public sites can be out of date or inconsistent, so add a validation pass: have the bot flag missing contact pages or catch-all addresses, and review the rows before any outreach. Treat draft notes as starting points you edit, not finished messages.

Is it legal to crawl public company sites for lead data?

It depends on each site's terms of service, your jurisdiction, and your purpose, and many sites restrict automated access. Keep strictly to public, company-level data, respect robots.txt and rate expectations, and never collect personal data or login-walled content. When you send outreach, follow the anti-spam laws that apply to you and offer a clear opt-out. For richer contact data, use a consented source rather than a more aggressive crawler.
