How to Build a Search Engine Tool

Fetching a handful of search result pages is trivial. Fetching thousands an hour, every hour, without getting blocked is a different problem entirely. Search engines are tuned for human visitors, so automated traffic stands out fast: requests from a single address start drawing CAPTCHAs, empty pages, or outright bans, and results shift depending on where the request appears to come from. The hard part of any search engine tool is never the parsing. It is keeping access to the source pages alive as volume grows.

This guide shows you how to build a search engine tool that stays reliable at scale by routing every outbound request through the Crawlbase Smart AI Proxy. You point a standard HTTP client at one endpoint, and the proxy handles rotating IPs, geo targeting, and anti-bot mitigation for you. Your code owns the query logic and the product; Crawlbase owns the network-level reliability. We will walk through the constraints first, then build a working Python tool that queries a search engine, parses the results, and returns structured records.

Why scraping search results at scale is so hard

It helps to be specific about what breaks, because each failure mode points at the infrastructure you would otherwise have to build and babysit yourself.

IP blocking and bans

When many requests come from the same address, they look suspicious. Cross a threshold and the responses flip to errors, empty pages, or verification prompts. A single cloud instance can sail through testing and then fall over the moment real traffic arrives, because the volume that was fine for a demo is exactly what trips the limits.

Geo restrictions and localized results

Search results are not universal. A query from London can return different rankings and local listings than the same query from New York or Berlin. If your product depends on region-specific data, your requests have to appear to originate from those regions, which means controlling the exit location of each request.

CAPTCHA and anti-bot defenses

Modern search platforms layer their defenses. Even when a request technically succeeds with a 200, the page you get back can be a challenge rather than the actual results. Handling that reliably takes infrastructure that adapts as detection systems change, not a one-time fix.

Rate limits and throttling

High-frequency traffic from an identifiable source gets shaped or blocked. Without distributing requests across many routes, throughput eventually drops to zero no matter how efficient your code is.

Building all of this in-house means running proxy pools, monitoring failures, rotating addresses, and reacting to detection changes. For most teams that is an operational burden, not a feature.

Why proxy rotation is the right fix

The Smart AI Proxy sits between your application and the search engine. You configure it like an ordinary proxy, send requests as usual, and get responses back as if you had connected directly. The difference is that each request is routed through infrastructure built specifically for automated data collection, so the failure modes above stop being your problem.

The characteristics that matter for a search tool:

Requests are distributed across a large pool of IPs instead of one address.
Traffic patterns are tuned to avoid the common triggers that get scrapers blocked.
Location targeting can be applied per request when you need regional results.
No special client libraries are required, so any HTTP-capable language works.

Optional behavior is controlled through a request header, CrawlbaseAPI-Parameters. That is how you turn on structured parsing for Google, for example, without changing your request logic. The connection details you need are short:

HTTPS (recommended): https://smartproxy.crawlbase.com:8013
HTTP: http://smartproxy.crawlbase.com:8012
Authentication: your Crawlbase token as the proxy username.

Why SSL verification is off

When you route through the Smart AI Proxy, SSL verification for the destination is typically disabled, because the proxy needs to inspect traffic to apply routing logic and response handling. In Python that means passing verify=False on the request. It is expected here and limited to traffic going through the proxy, not a blanket setting for your whole application.

What the tool actually does

A search engine tool has several parts, but only one of them talks to an external search engine. The Smart AI Proxy sits at that boundary as the outbound data collection layer, which keeps the rest of your system insulated from blocking. The flow is short:

A user submits a query.
Your application builds the matching search URL.
The request goes out through the Smart AI Proxy.
Results come back from the search engine.
The data is normalized and then stored or displayed.

Because every outbound request passes through the proxy, scaling up mostly affects cost and processing capacity rather than reliability. The fragile part, maintaining access to the source, is handled once at that boundary and then stops changing as you grow.

Set up the environment

You need Python 3.8 or later and a single dependency. Confirm your version, create a virtual environment so the install stays isolated, then add requests.

bash

python --version

python -m venv serp_env
source serp_env/bin/activate

pip install requests

On Windows, activate the environment with serp_env\Scripts\activate instead of the source line. You also need your Crawlbase token, which doubles as the proxy authentication key. Grab it from your dashboard after signing up and keep it out of source control by reading it from an environment variable.

bash

export CRAWLBASE_TOKEN=your_crawlbase_token_here

Step 1: Accept and normalize the query

Search engines expect properly encoded parameters. Raw user input like best coffee shops Paris has to become a valid query string before it can go into a URL. Spaces, special characters, and non-ASCII text all break the request if you pass them through untouched, so you encode them first. In Python that is quote_plus, which turns the raw string into best+coffee+shops+Paris. Keeping this in one place pays off later, because every search engine you support needs the same step.
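As a minimal sketch of that normalization step, the helper below wraps quote_plus in one place. The function name normalize_query, the whitespace collapsing, and the empty-query check are illustrative assumptions, not part of any Crawlbase API.

```python
from urllib.parse import quote_plus


def normalize_query(raw_query: str) -> str:
    """Collapse stray whitespace, then URL-encode the query (hypothetical helper)."""
    cleaned = " ".join(raw_query.split())
    if not cleaned:
        raise ValueError("query must not be empty")
    # quote_plus encodes spaces as "+" and percent-encodes everything unsafe.
    return quote_plus(cleaned)


print(normalize_query("  best coffee   shops Paris "))  # best+coffee+shops+Paris
print(normalize_query("café & crème"))                  # caf%C3%A9+%26+cr%C3%A8me
```

Non-ASCII characters are percent-encoded as UTF-8 bytes, which is why the accented query stays valid in a URL.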

Step 2: Construct the target SERP URL

Build the URL programmatically rather than stitching strings by hand at the call site. For a basic Google query only the q parameter is required, but production systems usually grow to support pagination, language flags, safe-search settings, device variants, and regional targeting. Centralizing URL construction means adding any of those later is a change in one function, not a hunt through the codebase.
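One way to centralize this is a single builder function, sketched below. The name build_serp_url and its arguments are hypothetical; start, hl, and gl are assumed to be the Google URL parameters for result offset, interface language, and country, so check them against current Google behavior before relying on them. The function takes the raw query, because urlencode applies the same encoding as quote_plus and a pre-encoded string would be encoded twice.

```python
from typing import Optional
from urllib.parse import urlencode

GOOGLE_SEARCH_URL = "https://www.google.com/search"


def build_serp_url(query: str, page: int = 1, results_per_page: int = 10,
                   language: Optional[str] = None,
                   country: Optional[str] = None) -> str:
    """Build a Google search URL from a raw (unencoded) query."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    params = {"q": query}
    if page > 1:
        # Assumed pagination scheme: "start" is the zero-based result offset.
        params["start"] = (page - 1) * results_per_page
    if language:
        params["hl"] = language  # assumed interface-language parameter
    if country:
        params["gl"] = country   # assumed country parameter
    return f"{GOOGLE_SEARCH_URL}?{urlencode(params)}"


print(build_serp_url("best coffee shops Paris"))
# https://www.google.com/search?q=best+coffee+shops+Paris
print(build_serp_url("best coffee shops Paris", page=3, language="en"))
# https://www.google.com/search?q=best+coffee+shops+Paris&start=20&hl=en
```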

Step 3: Route the request through the Smart AI Proxy

Direct requests to search engines fail quickly under load, so you configure your HTTP client to use the Smart AI Proxy as its outbound gateway. The pieces are the proxy endpoint, authentication with your Crawlbase token, and the standard proxy configuration your HTTP library already supports. From your application's point of view this behaves like any corporate proxy. The difference is that requests are transparently routed through infrastructure tuned for scraping workloads, so they get through where a direct request would not.
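The proxy settings can live in one small helper so every call site uses the same endpoint and the SSL exception stays scoped to proxied requests. This is a sketch: proxy_request_kwargs is a hypothetical name, while the host, port, and token-as-username format come from the connection details above.

```python
SMART_PROXY_HOST = "smartproxy.crawlbase.com"
SMART_PROXY_HTTPS_PORT = 8013


def proxy_request_kwargs(token: str, timeout: float = 30.0) -> dict:
    """Return keyword arguments for requests.get() that route via the proxy."""
    if not token:
        raise ValueError("a Crawlbase token is required")
    # The token is the proxy username; the password is left empty.
    proxy_url = f"https://{token}:@{SMART_PROXY_HOST}:{SMART_PROXY_HTTPS_PORT}"
    return {
        "proxies": {"http": proxy_url, "https": proxy_url},
        # Disabled only for calls that use these kwargs, not application-wide.
        "verify": False,
        "timeout": timeout,
    }


# Usage with the requests library (not executed here):
#   response = requests.get(url, headers=headers, **proxy_request_kwargs(token))
```

Because verify=False travels with the proxy settings, any request made without these keyword arguments keeps normal certificate verification.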

When you need a rendered page

Some result pages render their content client-side or guard it behind heavier challenges. When raw HTML is not enough, reach for the Crawling API, which renders the page in a real browser and returns finished HTML or markdown. The Smart AI Proxy is the drop-in choice for high-volume SERP fetching; the Crawling API is the tool when rendering or a JavaScript-heavy target gets in the way.

Step 4: Request structured results

The Smart AI Proxy can parse the HTML for you. Pass autoparse=true in the CrawlbaseAPI-Parameters header and the response comes back as JSON instead of raw markup. For Google that JSON includes the organic results, ads, the local pack, and related questions, alongside status fields. In most cases that removes manual HTML parsing entirely, which is one less brittle thing to maintain when the search engine tweaks its layout.
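A small builder keeps the header in one place as options accumulate. Treat this as a sketch under assumptions: crawlbase_headers is a hypothetical helper, and joining several options with "&" is an assumed format that should be confirmed against the Crawlbase documentation. The country option is the geo targeting parameter this guide otherwise leaves out, so it is off by default.

```python
from typing import Dict, Optional


def crawlbase_headers(autoparse: bool = True,
                      country: Optional[str] = None) -> Dict[str, str]:
    """Build the CrawlbaseAPI-Parameters header (hypothetical helper)."""
    options = []
    if autoparse:
        options.append("autoparse=true")
    if country:
        # Geo targeting is described in this guide as a premium capability.
        options.append(f"country={country}")
    if not options:
        return {}
    # Assumption: multiple options are joined with "&", like a query string.
    return {"CrawlbaseAPI-Parameters": "&".join(options)}


print(crawlbase_headers())
# {'CrawlbaseAPI-Parameters': 'autoparse=true'}
```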

Step 5: Validate the response before you use it

Production code should confirm a request succeeded before it touches the payload. The usual checks are the HTTP status code, the proxy status indicator, the presence of the fields you expect, and retry logic for transient failures. The fetcher in Step 6 does the basic version with raise_for_status(), which turns a failed HTTP response into an exception you can catch rather than a silent bad parse downstream.
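The payload-level checks are separate from raise_for_status() and might look like the sketch below. It assumes a successful response carries 200 in both original_status and cb_status (or the legacy pc_status), as either a number or a string, and that parsed Google results sit under body.searchResults, the key this guide reads in Step 7. SerpResponseError and validate_serp are hypothetical names.

```python
from typing import Any, Dict


class SerpResponseError(Exception):
    """Raised when a proxy response is not a usable parsed SERP."""


def validate_serp(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Return the parsed body if the payload passes basic checks."""
    if not isinstance(payload, dict):
        raise SerpResponseError("payload is not a JSON object")
    # Accept the current field name and fall back to the legacy one.
    cb_status = payload.get("cb_status", payload.get("pc_status"))
    statuses = {
        "original_status": payload.get("original_status"),
        "cb_status": cb_status,
    }
    for name, value in statuses.items():
        # Compare as strings so 200 and "200" are treated alike (assumption).
        if str(value) != "200":
            raise SerpResponseError(f"{name} is {value!r}, expected 200")
    body = payload.get("body")
    if not isinstance(body, dict) or "searchResults" not in body:
        raise SerpResponseError("body has no parsed searchResults")
    return body
```

Raising a dedicated exception lets a caller retry or log bad payloads without confusing them with network errors.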

Step 6: Build the end-to-end fetcher

Here is a minimal Google SERP fetcher that uses the Smart AI Proxy as its only path to Google. It configures the proxy with your token, sends a GET request to a Google search URL, and passes autoparse=true so the response is structured JSON. You get back original_status, cb_status (legacy pc_status), the requested url, and a body containing the parsed results. We leave out the country parameter so the snippet runs on the standard plan without any geo targeting.

python

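import json
import os

import requests
import urllib3
from urllib.parse import quote_plus
from urllib3.exceptions import InsecureRequestWarning

# verify=False below triggers an InsecureRequestWarning on every call; silence it.
urllib3.disable_warnings(InsecureRequestWarning)


def fetch_google_serp(token, query):
    """Fetch one Google results page through the Smart AI Proxy as parsed JSON."""
    # The token is the proxy username; the password stays empty.
    proxy = f"https://{token}:@smartproxy.crawlbase.com:8013"
    proxies = {"http": proxy, "https": proxy}
    # Steps 1 and 2: encode the query and build the search URL.
    url = f"https://www.google.com/search?q={quote_plus(query)}"
    # Step 4: ask the proxy for structured JSON instead of raw HTML.
    headers = {"CrawlbaseAPI-Parameters": "autoparse=true"}
    # Step 3: route the request through the proxy.
    response = requests.get(url, headers=headers, proxies=proxies, verify=False, timeout=30)
    # Step 5: raise on a 4xx or 5xx response rather than parsing an error page.
    response.raise_for_status()
    return json.loads(response.text)


if __name__ == "__main__":
    # Read the token from the environment; the fallback is only a placeholder.
    token = os.environ.get("CRAWLBASE_TOKEN", "YOUR_CRAWLBASE_TOKEN")
    data = fetch_google_serp(token, "best coffee shops Paris")
    print(f"Returned keys: {list(data.keys())}")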

That function is the data collection backbone of the whole system. It can run behind an API endpoint, inside a worker queue, or on a schedule, and it returns the same structured shape every time. The quote_plus call is the normalization from Step 1, the URL is the construction from Step 2, the proxies dict is the routing from Step 3, the header is the structured parsing from Step 4, and raise_for_status is the validation from Step 5. Five steps, one function.

Step 7: Turn the JSON into structured records

The autoparse payload is rich, but your product rarely needs all of it. The next step is to pull the body and reshape it into the lean records your storage and ranking code expects. The function below extracts the organic results into a predictable list of dictionaries with rank, title, URL, and snippet, falling back gracefully when a field is missing so one odd result does not break the batch.

python

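def to_records(serp, query):
    """Reduce an autoparse payload to lean rank/title/url/snippet records."""
    # "or" also covers keys that are present but null in the payload.
    body = serp.get("body") or {}
    results = body.get("searchResults") or []
    records = []
    # Rank follows the order in which results appear in the payload.
    for rank, item in enumerate(results, start=1):
        records.append({
            "query": query,
            "rank": rank,
            # Missing fields fall back to empty strings so one odd result
            # does not break the batch.
            "title": item.get("title", ""),
            "url": item.get("url", ""),
            "snippet": item.get("description", ""),
        })
    return records


if __name__ == "__main__":
    # Reuses fetch_google_serp and the imports from Step 6.
    token = os.environ.get("CRAWLBASE_TOKEN", "YOUR_CRAWLBASE_TOKEN")
    query = "best coffee shops Paris"
    serp = fetch_google_serp(token, query)
    rows = to_records(serp, query)
    print(json.dumps(rows[:3], indent=2))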

The output is a clean, consistent schema you can write straight to a database, feed into a ranking step, or return from your own API. Normalizing into a shape like this early is what lets you swap or add search engines later without rewriting everything downstream.

json

[
  {
    "query": "best coffee shops Paris",
    "rank": 1,
    "title": "The 10 Best Coffee Shops in Paris",
    "url": "https://example.com/paris-coffee",
    "snippet": "Our guide to the best specialty coffee in the city..."
  }
]
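Writing records in this shape to a database can be sketched with the standard-library sqlite3 module. The table name serp_results, its column names, and the save_records helper are illustrative assumptions; any store with a similar schema would do.

```python
import sqlite3
from typing import Dict, Iterable


def save_records(conn: sqlite3.Connection, records: Iterable[Dict]) -> int:
    """Upsert SERP records keyed by (query, rank); returns the row count."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS serp_results (
            search_query TEXT NOT NULL,
            position     INTEGER NOT NULL,
            title        TEXT,
            url          TEXT,
            snippet      TEXT,
            PRIMARY KEY (search_query, position)
        )
        """
    )
    rows = list(records)
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO serp_results "
            "(search_query, position, title, url, snippet) "
            "VALUES (:query, :rank, :title, :url, :snippet)",
            rows,
        )
    return len(rows)


conn = sqlite3.connect(":memory:")
sample = [{
    "query": "best coffee shops Paris",
    "rank": 1,
    "title": "The 10 Best Coffee Shops in Paris",
    "url": "https://example.com/paris-coffee",
    "snippet": "Our guide to the best specialty coffee in the city...",
}]
save_records(conn, sample)
print(conn.execute("SELECT position, title FROM serp_results").fetchall())
# [(1, 'The 10 Best Coffee Shops in Paris')]
```

Keying on the query and rank means re-running the same query overwrites the previous snapshot; add a timestamp column to the key if you want history instead.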

Step 8: Plug it into your application

With a fetcher and a normalizer in hand, the data feeds whatever you are building: a custom search interface, a competitive analysis tool, an SEO monitoring dashboard, a market research dataset, or training data for a model. Most systems normalize into a consistent schema before storage, exactly what Step 7 produced, so analytics and ranking have a stable shape to work against. The point is that the rest of your application never deals with proxies, CAPTCHAs, or layout changes. It deals with records.

Scaling the tool without breaking it

Scaling a SERP tool is a coordination problem across four axes, and the proxy layer makes each of them tractable.

Concurrency. Run a job queue with several workers all issuing requests through the same proxy endpoint. Rotation spreads that traffic across independent routes, so concurrency raises throughput instead of raising your block rate. For the deeper version of this, see how to rotate proxies for scraping Google search results.

Geo and device variation. When you need regional data, vary the location parameters across requests. The same query can return very different results depending on where it appears to originate, which is a feature to exploit rather than a problem to dodge.

Rate and cost control. Even with a proxy layer, unbounded traffic creates needless failures or expense. Simple client-side throttling usually settles this, keeping you inside sensible limits without complicated coordination.

Resilience. Expect the occasional transient error. Retry with backoff and watch the status codes so a temporary blip does not cascade into a larger failure. The broader anti-blocking playbook lives in how to scrape websites without getting blocked.
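Concurrency, throttling, and retries can be combined in a few lines, as the sketch below suggests. The names fetch_with_retry and fetch_many are hypothetical, the fetch argument stands for any single-query function such as the Step 6 fetcher bound to a token, and the bounded worker pool is only a crude form of client-side rate control.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, Iterable


def fetch_with_retry(fetch: Callable[[str], Any], query: str, attempts: int = 3,
                     base_delay: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> Any:
    """Call fetch(query), retrying failures with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch(query)
        except Exception:  # in real code, narrow this to transient errors
            if attempt == attempts - 1:
                raise
            # Waits base_delay, then 2x, then 4x, and so on.
            sleep(base_delay * (2 ** attempt))


def fetch_many(fetch: Callable[[str], Any], queries: Iterable[str],
               max_workers: int = 4, **retry_options: Any) -> Dict[str, Any]:
    """Fetch several queries concurrently; max_workers caps in-flight requests."""
    queries = list(queries)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(
            lambda q: fetch_with_retry(fetch, q, **retry_options), queries
        )
        return dict(zip(queries, results))


# Usage with the fetcher from Step 6 (not executed here):
#   serps = fetch_many(lambda q: fetch_google_serp(token, q), ["query one", "query two"])
```

If any query still fails after its last attempt, fetch_many re-raises that exception, so a caller that prefers partial results would need to catch errors per query instead.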

Limits worth knowing before you ship

This approach is a strong default, but it is not a silver bullet. Keep a few things in mind.

Autoparse covers common engines, not every page. Structured parsing is great for Google and similar SERPs, but if you scrape a niche or unusual results page you may still have to parse the HTML yourself. That is fine: route the raw fetch through the same proxy and parse locally.

Heavy rendering needs a browser. If a target hides results behind client-side rendering or a stiffer challenge, raw HTML will not cut it. That is when you move that request to the Crawling API for real-browser rendering while keeping the rest of your tool unchanged.

Cost scales with volume. The reliability is constant, but requests are not free. Throttle, cache repeat queries, and keep only the fields your product needs so you are not paying to fetch and store data you will never use. For CAPTCHA-specific edge cases on Google, bypassing CAPTCHA while scraping Google goes deeper.
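Caching repeat queries can be as simple as a small in-memory wrapper with a time-to-live. This is a sketch under assumptions: cached is a hypothetical helper, the one-hour default is arbitrary, and treating queries that differ only in case or spacing as the same key is a choice you may not want for every engine.

```python
import time
from typing import Any, Callable, Dict, Tuple


def cached(fetch: Callable[[str], Any], ttl_seconds: float = 3600.0,
           clock: Callable[[], float] = time.monotonic) -> Callable[[str], Any]:
    """Wrap a single-query fetch function with an in-memory TTL cache."""
    store: Dict[str, Tuple[float, Any]] = {}

    def wrapper(query: str) -> Any:
        # Assumption: case and extra whitespace do not change the results.
        key = " ".join(query.lower().split())
        now = clock()
        hit = store.get(key)
        if hit is not None and now - hit[0] < ttl_seconds:
            return hit[1]  # fresh entry, no request is sent
        value = fetch(query)
        store[key] = (now, value)
        return value

    return wrapper


# Usage with the fetcher from Step 6 (not executed here):
#   fetch = cached(lambda q: fetch_google_serp(token, q), ttl_seconds=900)
```

An in-process dictionary only helps a single worker; a shared cache would be needed once several workers serve the same queries.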

Recap

Key takeaways

Route everything through one endpoint. The Smart AI Proxy is the single outbound boundary of your tool, which keeps the rest of the system insulated from blocks, CAPTCHAs, and geo mismatches.
Normalize the query, centralize the URL. Encode user input with quote_plus and build search URLs in one function so adding pagination or regions later is a small change.
Let autoparse do the parsing. Pass autoparse=true in CrawlbaseAPI-Parameters and the SERP comes back as JSON, removing most manual HTML handling.
Reshape into lean records early. Turn the rich payload into a consistent schema up front so storage, ranking, and swapping search engines stay simple downstream.
Scale on four axes. Concurrency, geo variation, rate control, and retry-with-backoff together turn a demo script into something you can rely on in production.

Frequently Asked Questions (FAQs)

What is the easiest way to build a search engine tool that does not get blocked?

Route every outbound request through a rotating proxy instead of hitting search engines directly. The Smart AI Proxy gives you that as a drop-in endpoint: you configure your HTTP client to use it, send normal GET requests, and it distributes traffic across many IPs while absorbing anti-bot defenses. Your code stays focused on query logic and product features rather than proxy maintenance.

Do I need to parse the HTML myself?

Usually not for common search engines. Pass autoparse=true in the CrawlbaseAPI-Parameters header and the Smart AI Proxy returns structured JSON with organic results, ads, the local pack, and related questions already broken out. You only fall back to manual parsing for unusual result pages where no parser exists, and even then you route the raw fetch through the same proxy.

Why is SSL verification disabled when using the Smart AI Proxy?

The proxy needs to inspect traffic to apply its routing logic and response handling, so SSL verification for the destination is typically turned off. In Python that means setting verify=False on the request. It applies only to traffic going through the proxy, not to your application's other connections, and the warning that Python prints can be silenced as the example shows.

When should I use the Crawling API instead of the Smart AI Proxy?

Use the Smart AI Proxy as the default for high-volume SERP fetching with a standard HTTP client. Reach for the Crawling API when a target renders its results client-side or guards them behind heavier challenges, because it renders the page in a real browser and returns finished HTML or markdown. Many tools use the proxy for most queries and the Crawling API for the few stubborn targets.

How do I get region-specific search results?

Search results vary by location, so to get regional data you vary the location targeting per request. The same query issued from different regions returns different rankings and local listings, which you can use deliberately for localized products. Geo targeting is a premium capability, so the basic example here omits the country parameter to keep it runnable on the standard plan.

Can I use a language other than Python?

Yes. The Smart AI Proxy works with any language that can send standard HTTP requests, because it is a normal proxy endpoint with no required client library. Python is used here because it is easy to run locally, but the same pattern (configure the proxy, send the request, parse the JSON) applies directly in Node.js, Go, Java, C#, and others.

Ian Kalvin

Technical Support Engineer · Crawlbase

Technical support engineer at Crawlbase, writing from the front line of what actually breaks in production scraping and proxy setups.

Neil Zamora

Senior Architect · Crawlbase

Senior architect at Crawlbase, focused on the systems behind large-scale crawling: proxy rotation, anti-bot resilience, and the APIs that hide that complexity.
