<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Blog | Crawlbase</title>
  
  <subtitle>Avoid captchas while scraping and crawling</subtitle>
  <link href="https://crawlbase.com/blog/atom.xml" rel="self"/>
  
  <link href="https://crawlbase.com/blog/"/>
  <updated>2026-05-06T03:54:24.763Z</updated>
  <id>https://crawlbase.com/blog/</id>
  
  <author>
    <name>Crawlbase</name>
    
  </author>
  
  <generator uri="https://hexo.io/">Hexo</generator>
  
  <entry>
    <title>Crawlbase Delivers LLM-ready Markdown for Clean Web AI Data</title>
    <link href="https://crawlbase.com/blog/crawlbase-llm-ready-markdown-web-scraping/"/>
    <id>https://crawlbase.com/blog/crawlbase-llm-ready-markdown-web-scraping/</id>
    <published>2026-05-05T17:30:47.000Z</published>
    <updated>2026-05-06T03:54:24.763Z</updated>
    
    <content type="html"><![CDATA[<blockquote><p><strong>Direct Answer</strong>: Crawlbase now lets developers scrape web pages as clean Markdown instead of raw HTML or JSON. Add format&#x3D;md to your Crawling API request to receive Markdown, then add md_readability&#x3D;true to extract the main readable content before conversion. The result is cleaner web data that can move directly into LLM prompts, embeddings, AI agents, and RAG pipelines with far less preprocessing.</p></blockquote><span id="more"></span><p>Crawlbase delivers LLM-ready Markdown for clean web AI data through the <a href="https://crawlbase.com/crawling-api-avoid-captchas-blocks">Crawling API</a>. By adding the <code>format=md</code> parameter, developers can request web pages as Markdown instead of raw HTML. Adding <code>md_readability=true</code> further extracts the main readable content before conversion, reducing menus, scripts, and page clutter. The result is cleaner web data that can move directly into LLM prompts, RAG pipelines, embeddings, and AI agents without a separate HTML cleanup step.</p><p>To help developers test it quickly, Crawlbase also provides a ready demo project on GitHub:</p><p><a href="https://github.com/ScraperHub/crawlbase-delivers-llm-ready-markdown-for-clean-web-ai-data">ScraperHub&#x2F;crawlbase-delivers-llm-ready-markdown-for-clean-web-ai-data</a></p><p>The demo uses a lightweight Python script that reads your Crawlbase API token, requests a page with Markdown output enabled, then saves the response as a local <code>.md</code> file.</p><p>A typical page contains menus, scripts, tracking tags, sidebars, and layout markup that browsers need but models do not. 
Crawlbase enhances the workflow by returning cleaner content closer to the crawl itself through a practical Markdown output API built for modern AI pipelines.</p><div class="callout-banner">  <div class="banner-header">    <img      src="/blog/images/flashlight-icon-blue.png"      srcset="/blog/images/flashlight-icon-blue.png 1x, /blog/images/flashlight-icon-blue@2x.png 2x"      alt="Flashlight Icon"    />    <h2 class="banner-header-label">Try our AI-powered Proxies</h2>  </div>  <p class="banner-body">    Why use a standard backconnect proxy when you can use AI? Bypass blocks and scale your crawler with 1M+ rotating    IPs.  </p>  <div class="banner-footer">    <a href="https://crawlbase.com/signup?signup=blog-callout-cta" title="Claim 5,000 Free Credits"      >Claim 5,000 Free Credits</a    >    <img      src="/blog/images/arrow-right-double-green.png"      srcset="/blog/images/arrow-right-double-green.png 1x, /blog/images/arrow-right-double-green@2x.png 2x"      alt="Arrow right double Icon"    />  </div></div><h2 id="Table-of-Contents"><a href="#Table-of-Contents" class="headerlink" title="Table of Contents"></a>Table of Contents</h2><ul><li><a href="#Why-Markdown-Is-Better-Than-HTML-for-LLM-Pipelines">Why Markdown Is Better Than HTML for LLM Pipelines</a></li><li><a href="#How-Crawlbase-Markdown-Output-Works">How Crawlbase Markdown Output Works</a></li><li><a href="#format-md-vs-md-readability-true-Which-Mode-to-Use">Which Mode Should You Use?</a></li><li><a href="#Why-This-Matters-for-RAG-Pipelines">Why This Matters for RAG Pipelines</a></li><li><a href="#How-Crawlbase-Simplifies-Your-AI-Scraping-Stack">How Crawlbase Simplifies Your AI Scraping Stack</a></li><li><a href="#Simple-Python-Demo-Run-Crawlbase-Markdown-Output-in-Minutes">Simple Python Demo: Run Crawlbase Markdown Output in Minutes</a></li><li><a href="#What-the-Demo-Script-Outputs">What the Demo Script Outputs</a></li><li><a href="#Real-Use-Cases-for-LLM-ready-Web-Scraping">Real Use Cases for 
LLM-ready Web Scraping</a></li><li><a href="#Why-AI-Agents-Benefit-Most">Why AI Agents Benefit Most</a></li><li><a href="#Start-LLM-Ready-Web-Scraping-with-Crawlbase">Start LLM-Ready Web Scraping with Crawlbase</a></li><li><a href="#Frequently-Asked-Questions-FAQs">Frequently Asked Questions</a></li></ul><h2 id="Why-Markdown-Is-Better-Than-HTML-for-LLM-Pipelines"><a href="#Why-Markdown-Is-Better-Than-HTML-for-LLM-Pipelines" class="headerlink" title="Why Markdown Is Better Than HTML for LLM Pipelines"></a>Why Markdown Is Better Than HTML for LLM Pipelines</h2><p>HTML was built for rendering pages in a browser. <a href="https://www.markdownguide.org/getting-started/">Markdown</a> is much closer to what AI systems actually need: readable text with useful structure.</p><p>When raw HTML enters an LLM workflow, the model often has to sort through markup, boilerplate, and repeated page elements before it reaches the real content. That means tokens get wasted, chunking becomes messier, embeddings can become less precise, and summaries often need extra cleanup. AI agents can also become less reliable when their web tools return inconsistent or cluttered outputs.</p><p>Markdown removes most of that friction while keeping the important structure. Headings stay organized, paragraphs remain readable, lists are preserved, tables are easier to interpret, and links stay useful without being buried in code.</p><p>That makes Markdown easier to chunk, embed into a vector database, summarize, inspect manually, and pass directly into prompts or agent workflows.</p><p>For teams doing <strong>web scraping for AI</strong>, the output format is not a small detail. 
It directly affects downstream quality.</p><h2 id="How-Crawlbase-Markdown-Output-Works"><a href="#How-Crawlbase-Markdown-Output-Works" class="headerlink" title="How Crawlbase Markdown Output Works"></a>How Crawlbase Markdown Output Works</h2><p>Crawlbase supports native Markdown responses through the <a href="https://crawlbase.com/docs/crawling-api/">Crawling API</a>.</p><p>Simply add the <a href="https://crawlbase.com/docs/crawling-api/parameters/#format">format parameter</a> to your API request:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">format=md</span><br></pre></td></tr></table></figure><p>That tells Crawlbase to return Markdown instead of HTML.</p><p>To focus on the main page content, add:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">md_readability=true</span><br></pre></td></tr></table></figure><p>That enables readability extraction before conversion, helping remove surrounding clutter like menus, sidebars, and footer noise.</p><p>Basic cURL request format:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">curl <span class="string">&quot;https://api.crawlbase.com/?token=USER_TOKEN&amp;url=https%3A%2F%2Fexample.com&amp;format=md&amp;md_readability=true&quot;</span></span><br></pre></td></tr></table></figure><p>The result is cleaner <strong>LLM-ready web scraping</strong> output in one request.</p><h2 id="format-md-vs-md-readability-true-Which-Mode-to-Use"><a href="#format-md-vs-md-readability-true-Which-Mode-to-Use" class="headerlink" title="format=md vs md_readability=true: Which Mode to Use?"></a><code>format=md</code> vs <code>md_readability=true</code>: Which Mode to Use?</h2><p>Both options are useful depending on your 
workflow.</p><table><thead><tr><th>Request Mode</th><th>Best Use Case</th></tr></thead><tbody><tr><td><code>format=md</code></td><td>Preserve broader page context such as menus, related links, and navigation</td></tr><tr><td><code>format=md&amp;md_readability=true</code></td><td>Main content extraction for LLMs, RAG, and summarization</td></tr></tbody></table><p>If your goal is embeddings, search, or prompting, start with readability enabled.</p><p>If your goal is site structure analysis or broader content capture, plain Markdown may be better.</p><h2 id="Why-This-Matters-for-RAG-Pipelines"><a href="#Why-This-Matters-for-RAG-Pipelines" class="headerlink" title="Why This Matters for RAG Pipelines"></a>Why This Matters for RAG Pipelines</h2><p><a href="https://www.ibm.com/think/topics/retrieval-augmented-generation">RAG</a>, short for Retrieval-Augmented Generation, is a method that gives language models access to external knowledge before generating an answer. Instead of relying only on training data, the model retrieves relevant documents or text chunks first, then uses that context to respond.</p><p>A typical RAG workflow is simple: fetch content, split it into chunks, create embeddings, store them in a vector database, retrieve relevant passages later, then send that context to the model.</p><p>However, if the original page is filled with junk text, repeated menus, cookie banners, or irrelevant links, that noise gets chunked and indexed alongside the useful content. When that happens, retrieval quality drops and answers become weaker.</p><p>Cleaner Markdown gives your pipeline a better starting point. 
Each chunk is more likely to contain meaningful text instead of layout clutter, which improves retrieval and makes the final response more reliable.</p><p>That is why <strong>RAG pipeline web data</strong> quality matters long before you ever call the model.</p><h2 id="How-Crawlbase-Simplifies-Your-AI-Scraping-Stack"><a href="#How-Crawlbase-Simplifies-Your-AI-Scraping-Stack" class="headerlink" title="How Crawlbase Simplifies Your AI Scraping Stack"></a>How Crawlbase Simplifies Your AI Scraping Stack</h2><p>Without native Markdown output, many teams build something like this:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line">fetch HTML</span><br><span class="line">→ parse DOM</span><br><span class="line">→ remove scripts</span><br><span class="line">→ remove styles</span><br><span class="line">→ strip navigation</span><br><span class="line">→ extract article body</span><br><span class="line">→ normalize text</span><br><span class="line">→ convert to Markdown</span><br><span class="line">→ chunk</span><br><span class="line">→ embed</span><br></pre></td></tr></table></figure><p>In this case, a website redesign can break your selectors. A new cookie banner can pollute extracted text. A parser may work well on one page template and fail on another. 
Suddenly, engineers are spending time fixing cleanup logic instead of improving the AI product itself.</p><p>Crawlbase reduces that overhead by moving much of the formatting work closer to the crawl.</p><p>With Markdown output enabled, the workflow becomes much simpler:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">fetch Markdown with Crawlbase</span><br><span class="line">→ validate response</span><br><span class="line">→ chunk</span><br><span class="line">→ embed</span><br></pre></td></tr></table></figure><p>This means fewer failure points and more engineering time spent on retrieval quality, prompts, agents, and product features.</p><h2 id="Simple-Python-Demo-Run-Crawlbase-Markdown-Output-in-Minutes"><a href="#Simple-Python-Demo-Run-Crawlbase-Markdown-Output-in-Minutes" class="headerlink" title="Simple Python Demo: Run Crawlbase Markdown Output in Minutes"></a>Simple Python Demo: Run Crawlbase Markdown Output in Minutes</h2><p>Crawlbase has a ready demo project on GitHub that shows how to request Markdown output and save it locally.</p><p>Repository:</p><p><a href="https://github.com/ScraperHub/crawlbase-delivers-llm-ready-markdown-for-clean-web-ai-data">ScraperHub&#x2F;crawlbase-delivers-llm-ready-markdown-for-clean-web-ai-data</a></p><p>This demo keeps the setup intentionally small so developers can test fast.</p><h3 id="Step-1-Clone-the-Demo-Repository"><a href="#Step-1-Clone-the-Demo-Repository" class="headerlink" title="Step 1: Clone the Demo Repository"></a>Step 1: Clone the Demo Repository</h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">git <span class="built_in">clone</span> 
https://github.com/ScraperHub/crawlbase-delivers-llm-ready-markdown-for-clean-web-ai-data.git</span><br><span class="line"><span class="built_in">cd</span> crawlbase-delivers-llm-ready-markdown-for-clean-web-ai-data</span><br></pre></td></tr></table></figure><h3 id="Step-2-Create-a-Virtual-Environment"><a href="#Step-2-Create-a-Virtual-Environment" class="headerlink" title="Step 2: Create a Virtual Environment"></a>Step 2: Create a Virtual Environment</h3><p>Windows PowerShell</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">python -m venv .venv</span><br><span class="line">.\.venv\Scripts\Activate.ps1</span><br></pre></td></tr></table></figure><p>macOS &#x2F; Linux</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">python3 -m venv .venv</span><br><span class="line"><span class="built_in">source</span> .venv/bin/activate</span><br></pre></td></tr></table></figure><h3 id="Step-3-Install-Requirements"><a href="#Step-3-Install-Requirements" class="headerlink" title="Step 3: Install Requirements"></a>Step 3: Install Requirements</h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">pip install -r requirements.txt</span><br></pre></td></tr></table></figure><h3 id="Step-4-Add-Your-Crawlbase-API-Token"><a href="#Step-4-Add-Your-Crawlbase-API-Token" class="headerlink" title="Step 4: Add Your Crawlbase API Token"></a>Step 4: Add Your Crawlbase API Token</h3><p>Windows PowerShell</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="variable">$env</span>:CRAWLBASE_TOKEN=<span 
class="string">&quot;YOUR_TOKEN&quot;</span></span><br></pre></td></tr></table></figure><p>macOS &#x2F; Linux</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="built_in">export</span> CRAWLBASE_TOKEN=<span class="string">&quot;YOUR_TOKEN&quot;</span></span><br></pre></td></tr></table></figure><h3 id="Step-5-Run-the-Demo"><a href="#Step-5-Run-the-Demo" class="headerlink" title="Step 5: Run the Demo"></a>Step 5: Run the Demo</h3><p>Use the default sample URL:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">python crawlbase_markdown_demo.py</span><br></pre></td></tr></table></figure><p>Or crawl your own page:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">python crawlbase_markdown_demo.py --url <span class="string">&quot;https://example.com/&quot;</span></span><br></pre></td></tr></table></figure><h3 id="Step-6-Compare-With-and-Without-Readability"><a href="#Step-6-Compare-With-and-Without-Readability" class="headerlink" title="Step 6: Compare With and Without Readability"></a>Step 6: Compare With and Without Readability</h3><p>To keep broader page content:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">python crawlbase_markdown_demo.py --url <span class="string">&quot;https://example.com/&quot;</span> --no-md-readability</span><br></pre></td></tr></table></figure><h3 id="Step-7-Open-the-Output-File"><a href="#Step-7-Open-the-Output-File" class="headerlink" title="Step 7: Open the Output File"></a>Step 7: Open the Output File</h3><p>The script saves Markdown locally, usually to:</p><figure class="highlight plaintext"><table><tr><td 
class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">output/page.md</span><br></pre></td></tr></table></figure><p>Open that file in any editor and inspect the result.</p><div class="secondary-cta-banner">  <div class="gradient-bg">    <h3 class="banner-title">Get a <span class="text-underline">Free Smart AI Proxy Trial</span></h3>    <p class="banner-desc">      Leverage 5,000 free credits, 140M rotating proxies, and AI to bypass CAPTCHAs and avoid blocks.    </p>    <div class="banner-features">      <ul class="features-list">        <li>Unlimited Bandwidth</li>        <li>Custom Geolocalization</li>        <li>100% Network Uptime</li>      </ul>      <a        class="banner-btn"        href="/signup?signup=blog-smart-cta"        title="Get 5,000 Free Credits"        onclick="gtag('event', 'smart_cta_click', { 'blog_group': 'smart_proxy', 'blog_slug': 'crawlbase-llm-ready-markdown-web-scraping', 'cta_type': 'try_smart_proxy', 'cta_position': 'top','cta_version': 'smart_proxy_v2'});"        >Get 5,000 Free Credits</a      >    </div>  </div></div><h2 id="What-the-Demo-Script-Outputs"><a href="#What-the-Demo-Script-Outputs" class="headerlink" title="What the Demo Script Outputs"></a>What the Demo Script Outputs</h2><p>Once the demo runs successfully, it does two things: it saves the Markdown response to a local file and prints a short crawl summary in the terminal.</p><p>A typical output looks like this:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">Original status: 200</span><br><span class="line">Crawlbase status: 200</span><br><span class="line">Content-Type: text/markdown; charset=utf-8</span><br><span class="line">Markdown flavor: GitHub Flavored 
Markdown (GFM)</span><br><span class="line">Readability extraction: false</span><br><span class="line">Saved to: output\page.md</span><br></pre></td></tr></table></figure><p>This gives you immediate confirmation that the request worked, what the target site returned, and where the Markdown file was saved.</p><p>If a page redirects, times out, or returns incomplete content, your pipeline should know before it stores bad data or indexes weak content. Small checks at the ingestion stage can prevent bigger issues later in retrieval and answer quality.</p><img src="/blog/crawlbase-llm-ready-markdown-web-scraping/crawlbase-llm-ready-markdown-web-scraping-output.jpg" class="" title="Snapshot of the generated .md file, targeting an Amazon SERP URL."><p>The generated Markdown file can capture product titles, links, category text, navigation labels, and page structure in a readable format. Instead of raw HTML full of scripts and layout code, you get structured text that is easier to inspect and process.</p><p>That makes it far more practical for <strong>web scraping for AI</strong>, internal search tools, or cleaner <strong>RAG pipeline web data</strong> ingestion.</p><h2 id="Real-Use-Cases-for-LLM-ready-Web-Scraping"><a href="#Real-Use-Cases-for-LLM-ready-Web-Scraping" class="headerlink" title="Real Use Cases for LLM-ready Web Scraping"></a>Real Use Cases for LLM-ready Web Scraping</h2><p>Markdown output becomes useful anywhere web content needs to become model-ready context.</p><ul><li><strong>Documentation Chatbots:</strong> Keep product docs or help centers current by turning documentation pages into clean Markdown chunks for search and retrieval.</li><li><strong>AI Research Agents:</strong> Fetch articles, reports, filings, or public resources in a format models can read quickly.</li><li><strong>Competitor Monitoring:</strong> Track pricing pages, feature pages, changelogs, and announcements without parsing raw HTML every time.</li><li><strong>Internal Search Systems:</strong> Build 
searchable knowledge indexes using cleaner source material from across the web.</li><li><strong>Summarization Pipelines:</strong> Convert long pages into concise summaries with less preprocessing work.</li></ul><p>These are practical examples of LLM-ready web scraping where output quality directly affects results.</p><h2 id="Why-AI-Agents-Benefit-Most"><a href="#Why-AI-Agents-Benefit-Most" class="headerlink" title="Why AI Agents Benefit Most"></a>Why AI Agents Benefit Most</h2><p>AI agents often perform better when their tools return predictable, readable outputs.</p><p>If an agent fetches raw HTML, the model has to work through tags, layout code, and clutter before it can understand the page. That wastes tokens and adds friction.</p><p>If the same tool returns readability-filtered Markdown, the model receives something much closer to a usable document from the start.</p><p>That makes it easier to summarize pages, extract fields, compare sources, decide next actions, and cite evidence. For teams building autonomous workflows, cleaner tool output often leads to a cleaner agent loop.</p><h2 id="Start-LLM-Ready-Web-Scraping-with-Crawlbase"><a href="#Start-LLM-Ready-Web-Scraping-with-Crawlbase" class="headerlink" title="Start LLM-Ready Web Scraping with Crawlbase"></a>Start LLM-Ready Web Scraping with Crawlbase</h2><p>The web has no shortage of valuable information. The real challenge is turning that information into something AI systems can use efficiently.</p><p>Raw HTML often creates unnecessary cleanup work, especially for teams building retrieval systems, AI agents, and search workflows. Crawlbase removes most of that friction by returning clean Markdown directly from the crawl itself.</p><p>That makes Crawlbase a practical Markdown output API for teams focused on LLM-ready data and modern <strong>web scraping for AI</strong> use cases. 
Instead of spending engineering time stripping HTML, you can move faster on chunking, embeddings, retrieval quality, and product features that matter.</p><p>For companies building search systems or retrieval workflows, cleaner source content also leads to stronger <strong>RAG pipeline web data</strong> from the start.</p><p><a href="https://crawlbase.com/signup?signup=blog">Start using Crawlbase Markdown output</a> today. Use your 1,000 free requests to test cleaner AI-ready web data on your own URLs.</p><h2 id="Frequently-Asked-Questions-FAQs"><a href="#Frequently-Asked-Questions-FAQs" class="headerlink" title="Frequently Asked Questions (FAQs)"></a>Frequently Asked Questions (FAQs)</h2><h3 id="1-What-is-LLM-ready-web-scraping"><a href="#1-What-is-LLM-ready-web-scraping" class="headerlink" title="1. What is LLM-ready web scraping?"></a>1. What is LLM-ready web scraping?</h3><p>LLM-ready web scraping means collecting web content in a format that language models can use immediately with minimal cleanup. Instead of raw HTML filled with scripts, styling, and navigation clutter, the output is cleaner, structured text such as Markdown that is easier to chunk, embed, summarize, and pass into prompts.</p><h3 id="2-Why-is-Markdown-better-than-HTML-for-RAG-pipelines"><a href="#2-Why-is-Markdown-better-than-HTML-for-RAG-pipelines" class="headerlink" title="2. Why is Markdown better than HTML for RAG pipelines?"></a>2. Why is Markdown better than HTML for RAG pipelines?</h3><p>Markdown is usually better for RAG because it preserves useful structure like headings, lists, links, and tables without unnecessary markup. That creates cleaner chunks, better embeddings, and more relevant retrieval results compared with noisy raw HTML.</p><h3 id="3-How-do-I-get-Markdown-output-from-Crawlbase"><a href="#3-How-do-I-get-Markdown-output-from-Crawlbase" class="headerlink" title="3. How do I get Markdown output from Crawlbase?"></a>3. 
How do I get Markdown output from Crawlbase?</h3><p>Use the Crawlbase Crawling API and add <code>format=md</code> to your request. If you also want main-content extraction before conversion, add <code>md_readability=true</code>. This returns cleaner Markdown that can be used directly in AI workflows, search systems, or agent pipelines.</p>]]></content>
    
    
    <summary type="html">&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Direct Answer&lt;/strong&gt;: Crawlbase now lets developers scrape web pages as clean Markdown instead of raw HTML or JSON. Add format&amp;#x3D;md to your Crawling API request to receive Markdown, then add md_readability&amp;#x3D;true to extract the main readable content before conversion. The result is cleaner web data that can move directly into LLM prompts, embeddings, AI agents, and RAG pipelines with far less preprocessing.&lt;/p&gt;
&lt;/blockquote&gt;</summary>
    
    
    
    <category term="crawling and scraping learning" scheme="https://crawlbase.com/blog/categories/crawling-and-scraping-learning/"/>
    
    
    <category term="crawlbase crawling api" scheme="https://crawlbase.com/blog/tags/crawlbase-crawling-api/"/>
    
    <category term="LLM-ready web scraping" scheme="https://crawlbase.com/blog/tags/LLM-ready-web-scraping/"/>
    
    <category term="crawlbase markdown output" scheme="https://crawlbase.com/blog/tags/crawlbase-markdown-output/"/>
    
    <category term="web scraping for ai" scheme="https://crawlbase.com/blog/tags/web-scraping-for-ai/"/>
    
    <category term="rag pipeline web data" scheme="https://crawlbase.com/blog/tags/rag-pipeline-web-data/"/>
    
    <category term="markdown output api" scheme="https://crawlbase.com/blog/tags/markdown-output-api/"/>
    
    <category term="html to markdown scraping" scheme="https://crawlbase.com/blog/tags/html-to-markdown-scraping/"/>
    
    <category term="clean web data for llms" scheme="https://crawlbase.com/blog/tags/clean-web-data-for-llms/"/>
    
    <category term="ai agent web scraping" scheme="https://crawlbase.com/blog/tags/ai-agent-web-scraping/"/>
    
    <category term="web scraping RAG pipeline" scheme="https://crawlbase.com/blog/tags/web-scraping-RAG-pipeline/"/>
    
  </entry>
  
  <entry>
    <title>Build AI Data Pipelines with LangChain &amp; Crawlbase</title>
    <link href="https://crawlbase.com/blog/build-ai-data-pipeline-with-langchain-and-crawlbase/"/>
    <id>https://crawlbase.com/blog/build-ai-data-pipeline-with-langchain-and-crawlbase/</id>
    <published>2026-04-24T02:14:58.000Z</published>
    <updated>2026-04-24T11:53:22.979Z</updated>
    
    <content type="html"><![CDATA[<blockquote><p><strong>Direct Answer</strong>: Crawlbase integrates with LangChain as a specialized tool within an agentic workflow, enabling real-time web data retrieval during execution. This allows LLMs to fetch, process, and use live web content, producing grounded responses instead of relying solely on static training data.</p></blockquote><span id="more"></span><p>To use Crawlbase with <a href="https://docs.langchain.com/oss/python/langchain/overview">LangChain</a>, you integrate it as a tool inside a LangChain or LangGraph agent so your model can fetch real-time web data during execution.</p><p>This guide’s goal is to give you a working implementation you can run, test, and iterate on.</p><p>Instead of relying only on pre-trained knowledge, the agent can decide when it needs fresh or external information, call Crawlbase, process the response, and use that data to generate grounded outputs.</p><p>This means wrapping Crawlbase’s <a href="https://crawlbase.com/crawling-api-avoid-captchas-blocks">Crawling API</a> as a LangChain tool, registering it in a ReAct-style agent, and letting the model choose when to fetch data versus when to answer directly. The result is a simple but powerful pipeline:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">user query → agent reasoning → Crawlbase fetch → structured text → grounded response</span><br></pre></td></tr></table></figure><p>Crawlbase handles proxy rotation, blocking, and JavaScript rendering. LangChain handles orchestration. 
The model focuses on reasoning.</p><p>If you want to follow along with a complete, runnable project, you can clone it here: <a href="https://github.com/ScraperHub/how-to-use-crawlbase-with-langchain-for-ai-data-pipelines">ScraperHub&#x2F;how-to-use-crawlbase-with-langchain-for-ai-data-pipelines</a></p><div class="callout-banner">  <div class="banner-header">    <img      src="/blog/images/flashlight-icon-blue.png"      srcset="/blog/images/flashlight-icon-blue.png 1x, /blog/images/flashlight-icon-blue@2x.png 2x"      alt="Flashlight Icon"    />    <h2 class="banner-header-label">Try our AI-powered Proxies</h2>  </div>  <p class="banner-body">    Why use a standard backconnect proxy when you can use AI? Bypass blocks and scale your crawler with 1M+ rotating    IPs.  </p>  <div class="banner-footer">    <a href="https://crawlbase.com/signup?signup=blog-callout-cta" title="Claim 5,000 Free Credits"      >Claim 5,000 Free Credits</a    >    <img      src="/blog/images/arrow-right-double-green.png"      srcset="/blog/images/arrow-right-double-green.png 1x, /blog/images/arrow-right-double-green@2x.png 2x"      alt="Arrow right double Icon"    />  </div></div><h3 id="Jump-to-section"><a href="#Jump-to-section" class="headerlink" title="Jump to section"></a>Jump to section</h3><ul><li><a href="#Why-Use-Crawlbase-for-LangChain-Data-Pipelines">Why Use Crawlbase for LangChain Data Pipelines?</a></li><li><a href="#Architecture-The-Flow-of-a-Grounded-AI-Agent">Architecture: The Flow of a Grounded AI Agent</a></li><li><a href="#Implementation-Setting-Up-the-Project">Implementation: Setting Up the Project</a></li><li><a href="#How-to-Know-It%E2%80%99s-Working">How to Know It’s Working</a></li><li><a href="#Frequently-asked-questions">Frequently Asked Questions</a></li></ul><h2 id="Why-Use-Crawlbase-for-LangChain-Data-Pipelines"><a href="#Why-Use-Crawlbase-for-LangChain-Data-Pipelines" class="headerlink" title="Why Use Crawlbase for LangChain Data Pipelines?"></a>Why Use Crawlbase 
for LangChain Data Pipelines?</h2><p>A <a href="https://www.langchain.com/">LangChain</a> agent is an LLM-powered system that can decide what actions to take instead of just generating text. It doesn’t just answer questions. It can choose to call tools, fetch data, or perform multi-step reasoning based on the user’s input.</p><p>The moment you give an agent that kind of freedom, you run into a practical issue. It needs access to real data, and the web is the obvious source. That’s also where things usually start breaking.</p><p>Standard scraping approaches quickly become a problem due to blocking, dynamic content, and scaling challenges. Crawlbase solves these at the infrastructure level so your agent logic stays clean.</p><p>Instead of managing proxies, retries, or headless browsers, your agent simply calls a tool and receives structured output.</p><p>This enables a more robust system where:</p><ul><li>The agent works with clean, readable text instead of raw HTML</li><li>JavaScript-heavy pages can be fetched when needed</li><li>Failures are surfaced as structured signals, not silent errors</li><li>You avoid maintaining a separate scraping infrastructure</li></ul><p>More importantly, it improves the quality of your outputs.</p><p>Without real data, the model is guessing based on what it already knows. With Crawlbase in the loop, it can fetch current information and base its response on something concrete. That’s what turns a generic answer into something you can actually rely on.</p><h2 id="Architecture-The-Flow-of-a-Grounded-AI-Agent"><a href="#Architecture-The-Flow-of-a-Grounded-AI-Agent" class="headerlink" title="Architecture: The Flow of a Grounded AI Agent"></a>Architecture: The Flow of a Grounded AI Agent</h2><p>At a high level, this system is built around three layers, each with a very specific job.</p><ul><li><strong>CrawlbaseClient</strong> handles the actual web requests. 
It talks to the Crawlbase Crawling API, switches between regular and JavaScript tokens when needed, and returns structured responses.</li><li><strong>fetch_web_page tool</strong> sits in the middle. It takes the raw HTML from Crawlbase, cleans it up, and turns it into readable text that the model can work with.</li><li><strong>LangGraph agent</strong> is the decision-maker. It looks at the user’s query and decides whether it needs to fetch data or can answer directly.</li></ul><p>The flow looks like this:</p><img src="/blog/build-ai-data-pipeline-with-langchain-and-crawlbase/build-ai-data-pipeline-with-langchain-and-crawlbase-workflow.jpg" class="" title="Image of Crawlbase and LangChain Workflow"><p>When a user sends a query, the agent first tries to reason through it. If the answer requires fresh or external data, it calls the <code>fetch_web_page</code> tool.</p><p>That tool then sends a request through Crawlbase, which handles all the messy parts, such as proxies, blocking, and JavaScript rendering. Once the page is retrieved, Crawlbase returns it as a structured response that carries the page body and status fields.</p><p>From there, the tool strips out the HTML, cleans the content, and trims it down so it fits within the model’s limits. That cleaned text is passed back to the agent, and the model uses it to generate a grounded answer.</p><p>The key idea here is the separation of concerns.</p><p>The model focuses on reasoning. The tool focuses on preparing usable data. Crawlbase handles everything related to accessing the web.</p><p>Because each layer has a clear role, the system is easier to maintain and scale. 
You can change how the agent thinks without touching how data is fetched, and vice versa.</p><h2 id="Implementation-Setting-Up-the-Project"><a href="#Implementation-Setting-Up-the-Project" class="headerlink" title="Implementation: Setting Up the Project"></a>Implementation: Setting Up the Project</h2><p>Now, let’s walk through this the same way you would actually set it up on your machine.</p><h3 id="Step-1-Prerequisites"><a href="#Step-1-Prerequisites" class="headerlink" title="Step 1: Prerequisites"></a>Step 1: Prerequisites</h3><p>Before you start, make sure you have the following:</p><ul><li><a href="https://www.python.org/downloads/">Python 3.11</a> or newer, since it aligns well with modern LangChain and LangGraph setups.</li><li><a href="https://crawlbase.com/dashboard/account/docs">Crawlbase tokens</a>: Use the regular token for most HTML pages, and keep a JavaScript token ready for sites that rely on client-side rendering.</li><li>An <a href="https://platform.claude.com/docs/en/api/admin/api_keys/retrieve">Anthropic API key</a> for the sample agent. 
If you plan to use a different provider later, the overall pattern stays the same.</li></ul><h3 id="Step-2-Clone-the-repository"><a href="#Step-2-Clone-the-repository" class="headerlink" title="Step 2: Clone the repository"></a>Step 2: Clone the repository</h3><p>In your terminal, run:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">git <span class="built_in">clone</span> https://github.com/ScraperHub/how-to-use-crawlbase-with-langchain-for-ai-data-pipelines.git</span><br><span class="line"><span class="built_in">cd</span> how-to-use-crawlbase-with-langchain-for-ai-data-pipelines</span><br></pre></td></tr></table></figure><p>This downloads the project into a new folder and gives you a complete, working project with all components already wired together.</p><h3 id="Step-3-Configure-environment-variables"><a href="#Step-3-Configure-environment-variables" class="headerlink" title="Step 3: Configure environment variables"></a>Step 3: Configure environment variables</h3><p>Create a <code>.env</code> file in the project root:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">CRAWLBASE_REGULAR_TOKEN=your_token</span><br><span class="line">CRAWLBASE_JS_TOKEN=your_js_token</span><br><span class="line">ANTHROPIC_API_KEY=your_key</span><br></pre></td></tr></table></figure><p>These credentials allow the agent to fetch web data and generate responses.</p><h3 id="Step-4-Install-dependencies"><a href="#Step-4-Install-dependencies" class="headerlink" title="Step 4: Install dependencies"></a>Step 4: Install dependencies</h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">pip install 
langgraph langchain langchain-core langchain-anthropic httpx python-dotenv pydantic pytest</span><br></pre></td></tr></table></figure><p>Using a virtual environment is recommended if you’re managing multiple Python projects.</p><h3 id="Optional-quick-smoke-test"><a href="#Optional-quick-smoke-test" class="headerlink" title="Optional: quick smoke test"></a>Optional: quick smoke test</h3><p>Before wiring this into LangChain, it’s a good idea to confirm your token works.</p><p>You can run a simple live test like this:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span 
class="line">45</span><br><span class="line">46</span><br></pre></td><td class="code"><pre><span class="line"><span class="string">&quot;&quot;&quot;Optional live smoke test against Crawlbase (no mocks).</span></span><br><span class="line"><span class="string">Set RUN_CRAWLBASE_LIVE=1 and valid CRAWLBASE_REGULAR_TOKEN, then run:</span></span><br><span class="line"><span class="string">    python code/scripts/smoke_crawlbase.py</span></span><br><span class="line"><span class="string">&quot;&quot;&quot;</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">from</span> __future__ <span class="keyword">import</span> annotations</span><br><span class="line"></span><br><span class="line"><span class="keyword">import</span> os</span><br><span class="line"><span class="keyword">import</span> sys</span><br><span class="line"><span class="keyword">from</span> pathlib <span class="keyword">import</span> Path</span><br><span class="line"></span><br><span class="line"><span class="comment"># Allow imports from code/</span></span><br><span class="line">_CODE = Path(__file__).resolve().parent.parent</span><br><span class="line"><span class="keyword">if</span> <span class="built_in">str</span>(_CODE) <span class="keyword">not</span> <span class="keyword">in</span> sys.path:</span><br><span class="line">    sys.path.insert(<span class="number">0</span>, <span class="built_in">str</span>(_CODE))</span><br><span class="line"></span><br><span class="line"><span class="keyword">from</span> dotenv <span class="keyword">import</span> load_dotenv</span><br><span class="line"></span><br><span class="line">load_dotenv(_CODE / <span class="string">&quot;.env&quot;</span>)</span><br><span class="line">load_dotenv(_CODE.parent / <span class="string">&quot;.env&quot;</span>)</span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">main</span>() -&gt; <span 
class="built_in">int</span>:</span><br><span class="line">    <span class="keyword">if</span> os.environ.get(<span class="string">&quot;RUN_CRAWLBASE_LIVE&quot;</span>) != <span class="string">&quot;1&quot;</span>:</span><br><span class="line">        <span class="built_in">print</span>(<span class="string">&quot;Set RUN_CRAWLBASE_LIVE=1 to run this smoke test.&quot;</span>, file=sys.stderr)</span><br><span class="line">        <span class="keyword">return</span> <span class="number">0</span></span><br><span class="line"></span><br><span class="line">    <span class="keyword">from</span> crawlbase_client <span class="keyword">import</span> CrawlbaseClient</span><br><span class="line"></span><br><span class="line">    token = os.environ.get(<span class="string">&quot;CRAWLBASE_REGULAR_TOKEN&quot;</span>, <span class="string">&quot;&quot;</span>).strip()</span><br><span class="line">    <span class="keyword">if</span> <span class="keyword">not</span> token:</span><br><span class="line">        <span class="built_in">print</span>(<span class="string">&quot;CRAWLBASE_REGULAR_TOKEN is required.&quot;</span>, file=sys.stderr)</span><br><span class="line">        <span class="keyword">return</span> <span class="number">1</span></span><br><span class="line"></span><br><span class="line">    url = os.environ.get(<span class="string">&quot;CRAWLBASE_SMOKE_URL&quot;</span>, <span class="string">&quot;https://example.com/&quot;</span>)</span><br><span class="line">    <span class="keyword">with</span> CrawlbaseClient.from_env() <span class="keyword">as</span> client:</span><br><span class="line">        data = client.fetch_json(url, use_javascript=<span class="literal">False</span>)</span><br><span class="line"></span><br><span class="line">    pc = data.get(<span class="string">&quot;pc_status&quot;</span>)</span><br><span class="line">    <span class="built_in">print</span>(<span class="string">f&quot;OK url=<span class="subst">&#123;data.get(<span 
class="string">&#x27;url&#x27;</span>)&#125;</span> pc_status=<span class="subst">&#123;pc&#125;</span> original_status=<span class="subst">&#123;data.get(<span class="string">&#x27;original_status&#x27;</span>)&#125;</span>&quot;</span>)</span><br><span class="line">    body = data.get(<span class="string">&quot;body&quot;</span>, <span class="string">&quot;&quot;</span>)</span><br><span class="line">    preview = (body[:<span class="number">200</span>] + <span class="string">&quot;...&quot;</span>) <span class="keyword">if</span> <span class="built_in">isinstance</span>(body, <span class="built_in">str</span>) <span class="keyword">and</span> <span class="built_in">len</span>(body) &gt; <span class="number">200</span> <span class="keyword">else</span> body</span><br><span class="line">    <span class="built_in">print</span>(<span class="string">&quot;body preview:&quot;</span>, preview)</span><br><span class="line">    <span class="keyword">return</span> <span class="number">0</span> <span class="keyword">if</span> pc == <span class="number">200</span> <span class="keyword">or</span> <span class="built_in">str</span>(pc) == <span class="string">&quot;200&quot;</span> <span class="keyword">else</span> <span class="number">2</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> __name__ == <span class="string">&quot;__main__&quot;</span>:</span><br><span class="line">    <span class="keyword">raise</span> SystemExit(main())</span><br></pre></td></tr></table></figure><p>This step isn’t required, but it helps you isolate issues early. If this works, you know your Crawlbase setup is correct before adding the agent on top.</p><h3 id="Step-5-Run-the-project"><a href="#Step-5-Run-the-project" class="headerlink" title="Step 5: Run the project"></a>Step 5: Run the project</h3><p>Now you can run the agent. 
From the same folder:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">python main.py <span class="string">&quot;latest AI news today&quot;</span></span><br></pre></td></tr></table></figure><p>Or pass input via stdin:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="built_in">echo</span> <span class="string">&quot;summarize https://example.com&quot;</span> | python main.py</span><br></pre></td></tr></table></figure><p>When you execute the command, you’re triggering a full agent loop.</p><p>Your query is passed to the LangGraph agent, which decides whether it can answer directly or needs external data. If it does, it calls the <code>fetch_web_page</code> tool.</p><p>That tool sends a request to Crawlbase, retrieves the page, converts it into clean text, and returns it to the agent. The model then uses that data to produce a grounded response.</p><p>This is the core behavior you’re building: an agent that can decide when to fetch real-time information and use it effectively.</p><p>For a complete breakdown of the project structure and setup options, see the <a href="https://github.com/ScraperHub/how-to-use-crawlbase-with-langchain-for-ai-data-pipelines/blob/main/README.md">README</a>.</p>
<h2 id="How-to-Know-It’s-Working"><a href="#How-to-Know-It’s-Working" class="headerlink" title="How to Know It’s Working"></a>How to Know It’s Working</h2><p>Once everything is set up correctly, the output should feel noticeably different from a standard LLM response.</p><p>If you ask about recent events, the answer should reflect current information. If you pass a specific URL, the response should clearly use content from that page.</p><p>You’ll also notice the model behaves differently depending on the query. Sometimes it answers immediately. Other times, it decides to fetch data first.</p><p>If something isn’t working, the output usually tells you why. Missing tokens, blocked pages, or JavaScript-heavy sites will surface as readable messages instead of silent failures.</p><p>These aren’t problems with your setup. 
They’re signals that your system is reacting to real-world conditions.</p><h2 id="Conclusion-From-Static-LLMs-to-Live-Data-Agents"><a href="#Conclusion-From-Static-LLMs-to-Live-Data-Agents" class="headerlink" title="Conclusion: From Static LLMs to Live Data Agents"></a>Conclusion: From Static LLMs to Live Data Agents</h2><p>Integrating Crawlbase with LangChain turns your LLM from a static responder into a system that can access and verify real-time information.</p><p>Instead of relying on outdated knowledge or guessing, your agent can fetch live content, adapt to changes, and produce grounded answers.</p><p>This pattern becomes essential as soon as your application depends on fresh data, whether it’s news, pricing, documentation, or competitive intelligence.</p><p><a href="https://crawlbase.com/signup?signup=blog">Create a Crawlbase account</a>, generate your tokens, and drop them into the project. You get 1,000 free requests to test real queries against a real pipeline before you commit to anything.</p><h2 id="Frequently-asked-questions"><a href="#Frequently-asked-questions" class="headerlink" title="Frequently asked questions"></a>Frequently asked questions</h2><h3 id="When-should-use-javascript-be-true"><a href="#When-should-use-javascript-be-true" class="headerlink" title="When should use_javascript be true?"></a>When should <code>use_javascript</code> be true?</h3><p>Use it when the content you need isn’t present in the initial HTML and is rendered client-side. This is common with modern frontend frameworks like React or sites that load content dynamically after page load.</p><p>In this setup, the model is guided by the system prompt to decide when to enable this. 
When <code>use_javascript</code> is true, the <code>CrawlbaseClient</code> automatically switches to your JavaScript token.</p><h3 id="What-if-a-site-blocks-crawlers"><a href="#What-if-a-site-blocks-crawlers" class="headerlink" title="What if a site blocks crawlers?"></a>What if a site blocks crawlers?</h3><p>When a site blocks crawling, Crawlbase returns a non-200 <code>pc_status</code>, and your tool surfaces that as a readable message instead of failing silently.</p><p>From there, the agent can adapt. It might try the same URL using JavaScript rendering, switch to a different source, or adjust its response based on what it knows. At the product level, it’s also worth planning for fallback strategies, such as pointing users to official APIs or handling edge cases manually when needed.</p><h3 id="How-do-I-scale-this-beyond-a-demo"><a href="#How-do-I-scale-this-beyond-a-demo" class="headerlink" title="How do I scale this beyond a demo?"></a>How do I scale this beyond a demo?</h3><p>Once you move past small, synchronous requests, the easiest path is to switch to Crawlbase’s <a href="https://crawlbase.com/anonymous-crawler-asynchronous-scraping">Enterprise Crawler</a>. It’s designed for async, high-volume workloads and fits directly into your existing setup.</p><p>You don’t need to rebuild anything. Just configure a webhook and <a href="https://crawlbase.com/docs/crawler/pushing/#pushing-data-to-the-enterprise-crawler">add a couple of parameters</a> to your current Crawling API requests.</p><p>From there, your pipeline becomes asynchronous. Your agent triggers crawls, and your system processes results as they arrive. Crawlbase continues handling the web access side, so you can focus on making your pipeline more reliable as it scales.</p>]]></content>
    
    
    <summary type="html">&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Direct Answer&lt;/strong&gt;: Crawlbase integrates with LangChain as a specialized tool within an agentic workflow, enabling real-time web data retrieval during execution. This allows LLMs to fetch, process, and use live web content, producing grounded responses instead of relying solely on static training data.&lt;/p&gt;
&lt;/blockquote&gt;</summary>
    
    
    
    <category term="crawling and scraping learning" scheme="https://crawlbase.com/blog/categories/crawling-and-scraping-learning/"/>
    
    
    <category term="LangChain AI agents" scheme="https://crawlbase.com/blog/tags/LangChain-AI-agents/"/>
    
    <category term="Crawlbase API" scheme="https://crawlbase.com/blog/tags/Crawlbase-API/"/>
    
    <category term="AI web scraping" scheme="https://crawlbase.com/blog/tags/AI-web-scraping/"/>
    
    <category term="real-time data for LLMs" scheme="https://crawlbase.com/blog/tags/real-time-data-for-LLMs/"/>
    
    <category term="LangGraph agent tutorial" scheme="https://crawlbase.com/blog/tags/LangGraph-agent-tutorial/"/>
    
    <category term="build AI pipelines Python" scheme="https://crawlbase.com/blog/tags/build-AI-pipelines-Python/"/>
    
    <category term="web scraping API" scheme="https://crawlbase.com/blog/tags/web-scraping-API/"/>
    
    <category term="LLM data pipelines" scheme="https://crawlbase.com/blog/tags/LLM-data-pipelines/"/>
    
    <category term="AI agent tools LangChain" scheme="https://crawlbase.com/blog/tags/AI-agent-tools-LangChain/"/>
    
    <category term="Crawlbase LangChain integration" scheme="https://crawlbase.com/blog/tags/Crawlbase-LangChain-integration/"/>
    
    <category term="dynamic data AI agents" scheme="https://crawlbase.com/blog/tags/dynamic-data-AI-agents/"/>
    
    <category term="scraping JavaScript websites" scheme="https://crawlbase.com/blog/tags/scraping-JavaScript-websites/"/>
    
    <category term="proxy scraping API" scheme="https://crawlbase.com/blog/tags/proxy-scraping-API/"/>
    
    <category term="grounded AI responses" scheme="https://crawlbase.com/blog/tags/grounded-AI-responses/"/>
    
    <category term="AI agent architecture" scheme="https://crawlbase.com/blog/tags/AI-agent-architecture/"/>
    
  </entry>
  
  <entry>
    <title>How to Scrape Google AI Mode in 2026</title>
    <link href="https://crawlbase.com/blog/how-to-scrape-google-ai-mode/"/>
    <id>https://crawlbase.com/blog/how-to-scrape-google-ai-mode/</id>
    <published>2026-04-24T01:14:01.000Z</published>
    <updated>2026-04-24T15:02:25.875Z</updated>
    
    <content type="html"><![CDATA[<blockquote><p><strong>Direct Answer</strong>: To scrape Google AI Mode in 2026, you should avoid browser automation and instead treat it as a structured data extraction problem. Build a Google search URL with udm&#x3D;50 parameter (AI Mode), send it to the Crawlbase Crawling API using a regular token with format&#x3D;json, optionally include scraper&#x3D;google-serp, then parse and normalize the response into stable fields like response_text, citations, and links. This approach gives you reliable, machine-readable output without managing headless browsers, proxies, or UI-level parsing.</p></blockquote><span id="more"></span><p>Google’s AI Mode is changing how search results are presented. Instead of returning a list of links, it generates a direct answer supported by multiple sources, blending summaries with citations and related content.</p><p>For developers and SEO teams, this opens up a different kind of dataset. You are not just collecting rankings anymore, but actual answers tied to queries, along with the sources behind them.</p><p>This guide walks you through how to set up a Python pipeline using <a href="https://crawlbase.com/crawling-api-avoid-captchas-blocks">Crawlbase Crawling API</a> to fetch, parse, and organize AI Mode results into JSON that you can store, compare, and plug into analytics or content workflows.</p><p>For a complete, production-ready implementation, see the project repository on ScraperHub: <a href="https://github.com/ScraperHub/google-ai-mode-scraper">ScraperHub&#x2F;how-to-scrape-google-ai-mode-in-2026</a></p>
<h3 id="Jump-to"><a href="#Jump-to" class="headerlink" title="Jump to:"></a>Jump to:</h3><ul><li><a href="#How-Google-AI-Mode-Works-for-Web-Scraping">How Google AI Mode works</a></li><li><a href="#Why-You-Should-Avoid-Scraping-the-Google-AI-Mode-UI">Why not scrape the UI</a></li><li><a href="#How-Crawlbase-Helps-You-Scrape-Google-AI-Mode">Crawlbase setup</a></li><li><a href="#What-Data-to-Extract-from-Google-AI-Mode-Results">What data to extract</a></li><li><a href="#Step-by-Step-Guide-to-Scrape-Google-AI-Mode-in-2026">Step-by-step guide</a></li><li><a href="#How-to-Integrate-Google-AI-Mode-Scraping-Into-Your-App">App integration</a></li><li><a href="#Understanding-the-Google-AI-Mode-JSON-Response-Structure">Response structure</a></li><li><a href="#Common-Issues-When-Web-Scraping-Google-AI-Mode-and-Fixes">Troubleshooting</a></li><li><a href="#Key-Takeaways-for-Scraping-Google-AI-Mode">Key takeaways</a></li><li><a href="#Frequently-Asked-Questions">FAQ</a></li></ul><h2 id="How-Google-AI-Mode-Works-for-Web-Scraping"><a href="#How-Google-AI-Mode-Works-for-Web-Scraping" class="headerlink" title="How Google AI Mode Works for Web Scraping"></a>How Google AI Mode Works for Web Scraping</h2><p>Google’s AI Mode is built for real users, not scrapers. The interface is dynamic, with content loading progressively and changing based on interaction. 
Trying to extract data directly from the UI quickly becomes unreliable.</p><p>For scraping, the more stable approach is to focus on two things: the <strong>URL that triggers AI Mode</strong> and the <strong>data returned behind the scenes</strong>. Instead of dealing with layout changes or timing issues, you work with a predictable request and a structured response.</p><p>In this guide, the sample project builds <a href="https://www.google.com/search?udm=50">Google AI Mode</a> URLs using <code>udm=50</code>, along with standard parameters like <code>q</code>, <code>gl</code>, and <code>hl</code>, and optionally <code>uule</code> for location targeting. The implementation is simple, as shown below.</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br></pre></td><td class="code"><pre><span class="line"><span class="string">&quot;&quot;&quot;Build Google Search URLs for AI Mode (udm=50).&quot;&quot;&quot;</span></span><br><span class="line"></span><br><span class="line"><span 
class="keyword">from</span> __future__ <span class="keyword">import</span> annotations</span><br><span class="line"></span><br><span class="line"><span class="keyword">from</span> urllib.parse <span class="keyword">import</span> quote_plus, urlencode</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">build_google_ai_mode_search_url</span>(<span class="params"></span></span><br><span class="line"><span class="params">    query: <span class="built_in">str</span>,</span></span><br><span class="line"><span class="params">    *,</span></span><br><span class="line"><span class="params">    gl: <span class="built_in">str</span> = <span class="string">&quot;us&quot;</span>,</span></span><br><span class="line"><span class="params">    hl: <span class="built_in">str</span> = <span class="string">&quot;en&quot;</span>,</span></span><br><span class="line"><span class="params">    uule: <span class="built_in">str</span> | <span class="literal">None</span> = <span class="literal">None</span>,</span></span><br><span class="line"><span class="params"></span>) -&gt; <span class="built_in">str</span>:</span><br><span class="line">    <span class="string">&quot;&quot;&quot;</span></span><br><span class="line"><span class="string">    Return a https://www.google.com/search URL that opens AI Mode.</span></span><br><span class="line"><span class="string"></span></span><br><span class="line"><span class="string">    ``uule`` is optional encoded location (see Google&#x27;s uule parameter).</span></span><br><span class="line"><span class="string">    &quot;&quot;&quot;</span></span><br><span class="line">    <span class="keyword">if</span> <span class="keyword">not</span> query <span class="keyword">or</span> <span class="keyword">not</span> query.strip():</span><br><span class="line">        <span class="keyword">raise</span> ValueError(<span class="string">&quot;query must be 
non-empty&quot;</span>)</span><br><span class="line">    params: <span class="built_in">list</span>[<span class="built_in">tuple</span>[<span class="built_in">str</span>, <span class="built_in">str</span>]] = [</span><br><span class="line">        (<span class="string">&quot;udm&quot;</span>, <span class="string">&quot;50&quot;</span>),</span><br><span class="line">        (<span class="string">&quot;q&quot;</span>, query.strip()),</span><br><span class="line">        (<span class="string">&quot;gl&quot;</span>, gl),</span><br><span class="line">        (<span class="string">&quot;hl&quot;</span>, hl),</span><br><span class="line">    ]</span><br><span class="line">    <span class="keyword">if</span> uule:</span><br><span class="line">        params.append((<span class="string">&quot;uule&quot;</span>, uule))</span><br><span class="line">    qs = urlencode(params, quote_via=quote_plus, safe=<span class="string">&quot;&quot;</span>)</span><br><span class="line">    <span class="keyword">return</span> <span class="string">f&quot;https://www.google.com/search?<span class="subst">&#123;qs&#125;</span>&quot;</span></span><br></pre></td></tr></table></figure><p><strong>Source:</strong> <a href="https://github.com/ScraperHub/google-ai-mode-scraper/blob/main/google_ai_mode/google_ai_mode_url.py">google_ai_mode&#x2F;google_ai_mode_url.py</a></p><p>This function acts as the entry point of the pipeline. You pass in a query and get a consistent AI Mode URL in return. 
From there, the rest of the workflow is straightforward: send the request through Crawlbase, then normalize the response into structured data your system can use.</p><h2 id="Why-You-Should-Avoid-Scraping-the-Google-AI-Mode-UI"><a href="#Why-You-Should-Avoid-Scraping-the-Google-AI-Mode-UI" class="headerlink" title="Why You Should Avoid Scraping the Google AI Mode UI"></a>Why You Should Avoid Scraping the Google AI Mode UI</h2><p>You can scrape AI Mode by automating a browser, but it comes with trade-offs.</p><p>Once you go down that route, you have to deal with rendering delays, timing issues, and selectors that break whenever Google updates the interface. On top of that, there is bot detection to manage and the overhead of running and maintaining browser instances at scale.</p><p>It works for small setups, but it becomes fragile and expensive as you grow.</p><p>A JSON-first approach simplifies the entire flow. Instead of reproducing user interactions, you reduce it to:</p><p>request → response → parse</p><p>No browser layer, no UI dependencies, and far fewer points of failure.</p><h2 id="How-Crawlbase-Helps-You-Scrape-Google-AI-Mode"><a href="#How-Crawlbase-Helps-You-Scrape-Google-AI-Mode" class="headerlink" title="How Crawlbase Helps You Scrape Google AI Mode"></a>How Crawlbase Helps You Scrape Google AI Mode</h2><p>Crawlbase handles the data acquisition layer. It is not just forwarding requests. It takes care of fetching the page, dealing with blocking, and returning a structured response you can work with.</p><p>In this setup, the HTTP client stays intentionally simple. You send a GET request to <code>https://api.crawlbase.com/</code> with a few parameters: <code>token</code>, <code>url</code>, and <code>format=json</code>. You can also include <code>scraper=google-serp</code> if you want Crawlbase to apply <a href="https://crawlbase.com/docs/crawling-api/scrapers/">page-specific parsing</a>. 
The sample CLI uses this by default unless you disable it.</p><p>The implementation from the sample project looks like this:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br></pre></td><td class="code"><pre><span class="line"><span class="string">&quot;&quot;&quot;Minimal Crawlbase Crawling API client.&quot;&quot;&quot;</span></span><br><span class="line"><span class="keyword">from</span> __future__ <span class="keyword">import</span> annotations</span><br><span class="line"><span class="keyword">from</span> typing <span class="keyword">import</span> <span class="type">Any</span></span><br><span class="line"><span class="keyword">import</span> requests</span><br><span class="line">CRAWLBASE_API = <span class="string">&quot;https://api.crawlbase.com/&quot;</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> 
<span class="title function_">fetch_crawlbase_json</span>(<span class="params"></span></span><br><span class="line"><span class="params">    target_url: <span class="built_in">str</span>,</span></span><br><span class="line"><span class="params">    *,</span></span><br><span class="line"><span class="params">    token: <span class="built_in">str</span>,</span></span><br><span class="line"><span class="params">    scraper: <span class="built_in">str</span> | <span class="literal">None</span> = <span class="literal">None</span>,</span></span><br><span class="line"><span class="params">    response_format: <span class="built_in">str</span> = <span class="string">&quot;json&quot;</span>,</span></span><br><span class="line"><span class="params">    timeout: <span class="built_in">float</span> = <span class="number">90.0</span>,</span></span><br><span class="line"><span class="params">    extra_params: <span class="built_in">dict</span>[<span class="built_in">str</span>, <span class="type">Any</span>] | <span class="literal">None</span> = <span class="literal">None</span>,</span></span><br><span class="line"><span class="params"></span>) -&gt; <span class="built_in">dict</span>[<span class="built_in">str</span>, <span class="type">Any</span>]:</span><br><span class="line">    <span class="string">&quot;&quot;&quot;</span></span><br><span class="line"><span class="string">    GET Crawling API with ``format=json``.</span></span><br><span class="line"><span class="string">    Returns the parsed top-level JSON (``original_status``, ``pc_status``, ``url``, ``body``, ...).</span></span><br><span class="line"><span class="string">    &quot;&quot;&quot;</span></span><br><span class="line">    params: <span class="built_in">dict</span>[<span class="built_in">str</span>, <span class="type">Any</span>] = &#123;</span><br><span class="line">        <span class="string">&quot;token&quot;</span>: token,</span><br><span class="line">        <span class="string">&quot;url&quot;</span>: 
target_url,</span><br><span class="line">        <span class="string">&quot;format&quot;</span>: response_format,</span><br><span class="line">    &#125;</span><br><span class="line">    <span class="keyword">if</span> scraper <span class="keyword">is</span> <span class="keyword">not</span> <span class="literal">None</span> <span class="keyword">and</span> scraper != <span class="string">&quot;&quot;</span>:</span><br><span class="line">        params[<span class="string">&quot;scraper&quot;</span>] = scraper</span><br><span class="line">    <span class="keyword">if</span> extra_params:</span><br><span class="line">        <span class="keyword">for</span> k, v <span class="keyword">in</span> extra_params.items():</span><br><span class="line">            <span class="keyword">if</span> v <span class="keyword">is</span> <span class="literal">None</span>:</span><br><span class="line">                <span class="keyword">continue</span></span><br><span class="line">            params[k] = <span class="string">&quot;true&quot;</span> <span class="keyword">if</span> v <span class="keyword">is</span> <span class="literal">True</span> <span class="keyword">else</span> (<span class="string">&quot;false&quot;</span> <span class="keyword">if</span> v <span class="keyword">is</span> <span class="literal">False</span> <span class="keyword">else</span> v)</span><br><span class="line">    headers = &#123;<span class="string">&quot;Accept-Encoding&quot;</span>: <span class="string">&quot;gzip, deflate&quot;</span>&#125;</span><br><span class="line">    resp = requests.get(CRAWLBASE_API, params=params, headers=headers, timeout=timeout)</span><br><span class="line">    resp.raise_for_status()</span><br><span class="line">    <span class="keyword">return</span> resp.json()</span><br></pre></td></tr></table></figure><p><strong>Source:</strong> <a 
href="https://github.com/ScraperHub/google-ai-mode-scraper/blob/main/google_ai_mode/crawlbase_client.py">google_ai_mode&#x2F;crawlbase_client.py</a></p><p>At this point, you are no longer dealing with browser state or raw HTML. You receive structured JSON that includes the page content and metadata, which can go straight into your parser.</p><p>One important detail to keep in mind: even when you request an AI Mode URL, the <code>google-serp</code> scraper may return a more traditional SERP-shaped JSON. That is expected. The sample normalizer is designed to handle both formats.</p><p>This is what makes the setup practical. You are not tightly coupled to one response format, and you do not need to constantly chase UI changes.</p><p>At a high level, the pipeline looks like this:</p><img src="/blog/how-to-scrape-google-ai-mode/how-to-scrape-google-ai-mode-pipeline.jpg" class="" title="A visual diagram showing the data pipeline for scraping Google AI Mode: starts with a query, is converted to an AI Mode URL, sent through Crawlbase, normalized into structured JSON, and finally outputted to file or downstream systems"><p>You start with a query, convert it into an AI Mode URL, send it through Crawlbase, then normalize the JSON into structured output. From there, the data can be written to a file, stored, or passed into downstream systems.</p><h2 id="What-Data-to-Extract-from-Google-AI-Mode-Results"><a href="#What-Data-to-Extract-from-Google-AI-Mode-Results" class="headerlink" title="What Data to Extract from Google AI Mode Results"></a>What Data to Extract from Google AI Mode Results</h2><p>Once you have the JSON response, the next step is deciding what data actually matters. You are not trying to capture everything in the payload. 
You want a small set of fields that are stable and useful.</p><p>In this setup, the output is normalized into three core fields:</p><table><thead><tr><th>Data</th><th>Field</th></tr></thead><tbody><tr><td>Summary text</td><td><code>results[0].content.response_text</code></td></tr><tr><td>Citations (URL + snippet)</td><td><code>results[0].content.citations</code></td></tr><tr><td>Reference links</td><td><code>results[0].content.links</code></td></tr></tbody></table><p>These map directly to how AI Mode works. You get a generated answer, a set of sources backing that answer, and a broader set of links related to the query.</p><p>The extraction logic is handled in the normalizer. Instead of relying on fixed keys, it looks for multiple possible fields and falls back when needed. This is important because the response shape can vary depending on how Crawlbase or Google structures the payload.</p><p>Here is the core extraction function:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span 
class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">def</span> <span class="title function_">extract_content_fields</span>(<span class="params">parsed_body: <span class="type">Any</span></span>) -&gt; <span class="built_in">dict</span>[<span class="built_in">str</span>, <span class="type">Any</span>]:</span><br><span class="line">    <span class="string">&quot;&quot;&quot;</span></span><br><span class="line"><span class="string">    From a parsed ``body`` (dict/list/str), extract ``prompt``, ``response_text``,</span></span><br><span class="line"><span class="string">    ``citations``, ``links``, and optional ``parse_status_code``.</span></span><br><span class="line"><span class="string">    &quot;&quot;&quot;</span></span><br><span class="line">    root = _parse_body_field(parsed_body)</span><br><span class="line">    serp = _adapt_crawlbase_google_serp(root) <span class="keyword">if</span> <span class="built_in">isinstance</span>(root, <span class="built_in">dict</span>) <span class="keyword">else</span> <span class="literal">None</span></span><br><span class="line"></span><br><span class="line">    prompt = _deep_find_first_str(root, (<span class="string">&quot;prompt&quot;</span>, <span class="string">&quot;query&quot;</span>, <span class="string">&quot;q&quot;</span>, <span 
class="string">&quot;search_query&quot;</span>))</span><br><span class="line">    response_text = _deep_find_first_str(</span><br><span class="line">        root,</span><br><span class="line">        (</span><br><span class="line">            <span class="string">&quot;response_text&quot;</span>,</span><br><span class="line">            <span class="string">&quot;result_text&quot;</span>,</span><br><span class="line">            <span class="string">&quot;answer&quot;</span>,</span><br><span class="line">            <span class="string">&quot;text&quot;</span>,</span><br><span class="line">            <span class="string">&quot;ai_overview&quot;</span>,</span><br><span class="line">            <span class="string">&quot;snippet&quot;</span>,</span><br><span class="line">        ),</span><br><span class="line">    )</span><br><span class="line">    citations = _deep_find_list_of_linkish(root, (<span class="string">&quot;citations&quot;</span>, <span class="string">&quot;sources&quot;</span>, <span class="string">&quot;references&quot;</span>))</span><br><span class="line">    links = _deep_find_list_of_linkish(root, (<span class="string">&quot;links&quot;</span>, <span class="string">&quot;related_links&quot;</span>, <span class="string">&quot;organic_links&quot;</span>))</span><br><span class="line"></span><br><span class="line">    <span class="keyword">if</span> <span class="keyword">not</span> links <span class="keyword">and</span> <span class="built_in">isinstance</span>(root, <span class="built_in">dict</span>):</span><br><span class="line">        alt: <span class="built_in">list</span>[<span class="built_in">dict</span>[<span class="built_in">str</span>, <span class="built_in">str</span>]] = []</span><br><span class="line">        <span class="keyword">for</span> key <span class="keyword">in</span> (<span class="string">&quot;organic&quot;</span>, <span class="string">&quot;results&quot;</span>, <span class="string">&quot;searchResults&quot;</span>, <span 
class="string">&quot;peopleAlsoAsk&quot;</span>):</span><br><span class="line">            <span class="keyword">if</span> key <span class="keyword">in</span> root:</span><br><span class="line">                _collect_link_dicts(root[key], alt)</span><br><span class="line">        <span class="keyword">if</span> alt:</span><br><span class="line">            links = alt[:<span class="number">200</span>]</span><br><span class="line"></span><br><span class="line">    <span class="keyword">if</span> serp:</span><br><span class="line">        <span class="keyword">if</span> serp.get(<span class="string">&quot;citations&quot;</span>):</span><br><span class="line">            citations = serp[<span class="string">&quot;citations&quot;</span>]</span><br><span class="line">        <span class="keyword">if</span> serp.get(<span class="string">&quot;links&quot;</span>):</span><br><span class="line">            links = serp[<span class="string">&quot;links&quot;</span>]</span><br><span class="line">        <span class="keyword">if</span> <span class="keyword">not</span> response_text <span class="keyword">and</span> serp.get(<span class="string">&quot;response_text&quot;</span>):</span><br><span class="line">            response_text = serp[<span class="string">&quot;response_text&quot;</span>]</span><br><span class="line"></span><br><span class="line">    parse_code = <span class="literal">None</span></span><br><span class="line">    <span class="keyword">if</span> <span class="built_in">isinstance</span>(root, <span class="built_in">dict</span>) <span class="keyword">and</span> <span class="string">&quot;parse_status_code&quot;</span> <span class="keyword">in</span> root:</span><br><span class="line">        parse_code = root.get(<span class="string">&quot;parse_status_code&quot;</span>)</span><br><span class="line"></span><br><span class="line">    <span class="keyword">return</span> &#123;</span><br><span class="line">        <span 
class="string">&quot;prompt&quot;</span>: prompt <span class="keyword">or</span> <span class="string">&quot;&quot;</span>,</span><br><span class="line">        <span class="string">&quot;response_text&quot;</span>: response_text <span class="keyword">or</span> <span class="string">&quot;&quot;</span>,</span><br><span class="line">        <span class="string">&quot;citations&quot;</span>: citations,</span><br><span class="line">        <span class="string">&quot;links&quot;</span>: links,</span><br><span class="line">        <span class="string">&quot;parse_status_code&quot;</span>: parse_code,</span><br><span class="line">    &#125;</span><br></pre></td></tr></table></figure><p><strong>Source:</strong> <a href="https://github.com/ScraperHub/google-ai-mode-scraper/blob/main/google_ai_mode/normalize.py">google_ai_mode&#x2F;normalize.py</a></p><p>This approach keeps your parser flexible. It does not assume a single response format, and it continues to work even when the payload shifts between AI-style responses and more traditional SERP structures. Each normalized field has a distinct role:</p><ul><li><code>response_text</code> is the generated answer you can analyze or display</li><li><code>citations</code> are the sources backing that answer</li><li><code>links</code> give you the broader set of related results</li></ul><p>If you are building dashboards or pipelines, this structure is enough to support most use cases without overcomplicating your schema.</p><h2 id="Step-by-Step-Guide-to-Scrape-Google-AI-Mode-in-2026"><a href="#Step-by-Step-Guide-to-Scrape-Google-AI-Mode-in-2026" class="headerlink" title="Step-by-Step Guide to Scrape Google AI Mode in 2026"></a>Step-by-Step Guide to Scrape Google AI Mode in 2026</h2><p>The fastest way to get started is to run our sample project locally. 
It handles URL generation, Crawlbase requests, and normalization out of the box.</p><p>You will need <a href="https://www.python.org/downloads/">Python</a> 3.10 or higher installed to run this project, along with a <a href="https://crawlbase.com/login">Crawlbase account</a> and a regular <a href="https://crawlbase.com/dashboard/account/docs">Crawling API token</a>.</p><h3 id="Step-1-Clone-the-repository"><a href="#Step-1-Clone-the-repository" class="headerlink" title="Step 1: Clone the repository"></a>Step 1: Clone the repository</h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">git <span class="built_in">clone</span> https://github.com/ScraperHub/google-ai-mode-scraper.git</span><br></pre></td></tr></table></figure><p>This gives you the full working implementation, including the CLI and parsing logic.</p><p>The clone creates a <code>google-ai-mode-scraper</code> directory that holds the project files. 
Move into it:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="built_in">cd</span> google-ai-mode-scraper</span><br></pre></td></tr></table></figure><p>You should now see:</p><ul><li><code>requirements.txt</code></li><li><code>.env.example</code></li><li><code>google_ai_mode/</code></li></ul><p>All remaining steps should be run from this directory.</p><h3 id="Step-2-Set-up-a-virtual-environment"><a href="#Step-2-Set-up-a-virtual-environment" class="headerlink" title="Step 2: Set up a virtual environment"></a>Step 2: Set up a virtual environment</h3><p>Set up an isolated Python environment so dependencies do not conflict with your system packages:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">python -m venv .venv</span><br></pre></td></tr></table></figure><p>Activate it:</p><ul><li>Windows (PowerShell)</li></ul><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">.venv\Scripts\Activate.ps1</span><br></pre></td></tr></table></figure><ul><li>macOS &#x2F; Linux</li></ul><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="built_in">source</span> .venv/bin/activate</span><br></pre></td></tr></table></figure><p>Once activated, your terminal should show <code>(.venv)</code> indicating that Python and pip are scoped to this project.</p><h3 id="Step-3-Install-dependencies"><a href="#Step-3-Install-dependencies" class="headerlink" title="Step 3: Install dependencies"></a>Step 3: Install dependencies</h3><p>Install the required Python packages:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td 
class="code"><pre><span class="line">pip install -r requirements.txt</span><br></pre></td></tr></table></figure><p>This installs everything needed to:</p><ul><li>call the Crawlbase Crawling API</li><li>parse responses</li><li>run the CLI tool</li></ul><h3 id="Step-4-Configure-your-Crawlbase-token"><a href="#Step-4-Configure-your-Crawlbase-token" class="headerlink" title="Step 4: Configure your Crawlbase token"></a>Step 4: Configure your Crawlbase token</h3><p>Copy the environment template:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="built_in">cp</span> .env.example .<span class="built_in">env</span></span><br></pre></td></tr></table></figure><p>Windows:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">copy .env.example .<span class="built_in">env</span></span><br></pre></td></tr></table></figure><p>Open the <code>.env</code> file and set your token:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">CRAWLBASE_REGULAR_TOKEN=your_token_here</span><br></pre></td></tr></table></figure><p>Make sure:</p><ul><li>you are using the <strong>regular token</strong> (Non-Browser-Enabled API Key), not the JavaScript token (Browser Enabled API Key)</li><li>the file is saved in the same directory as <code>requirements.txt</code></li></ul><p>The project uses <a href="https://pypi.org/project/python-dotenv/">python-dotenv</a>, so this value will be loaded automatically when you run the script.</p><h3 id="Step-5-Run-the-scraper"><a href="#Step-5-Run-the-scraper" class="headerlink" title="Step 5: Run the scraper"></a>Step 5: Run the scraper</h3><p>With everything set up, run the CLI:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span 
class="line">1</span><br></pre></td><td class="code"><pre><span class="line">python -m google_ai_mode <span class="string">&quot;your search query&quot;</span></span><br></pre></td></tr></table></figure><p>Example:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">python -m google_ai_mode <span class="string">&quot;best ai tools for developers&quot;</span></span><br></pre></td></tr></table></figure><p>What happens here:</p><ul><li>the query is converted into an AI Mode URL</li><li>Crawlbase fetches the data</li><li>the response is normalized into structured JSON</li></ul><p>The result is printed directly in your terminal.</p><h3 id="Step-6-Save-output-to-a-file"><a href="#Step-6-Save-output-to-a-file" class="headerlink" title="Step 6: Save output to a file"></a>Step 6: Save output to a file</h3><p>If you want to store the result instead of just printing it:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">python -m google_ai_mode <span class="string">&quot;your query&quot;</span> &gt; output.json</span><br></pre></td></tr></table></figure><p>This writes the full JSON response to <code>output.json</code>, which you can inspect or load into other tools.</p><h3 id="Step-7-Run-without-passing-a-query-optional"><a href="#Step-7-Run-without-passing-a-query-optional" class="headerlink" title="Step 7: Run without passing a query (optional)"></a>Step 7: Run without passing a query (optional)</h3><p>You can define a default query in <code>.env</code>:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">GOOGLE_AI_MODE_QUERY=your query here</span><br></pre></td></tr></table></figure><p>Then run:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span 
class="line">1</span><br></pre></td><td class="code"><pre><span class="line">python -m google_ai_mode</span><br></pre></td></tr></table></figure><p>This is useful for testing or scheduled runs where you do not want to pass arguments each time.</p><h3 id="Step-8-Adjust-parameters"><a href="#Step-8-Adjust-parameters" class="headerlink" title="Step 8: Adjust parameters"></a>Step 8: Adjust parameters</h3><p>The CLI exposes a few options to control the request:</p><table><thead><tr><th>Option</th><th>What it does</th></tr></thead><tbody><tr><td><code>--gl</code></td><td>Sets the country (default: us)</td></tr><tr><td><code>--hl</code></td><td>Sets the language (default: en)</td></tr><tr><td><code>--no-scraper</code></td><td>Disables <code>scraper=google-serp</code></td></tr></tbody></table><p>Example:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">python -m google_ai_mode <span class="string">&quot;ai seo tools&quot;</span> --gl uk --hl en</span><br></pre></td></tr></table></figure><p>This lets you test how results change across regions or configurations.</p><p>Visit the README page for the complete instructions: <a href="https://github.com/ScraperHub/google-ai-mode-scraper/blob/main/README.md">https://github.com/ScraperHub/google-ai-mode-scraper</a></p><h2 id="How-to-Integrate-Google-AI-Mode-Scraping-Into-Your-App"><a href="#How-to-Integrate-Google-AI-Mode-Scraping-Into-Your-App" class="headerlink" title="How to Integrate Google AI Mode Scraping Into Your App"></a>How to Integrate Google AI Mode Scraping Into Your App</h2><p>If you prefer integrating this into your own code instead of running the CLI, the project exposes a single high-level function.</p><p>The orchestration logic lives in <a href="https://github.com/ScraperHub/google-ai-mode-scraper/blob/main/google_ai_mode/google_ai_mode_scrape.py">google_ai_mode&#x2F;google_ai_mode_scrape.py</a>, but you only need 
to import one function:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">from</span> google_ai_mode <span class="keyword">import</span> scrape_google_ai_mode</span><br><span class="line">data = scrape_google_ai_mode(<span class="string">&quot;example query&quot;</span>, gl=<span class="string">&quot;us&quot;</span>, hl=<span class="string">&quot;en&quot;</span>)</span><br></pre></td></tr></table></figure><p>This call handles the full pipeline:</p><ul><li>builds the AI Mode URL</li><li>sends the request through Crawlbase</li><li>parses and normalizes the response</li></ul><p>The function automatically loads <code>CRAWLBASE_REGULAR_TOKEN</code> from <code>.env</code> if available, or falls back to your environment variables.</p><p>The result is the same structured JSON used throughout this guide, including <code>response_text</code>, <code>citations</code>, and <code>links</code>, so you can plug it directly into your application without additional parsing.</p>
<h2 id="Understanding-the-Google-AI-Mode-JSON-Response-Structure"><a href="#Understanding-the-Google-AI-Mode-JSON-Response-Structure" class="headerlink" title="Understanding the Google AI Mode JSON Response Structure"></a>Understanding the Google AI Mode JSON Response Structure</h2><p>The response follows a consistent structure, with a <code>results</code> array containing a single item. Most of the data you need lives inside that object.</p><p>Key fields include:</p><ul><li><code>results[0].content</code> → <code>prompt</code>, <code>response_text</code>, <code>citations</code>, <code>links</code>, <code>parse_status_code</code></li><li><code>results[0].url</code> → the AI Mode URL that was requested</li><li><code>results[0].status_code</code>, <code>pc_status</code>, <code>crawl_url</code>, <code>token_used</code>, <code>scraper</code></li><li><code>results[0].raw_body_preview</code> → a short preview of the raw response for debugging</li></ul><p>You will spend most of your time working with <code>response_text</code>, <code>citations</code>, and <code>links</code>.</p><p>If you are building dashboards or pipelines, keep <code>status_code</code> and <code>pc_status</code> alongside your extracted fields. 
This makes it easier to tell whether an issue comes from your parser or from the fetch layer.</p><h2 id="Common-Issues-When-Web-Scraping-Google-AI-Mode-and-Fixes"><a href="#Common-Issues-When-Web-Scraping-Google-AI-Mode-and-Fixes" class="headerlink" title="Common Issues When Web Scraping Google AI Mode (and Fixes)"></a>Common Issues When Web Scraping Google AI Mode (and Fixes)</h2><p>Scraping Google surfaces is not something you set up once and forget. Payloads change, response shapes shift, and your parser needs to be flexible enough to handle that.</p><p>The most common issues you will run into are straightforward:</p><ul><li><strong>Missing token errors</strong><br>Make sure <code>CRAWLBASE_REGULAR_TOKEN</code> is set in <code>.env</code> or your environment, and that you are running the script from the directory that contains <code>.env</code> so the file can be loaded.</li><li><strong>401 or Crawlbase request errors</strong><br>Double-check that you are using a regular Crawling API token and that your account has available credits. Review the Crawlbase <a href="https://crawlbase.com/docs/status-codes/">response codes</a> to interpret specific API errors.</li><li><strong>Incomplete or unexpected output</strong><br>If <code>response_text</code> looks empty or <code>citations</code> seem off, inspect <code>raw_body_preview</code> and the full response body. Both Google and Crawlbase payloads evolve, so your parser may need adjustments in <a href="https://github.com/ScraperHub/google-ai-mode-scraper/blob/main/google_ai_mode/normalize.py">google_ai_mode&#x2F;normalize.py</a>.</li></ul><p>When results suddenly drop or look different, compare recent outputs with previous ones, especially the <code>raw_body_preview</code>. 
That usually tells you whether the issue is in your parsing logic or upstream in the response.</p><h2 id="Key-Takeaways-for-Scraping-Google-AI-Mode"><a href="#Key-Takeaways-for-Scraping-Google-AI-Mode" class="headerlink" title="Key Takeaways for Scraping Google AI Mode"></a>Key Takeaways for Scraping Google AI Mode</h2><p>Google AI Mode shifts scraping from extracting links to working with structured answers, citations, and context. Instead of relying on fragile UI automation, you can build AI Mode URLs, fetch JSON through Crawlbase, and normalize the response into fields you can actually use.</p><p>This approach keeps the pipeline simple and stable. It also makes the data immediately usable for tracking answer changes, analyzing cited sources, and feeding into SEO or internal workflows.</p><p>If you want to try this yourself, start with the sample project and run it locally. <a href="https://crawlbase.com/signup?signup=blog">Create a Crawlbase account</a>, get your regular token or API key, add it to your <code>.env</code>, and run a few queries. 
Within minutes, you will have structured AI Mode data ready to store, compare, and build on.</p><h2 id="Frequently-Asked-Questions"><a href="#Frequently-Asked-Questions" class="headerlink" title="Frequently Asked Questions"></a>Frequently Asked Questions</h2><h3 id="What-is-udm-50"><a href="#What-is-udm-50" class="headerlink" title="What is udm&#x3D;50?"></a>What is udm&#x3D;50?</h3><p><code>udm=50</code> is a Google search parameter that triggers AI Mode in the search results. When included in the query URL, it returns the AI-generated response layer instead of the traditional list of links.</p><p>For example:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">https://www.google.com/search?q=web+scraping&amp;udm=50&amp;gl=us&amp;hl=en</span><br></pre></td></tr></table></figure><p>Opening this URL in a browser loads the AI Mode version of the results for the query “web scraping”.</p><h3 id="Does-Crawlbase-support-Google-AI-Mode"><a href="#Does-Crawlbase-support-Google-AI-Mode" class="headerlink" title="Does Crawlbase support Google AI Mode?"></a>Does Crawlbase support Google AI Mode?</h3><p>Yes. Crawlbase can fetch Google AI Mode results by requesting the AI Mode URL and returning the response as structured JSON. While the <code>google-serp</code> scraper may sometimes return a traditional SERP-shaped payload, the data can still be normalized into fields like <code>response_text</code>, <code>citations</code>, and <code>links</code> using the approach shown in this guide.</p><h3 id="What-token-type-do-I-need"><a href="#What-token-type-do-I-need" class="headerlink" title="What token type do I need?"></a>What token type do I need?</h3><p>You need a <strong>regular Crawling API token</strong> (Non-Browser-Enabled API Key), not a JavaScript token (Browser Enabled API Key). 
The setup in this guide relies on Crawlbase handling the request and returning JSON directly, so there is no need for browser rendering.</p>]]></content>
    
    
    <summary type="html">&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Direct Answer&lt;/strong&gt;: To scrape Google AI Mode in 2026, you should avoid browser automation and instead treat it as a structured data extraction problem. Build a Google search URL with udm&amp;#x3D;50 parameter (AI Mode), send it to the Crawlbase Crawling API using a regular token with format&amp;#x3D;json, optionally include scraper&amp;#x3D;google-serp, then parse and normalize the response into stable fields like response_text, citations, and links. This approach gives you reliable, machine-readable output without managing headless browsers, proxies, or UI-level parsing.&lt;/p&gt;
&lt;/blockquote&gt;</summary>
    
    
    
    <category term="crawling and scraping learning" scheme="https://crawlbase.com/blog/categories/crawling-and-scraping-learning/"/>
    
    
    <category term="scrape google ai mode" scheme="https://crawlbase.com/blog/tags/scrape-google-ai-mode/"/>
    
    <category term="what is google ai mode" scheme="https://crawlbase.com/blog/tags/what-is-google-ai-mode/"/>
    
    <category term="why scrape google ai mode data" scheme="https://crawlbase.com/blog/tags/why-scrape-google-ai-mode-data/"/>
    
  </entry>
  
  <entry>
    <title>AI Proxy for Enterprise: Scale, Security, and Operational Efficiency</title>
    <link href="https://crawlbase.com/blog/ai-proxy-for-enterprises/"/>
    <id>https://crawlbase.com/blog/ai-proxy-for-enterprises/</id>
    <published>2026-04-22T17:37:02.000Z</published>
    <updated>2026-04-24T11:53:22.819Z</updated>
    
    <content type="html"><![CDATA[<blockquote><p><strong>Direct Answer</strong>: Enterprise teams adopt AI proxy infrastructure for three reasons: sustaining high success rates at scale, meeting compliance and security requirements, and cutting the engineering overhead that kills productivity on large, complex target sets.</p></blockquote><span id="more"></span><p>Standard proxy infrastructure wasn’t built for enterprise data collection. It was built for a simpler problem. When you’re running tens of millions of requests daily, feeding pricing models, risk systems, or supply chain dashboards, the ceiling on rule-based proxies becomes a real operational problem, not a theoretical one.</p><p><a href="https://crawlbase.com/blog/what-is-an-ai-proxy/">AI proxy technology</a> replaces static proxy logic with adaptive machine learning. Instead of applying configurations that were accurate last week, it learns what works against each target in real time. That shift matters more at enterprise scale than anywhere else, because the cost of failure compounds fast.</p><div class="callout-banner">  <div class="banner-header">    <img src="/blog/images/flashlight-icon-blue.png" srcset="/blog/images/flashlight-icon-blue.png 1x, /blog/images/flashlight-icon-blue@2x.png 2x" alt="Flashlight Icon"/>    <h2 class="banner-header-label">Try our Enterprise AI Proxy</h2>  </div>  <p class="banner-body">Why use a standard backconnect proxy when you can use AI? 
Bypass blocks and scale your crawler with 1M+ rotating IPs.</p>  <div class="banner-footer">    <a href="https://crawlbase.com/signup?signup=blog-callout-cta" title="Claim 5,000 Free Credits">Claim 5,000 Free Credits</a>    <img src="/blog/images/arrow-right-double-green.png" srcset="/blog/images/arrow-right-double-green.png 1x, /blog/images/arrow-right-double-green@2x.png 2x" alt="Arrow right double Icon"/>  </div></div><h2 id="Why-Enterprise-Data-Collection-Is-a-Different-Problem"><a href="#Why-Enterprise-Data-Collection-Is-a-Different-Problem" class="headerlink" title="Why Enterprise Data Collection Is a Different Problem"></a>Why Enterprise Data Collection Is a Different Problem</h2><p>Most enterprise data teams don’t hit a wall immediately. They build solid scraping infrastructure, stack up large IP pools, get their pipelines running, and things work fine until a target updates its anti-bot stack, they expand to new domains, or request volumes cross the threshold where behavioral detection kicks in.</p><p>At that point, rule-based proxies leave you with bad choices: burn engineering time diagnosing and reconfiguring, accept lower data quality, or reduce collection frequency. None of those options are acceptable when the data feeds competitive pricing decisions, market intelligence, or risk monitoring.</p><p>The architectural issue is straightforward. Rule-based proxies respond to the web as it was, not as it is. Targets update their anti-bot platforms on irregular schedules without warning. When they do, static configurations fail until someone manually fixes them. <a href="https://crawlbase.com/blog/how-ai-proxies-work/">How AI proxies work</a> differently is the core of why enterprise teams are switching.</p><h2 id="Scale-and-Reliability"><a href="#Scale-and-Reliability" class="headerlink" title="Scale and Reliability"></a>Scale and Reliability</h2><p>At enterprise volumes, small differences in success rates have large downstream consequences. 
A 5% drop across 10 million daily requests is 500,000 failed data points: gaps in pricing coverage, incomplete market data, and missing records that degrade model accuracy.</p><p>AI proxy infrastructure maintains high success rates at scale through three mechanisms:</p><ul><li><strong>Per-target model learning:</strong> The system builds a model for each target domain and continuously updates it. It learns which IP types, fingerprint configurations, and session parameters work best against that specific target. As request volume grows, the model gets sharper, which is the opposite of what happens with rule-based systems under load.</li><li><strong>Automatic adaptation when targets change:</strong> When a target updates its anti-bot stack, the AI proxy detects the shift in success rates and adjusts automatically. Enterprise teams don’t need to monitor per-domain performance and manually intervene when something breaks.</li><li><strong>Session management at volume:</strong> High-throughput operations run thousands of concurrent sessions. Managing realistic behavioral patterns across all of them simultaneously, without triggering rate limits or session-based detection, requires the kind of coordination that rule-based proxies can’t provide.</li></ul><h2 id="Compliance-and-Security"><a href="#Compliance-and-Security" class="headerlink" title="Compliance and Security"></a>Compliance and Security</h2><p>Consumer-grade proxy infrastructure doesn’t address enterprise compliance requirements. Data residency obligations, access controls, audit logging, and contractual sourcing requirements have to be designed into the proxy layer; retrofitting them afterward is expensive and often incomplete.</p><ul><li><strong>Data residency and geo-routing:</strong> Enterprises that need to ensure data is collected and routed through specific regions can enforce that at the proxy layer without giving up adaptive routing performance. 
Compliance constraints and performance optimization aren’t in conflict here.</li><li><strong>Access control and audit trails:</strong> Every request should be traceable: when it was made, from which configuration, against which target, and what the outcome was. Role-based access, API key management, and detailed request logging are table stakes for security teams and compliance auditors.</li><li><strong>Ethical collection practices:</strong> Legal and compliance teams increasingly require that collection respects robots.txt directives and avoids service disruption. Configurable rate limiting and documented collection policies let procurement and legal sign off on the operation, not just the technology.</li><li><strong>Vendor security posture:</strong> For enterprise procurement, the proxy provider’s own security matters as much as the product’s features. Look for data processing agreements, infrastructure security, and clear data handling policies. These requirements screen out most consumer-grade options before technical evaluation begins.</li></ul><h2 id="Operational-Efficiency"><a href="#Operational-Efficiency" class="headerlink" title="Operational Efficiency"></a>Operational Efficiency</h2><p>The engineering cost of maintaining proxy infrastructure rarely shows up clearly in budget discussions. Per-request cost is visible. The hours spent diagnosing failures, reconfiguring targets, and verifying fixes are not, and they add up.</p><p>With rule-based proxies, operational overhead scales directly with target count and complexity. Fifty target domains mean fifty configurations to maintain. When anti-bot platforms push updates, which they do unpredictably, the workflow is: detect the failure, diagnose the cause, reconfigure, verify. 
Multiply that across a large target set, and it’s a high recurring cost.</p><p><strong>AI proxy infrastructure changes the model in three concrete ways.</strong></p><ul><li><strong>Initial configuration is minimal:</strong> The adaptive layer handles per-target optimization from live request data; there’s no manual tuning required before the system starts learning.</li><li><strong>Adding new targets doesn’t add configuration work:</strong> The same adaptive logic applies to new domains from the first request, so expanding target coverage doesn’t grow the maintenance burden.</li><li><strong>Failures are handled automatically:</strong> Block events trigger classification and response at the infrastructure level. Engineers see outcomes in the data pipeline, not alerts requiring intervention.</li></ul><p>The result is that data engineering capacity goes toward the pipeline and the decisions it supports, not toward keeping the proxy layer alive.</p><h2 id="Enterprise-AI-Proxy-Use-Cases"><a href="#Enterprise-AI-Proxy-Use-Cases" class="headerlink" title="Enterprise AI Proxy: Use Cases"></a>Enterprise AI Proxy: Use Cases</h2><p>AI proxy infrastructure shows up across a range of enterprise data functions. What they share is the combination of volume, target sophistication, and operational requirements that rule-based proxies can’t sustain consistently.</p><ul><li><strong>Competitive intelligence</strong>: Continuous pricing and availability monitoring across multiple markets, hardened targets, and the need for complete data without regular engineering intervention.</li><li><strong>Financial data collection</strong>: Market data, alternative data feeds, and pricing signals from sources that actively restrict access. 
Success rate reliability is non-negotiable for risk and trading applications.</li><li><strong>Supply chain monitoring</strong>: Tracking supplier inventory and pricing across a large, diverse set of sources with wide variation in their defenses.</li><li><strong>Brand and compliance monitoring</strong>: Verifying how products are represented and priced across retail channels, with geographic coverage and session realism that reflects what real users actually see.</li><li><strong>Enterprise market research</strong>: Large-scale collection supporting strategy, product development, and market sizing, without requiring research teams to manage proxy infrastructure themselves.</li></ul><p>For more on specific applications, see the <a href="https://crawlbase.com/blog/ai-proxy-use-cases/">AI proxy use cases</a> breakdown.</p><div class="secondary-cta-banner">  <div class="gradient-bg">    <h3 class="banner-title">Get a <span class="text-underline">Free Smart AI Proxy Trial</span></h3>    <p class="banner-desc">Leverage 5,000 free credits, 140M rotating proxies, and AI to bypass CAPTCHAs and avoid blocks.</p>    <div class="banner-features">      <ul class="features-list">        <li>Unlimited Bandwidth</li>        <li>Custom Geolocalization</li>        <li>100% Network Uptime</li>      </ul>      <a class="banner-btn" href="/signup?signup=blog-smart-cta" title="Get 5,000 Free Credits" onclick="gtag('event', 'smart_cta_click', { 'blog_group': 'smart_proxy', 'blog_slug': 'ai-proxy-for-enterprises', 'cta_type': 'try_smart_proxy', 'cta_position': 'top','cta_version': 'smart_proxy_v2'});">Get 5,000 Free Credits</a>    </div>  </div>  </div><h2 id="What-to-Evaluate-When-Buying-Enterprise-AI-Proxy-Infrastructure"><a href="#What-to-Evaluate-When-Buying-Enterprise-AI-Proxy-Infrastructure" class="headerlink" title="What to Evaluate When Buying Enterprise AI Proxy Infrastructure"></a>What to Evaluate When Buying Enterprise AI Proxy Infrastructure</h2><p>Not all AI proxy providers are 
the same. For enterprise procurement, the evaluation goes beyond headline success rates and IP pool size.</p><ul><li><strong>Adaptive intelligence depth:</strong> Does the system build actual per-target models, or apply generic heuristics dressed up as AI? The difference shows up clearly against hardened targets; generic heuristics fail faster and require more manual intervention.</li><li><strong>Session management capabilities:</strong> Full behavioral session management (cookie continuity, realistic timing, and navigation patterns) is what separates an AI proxy from a smart proxy. Most providers <a href="https://crawlbase.com/blog/smart-proxy-vs-ai-proxy/">haven’t crossed that line yet</a>.</li><li><strong>Geographic coverage and routing precision:</strong> Enterprise use cases often require specific regional coverage. Evaluate both the breadth of geographies available and how precisely routing can be controlled.</li><li><strong>SLA and support depth:</strong> Enterprise operations need defined uptime commitments and technical support that understands proxy infrastructure, not just account management.</li><li><strong>Compliance documentation:</strong> Data processing agreements, security certifications, and audit logging capabilities should be evaluated alongside technical performance, especially for regulated industries.</li></ul><h2 id="Smart-AI-Proxy-is-Made-for-Enterprises"><a href="#Smart-AI-Proxy-is-Made-for-Enterprises" class="headerlink" title="Smart AI Proxy is Made for Enterprises"></a><strong>Smart AI Proxy is Made for Enterprises</strong></h2><p>Modern anti-bot defenses are built to defeat static infrastructure. They adapt. They update. 
And they specifically target the behavioral patterns that rule-based proxy configurations produce at scale.</p><p>Enterprise data operations need infrastructure that adapts at the same pace: learning per-target, adjusting automatically when targets change, and doing so without creating operational overhead that scales with target count. That’s what AI proxy infrastructure is built for, and why it’s become the default for serious enterprise data collection.</p><p><a href="https://crawlbase.com/smart-proxy">Crawlbase Smart AI Proxy</a> is built specifically for enterprise data operations: managed adaptive infrastructure with the reliability, compliance posture, and operational model that enterprise procurement and security teams require. <a href="https://crawlbase.com/signup?signup=blog">Sign up now and get 5,000 free credits</a></p><h2 id="Frequently-Asked-Questions"><a href="#Frequently-Asked-Questions" class="headerlink" title="Frequently Asked Questions"></a>Frequently Asked Questions</h2><h3 id="What’s-the-difference-between-an-AI-proxy-and-an-enterprise-residential-proxy-network"><a href="#What’s-the-difference-between-an-AI-proxy-and-an-enterprise-residential-proxy-network" class="headerlink" title="What’s the difference between an AI proxy and an enterprise residential proxy network?"></a>What’s the difference between an AI proxy and an enterprise residential proxy network?</h3><p>Enterprise residential networks provide large, geo-distributed IP pools, but they operate on static rule-based logic. AI proxies add adaptive fingerprinting, behavioral session management, and per-target model learning on top of the IP layer. 
Against hardened targets, the intelligence layer is what keeps success rates high.</p><h3 id="How-does-AI-proxy-handle-high-concurrency-enterprise-workloads"><a href="#How-does-AI-proxy-handle-high-concurrency-enterprise-workloads" class="headerlink" title="How does AI proxy handle high-concurrency enterprise workloads?"></a>How does AI proxy handle high-concurrency enterprise workloads?</h3><p>AI proxy systems apply per-target optimization at the session level, not just the request level. Managing behavioral realism across thousands of concurrent sessions simultaneously is what prevents behavioral detection from triggering under high-concurrency conditions.</p><h3 id="Can-an-AI-proxy-integrate-with-existing-data-pipelines"><a href="#Can-an-AI-proxy-integrate-with-existing-data-pipelines" class="headerlink" title="Can an AI proxy integrate with existing data pipelines?"></a>Can an AI proxy integrate with existing data pipelines?</h3><p>Yes, the proxy endpoint sits transparently between your scraping framework and the target. Your pipeline sends requests and receives responses. No architectural changes are required.</p><h3 id="What-compliance-certifications-should-enterprise-proxy-providers-have"><a href="#What-compliance-certifications-should-enterprise-proxy-providers-have" class="headerlink" title="What compliance certifications should enterprise proxy providers have?"></a>What compliance certifications should enterprise proxy providers have?</h3><p>At minimum: GDPR-compliant data processing agreements and documented data retention policies. 
Regulated industries may require additional certifications depending on the data types involved.</p><h3 id="Is-an-AI-proxy-better-than-building-proxy-infrastructure-in-house"><a href="#Is-an-AI-proxy-better-than-building-proxy-infrastructure-in-house" class="headerlink" title="Is an AI proxy better than building proxy infrastructure in-house?"></a>Is an AI proxy better than building proxy infrastructure in-house?</h3><p>For most enterprises, a managed AI proxy delivers better performance at lower total cost than in-house development. Building and maintaining adaptive proxy infrastructure requires sustained ML engineering investment and ongoing optimization as the anti-bot landscape shifts, work that managed infrastructure absorbs.</p><h3 id="What-success-rate-should-enterprise-teams-expect-from-an-AI-proxy"><a href="#What-success-rate-should-enterprise-teams-expect-from-an-AI-proxy" class="headerlink" title="What success rate should enterprise teams expect from an AI proxy?"></a>What success rate should enterprise teams expect from an AI proxy?</h3><p>This varies by target sophistication, but well-implemented AI proxy infrastructure consistently outperforms rule-based systems against hardened targets, particularly after the per-target model has accumulated sufficient request data to optimize accurately.</p>]]></content>
    
    
    <summary type="html">&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Direct Answer&lt;/strong&gt;: Enterprise teams adopt AI proxy infrastructure for three reasons: sustaining high success rates at scale, meeting compliance and security requirements, and cutting the engineering overhead that kills productivity on large, complex target sets.&lt;/p&gt;
&lt;/blockquote&gt;</summary>
    
    
    
    <category term="crawling and scraping learning" scheme="https://crawlbase.com/blog/categories/crawling-and-scraping-learning/"/>
    
    
    <category term="ai proxy" scheme="https://crawlbase.com/blog/tags/ai-proxy/"/>
    
    <category term="ai proxy for enterprise" scheme="https://crawlbase.com/blog/tags/ai-proxy-for-enterprise/"/>
    
    <category term="enterprise proxy infrastructure" scheme="https://crawlbase.com/blog/tags/enterprise-proxy-infrastructure/"/>
    
    <category term="ai powered proxy technology" scheme="https://crawlbase.com/blog/tags/ai-powered-proxy-technology/"/>
    
    <category term="enterprise web scraping" scheme="https://crawlbase.com/blog/tags/enterprise-web-scraping/"/>
    
  </entry>
  
  <entry>
    <title>AI Proxy Use Cases (2026 Guide)</title>
    <link href="https://crawlbase.com/blog/ai-proxy-use-cases/"/>
    <id>https://crawlbase.com/blog/ai-proxy-use-cases/</id>
    <published>2026-04-13T17:37:02.000Z</published>
    <updated>2026-04-24T11:53:22.819Z</updated>
    
    <content type="html"><![CDATA[<blockquote><p><strong>Direct answer</strong>: An AI proxy is designed for a specific class of problem: collecting data from websites that actively try to prevent it. <a href="https://crawlbase.com/blog/what-is-an-ai-proxy/">Understanding what an AI proxy is</a> and <a href="https://crawlbase.com/blog/how-ai-proxies-work/">how it works</a> is the foundation.</p></blockquote><p>This post covers where that capability is actually applied: the concrete use cases where <a href="https://crawlbase.com/smart-proxy">AI-powered proxy technology</a> delivers results that rule-based proxies consistently fail to match.</p><h2 id="1-Web-Scraping-and-Large-Scale-Data-Collection"><a href="#1-Web-Scraping-and-Large-Scale-Data-Collection" class="headerlink" title="1. Web Scraping and Large-Scale Data Collection"></a>1. Web Scraping and Large-Scale Data Collection</h2><p>Web scraping is the most common use of <a href="https://crawlbase.com/blog/best-proxy-scraping-api-for-startups/">AI proxy systems</a>. Any process that involves extracting data from websites at scale, such as product catalogs, news feeds, business listings, public records, and social data, faces one main hurdle: the target website doesn’t want to be scraped.</p><p>Modern anti-bot technologies assess more than just IP addresses. They analyze request patterns, track user behavior, and employ their own machine learning to differentiate automated traffic from human traffic. Rule-based proxies can handle IP rotation, but they don’t address fingerprinting or user behavior analysis, which is where many scraping operations struggle.</p><p>AI proxies tackle this by changing request settings in real time. When a fingerprint or session pattern starts triggering blocks, the system notices and adjusts automatically, without needing help from engineers. 
This capability enables high-volume scraping against tough targets without constant manual adjustments.</p><p><strong>Where it matters most:</strong> E-commerce catalogs, real estate listings, job boards, news aggregation, social media data, and any site using <a href="https://crawlbase.com/blog/how-to-bypass-cloudfare-and-avoid-bot-detection/">Cloudflare</a> or <a href="https://www.akamai.com/products/bot-manager">Akamai Bot Manager</a>.</p><h2 id="2-Price-Monitoring"><a href="#2-Price-Monitoring" class="headerlink" title="2. Price Monitoring"></a>2. Price Monitoring</h2><p>Price monitoring involves high-frequency, high-volume requests to some of the most heavily protected websites online. Retail and e-commerce sites have a strong incentive to stop competitors from accessing their pricing data, and they invest heavily in anti-bot measures.</p><p>The challenge goes beyond just getting the first request through. <a href="https://crawlbase.com/blog/how-to-use-web-scraping-for-price-intelligence/">Price monitoring</a> is ongoing: reliable data is needed at regular intervals for thousands of products, from multiple sources, over months or years. Each session must appear authentic, not just once, but consistently over time.</p><p>AI proxies meet this need through effective session management and adaptable fingerprinting. The system maintains realistic session behavior across repeated visits, automatically adjusts to changes in detection logic, and routes requests using IP settings that have shown high success rates against that specific domain.</p><p><strong>Where it matters most:</strong> Retail and e-commerce price intelligence, competitive pricing tools, dynamic pricing engines, and marketplace monitoring on platforms like <a href="https://crawlbase.com/amazon-scraper">Amazon</a>, <a href="https://crawlbase.com/walmart-scraper">Walmart</a>, and major retailer websites.</p><h2 id="3-Ad-Verification"><a href="#3-Ad-Verification" class="headerlink" title="3. 
Ad Verification"></a>3. Ad Verification</h2><p>Ad verification involves viewing ads as a real user would, from specified locations, on specific devices, and in certain browser settings. Advertisers and agencies use it to ensure ads appear in suitable placements, reach the right audiences, and do not display alongside inappropriate content or on fraudulent sites.</p><p>The technical hurdles are considerable. <a href="https://crawlbase.com/blog/scrape-amazon-ppc-ads/">Ad platforms and publishers</a> want to showcase their best content to known auditors, meaning that identifying the verification tool undermines the entire purpose. Effective ad verification requires traffic that looks like real user traffic across every signal the platform evaluates.</p><p>AI proxies provide the location-based routing, realistic browser fingerprints, and human-like session behavior that ad verification needs. Requests seem to come from real users in the target area, on expected devices, with consistent behavioral patterns, making them hard to identify as automated verification traffic.</p><p><strong>Where it matters most:</strong> Display ad verification, programmatic ad auditing, geo-targeted campaign verification, brand safety monitoring, and fraud detection across ad networks and publisher sites.</p><h2 id="4-Market-Research"><a href="#4-Market-Research" class="headerlink" title="4. Market Research"></a>4. Market Research</h2><p><a href="https://crawlbase.com/blog/how-to-automate-ecommerce-product-research/">Market research at scale</a> involves gathering structured data from various sources, competitor sites, review platforms, industry publications, public databases, and social media, and doing so continuously as market conditions change. The variety of sources creates challenges: each target has distinct defenses, content structures, and updating frequencies.</p><p>Manually managing proxy settings across a large and diverse target set is costly. 
Every time a source updates its anti-bot measures, the failure has to be diagnosed and the settings adjusted. For research teams lacking dedicated scraping systems, this becomes a significant ongoing expense.</p><p>AI proxies significantly reduce that burden. The adaptive layer automatically optimizes per-target settings, and the research team receives reliable data from all sources without needing to maintain the proxy configurations. As sources change, the system adjusts without any manual intervention.</p><p><strong>Where it matters most:</strong> Competitive intelligence, brand monitoring, sentiment analysis, industry trend tracking, consumer review aggregation, and any market research process pulling from numerous sources.</p><h2 id="5-Travel-Fare-Aggregation"><a href="#5-Travel-Fare-Aggregation" class="headerlink" title="5. Travel Fare Aggregation"></a>5. Travel Fare Aggregation</h2><p><a href="https://crawlbase.com/blog/how-to-create-aggregator-website/">Travel fare aggregation</a>, collecting real-time pricing data from airlines, hotels, car rental services, and booking sites, is one of the most challenging uses for proxies. Travel websites change prices frequently, protect their data vigorously, and implement complex defenses because fare aggregators pose a known threat to their profits.</p><p>The combination of real-time requirements, high request volumes, geo-sensitive pricing, and strong anti-bot systems makes it a scenario where rule-based proxies consistently fail. Success rates drop quickly, and maintaining reliable data feeds requires ongoing engineering work.</p><p>AI proxies excel in this area because their adaptive layer manages the diverse challenges simultaneously. Location-specific routing ensures the proxy requests prices from the correct regional context. Adaptive fingerprinting and session management tackle the user behavior detection that travel sites rely on. 
The continuous feedback loop keeps the system effective even as platforms enhance their defenses.</p><p><strong>Where it matters most:</strong> Flight and hotel price comparison tools, online travel agency data feeds, dynamic fare tracking tools, and <a href="https://www.keydatadashboard.com/en-gb">travel intelligence systems</a>.</p><h2 id="​What-These-Use-Cases-Have-in-Common"><a href="#​What-These-Use-Cases-Have-in-Common" class="headerlink" title="What These Use Cases Have in Common"></a>What These Use Cases Have in Common</h2><p>Across all five cases, the pattern is consistent: the target has strong incentives to block automated access, uses advanced defenses to do so, and updates those defenses frequently. Rule-based proxies cover some situations, but they struggle when targets go beyond IP reputation to behavioral and fingerprint-based detection, and they need ongoing manual maintenance to remain effective.</p><p>AI proxies address the underlying problem: they adapt. <a href="https://crawlbase.com/blog/how-ai-proxies-work/">The technology driving this</a> (adaptive fingerprinting, smart block handling, and automated session management) sustains high success rates across these use cases at scale, without the operational load of manual configuration.</p><h2 id="Conclusion"><a href="#Conclusion" class="headerlink" title="Conclusion"></a>Conclusion</h2><p>AI proxy technology is specifically made for data collection settings where targets actively work to disrupt you. Web scraping, price monitoring, ad verification, market research, and travel fare aggregation share this characteristic and all benefit from the adaptive intelligence that AI proxies provide.</p><p>If your data collection operations rely on dependable access to protected targets at scale, Crawlbase Smart AI Proxy is designed for these specific cases. 
<a href="https://crawlbase.com/signup?signup=blog">Sign up now</a> and get 5,000 free credits.<br>​</p><h2 id="Frequently-Asked-Questions"><a href="#Frequently-Asked-Questions" class="headerlink" title="Frequently Asked Questions"></a>Frequently Asked Questions</h2><h3 id="What-is-the-most-common-use-case-for-AI-proxies"><a href="#What-is-the-most-common-use-case-for-AI-proxies" class="headerlink" title="What is the most common use case for AI proxies?"></a>What is the most common use case for AI proxies?</h3><p>Web scraping and large-scale data collection are the most widespread applications. AI proxies are used whenever data extraction needs to function reliably against targets with strong anti-bot protections, now including most major commercial sites.</p><h3 id="Can-AI-proxies-handle-geo-specific-data-collection"><a href="#Can-AI-proxies-handle-geo-specific-data-collection" class="headerlink" title="Can AI proxies handle geo-specific data collection?"></a>Can AI proxies handle geo-specific data collection?</h3><p>Yes. AI proxies feature adaptive geo-routing that automatically selects IP settings based on the target area. This is crucial for price monitoring and ad verification, where accurate regional data access is required.</p><h3 id="How-are-AI-proxies-different-from-standard-proxies-for-these-cases"><a href="#How-are-AI-proxies-different-from-standard-proxies-for-these-cases" class="headerlink" title="How are AI proxies different from standard proxies for these cases?"></a>How are AI proxies different from standard proxies for these cases?</h3><p>Standard proxies manage IP rotation; they deal with IP-based blocking but not fingerprinting or behavioral analysis. AI proxies adjust across all three areas: IP routing, request fingerprinting, and session behavior. 
For situations involving modern anti-bot measures, this difference determines whether you maintain reliable data access or watch success rates decline over time.</p><h3 id="Do-AI-proxies-work-for-real-time-data-collection-like-live-price-feeds"><a href="#Do-AI-proxies-work-for-real-time-data-collection-like-live-price-feeds" class="headerlink" title="Do AI proxies work for real-time data collection, like live price feeds?"></a>Do AI proxies work for real-time data collection, like live price feeds?</h3><p>Yes. AI proxies are built for high-frequency, continuous request patterns. The adaptive layer controls session behavior and request timing to keep traffic patterns realistic even at large volumes, which is what real-time price monitoring and fare aggregation demand.</p><h3 id="Which-industries-benefit-most-from-AI-proxy-technology"><a href="#Which-industries-benefit-most-from-AI-proxy-technology" class="headerlink" title="Which industries benefit most from AI proxy technology?"></a>Which industries benefit most from AI proxy technology?</h3><p>E-commerce, travel, financial services, advertising, and market research are the primary sectors. Any industry that needs external data for competitive advantage, especially where that data is actively protected, fits well with AI proxy systems.</p>]]></content>
    
    
      
      
    <summary type="html">&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Direct answer&lt;/strong&gt;: An AI proxy is designed for a specific class of problem: collecting data from websites that </summary>
      
    
    
    
    <category term="crawling and scraping learning" scheme="https://crawlbase.com/blog/categories/crawling-and-scraping-learning/"/>
    
    
    <category term="ai proxy" scheme="https://crawlbase.com/blog/tags/ai-proxy/"/>
    
    <category term="ai powered proxy technology" scheme="https://crawlbase.com/blog/tags/ai-powered-proxy-technology/"/>
    
    <category term="ai proxy usecases" scheme="https://crawlbase.com/blog/tags/ai-proxy-usecases/"/>
    
    <category term="ai proxy for scraping" scheme="https://crawlbase.com/blog/tags/ai-proxy-for-scraping/"/>
    
    <category term="ai proxy for price monitoring" scheme="https://crawlbase.com/blog/tags/ai-proxy-for-price-monitoring/"/>
    
    <category term="ai proxy for market research" scheme="https://crawlbase.com/blog/tags/ai-proxy-for-market-research/"/>
    
    <category term="ai proxy travel fare aggregation" scheme="https://crawlbase.com/blog/tags/ai-proxy-travel-fare-aggregation/"/>
    
    <category term="ai proxy for ad verification" scheme="https://crawlbase.com/blog/tags/ai-proxy-for-ad-verification/"/>
    
  </entry>
  
  <entry>
    <title>How to Scrape Google People Also Ask (Full PAA Extraction Guide)</title>
    <link href="https://crawlbase.com/blog/how-to-scrape-google-people-also-ask/"/>
    <id>https://crawlbase.com/blog/how-to-scrape-google-people-also-ask/</id>
    <published>2026-04-13T13:03:00.000Z</published>
    <updated>2026-04-24T11:53:23.351Z</updated>
    
    <content type="html"><![CDATA[<blockquote><p><strong>Direct Answer:</strong> Google’s People Also Ask (PAA) feature is a dynamic SERP box showing expandable question-and-answer pairs related to a search query. Scraping it requires JavaScript rendering, HTML parsing, and structured extraction. Using the Crawlbase Crawling API (a web crawling solution that handles headless browsing, proxy rotation, and anti-bot logic), you can reliably collect PAA questions, answers, and nested expansions, then output clean JSON for SEO analysis, content gap discovery, and topic clustering across different markets.</p></blockquote><span id="more"></span><p>Google’s People Also Ask (PAA) box appears in roughly 40 to 45 percent of Google searches, making it one of the most consistent sources of user intent outside of organic results.</p><p>For SEO practitioners, PAA data is especially valuable because it exposes:</p><ul><li>Real user intent behind a keyword</li><li>Content gaps that competitors have not covered</li><li>FAQ and topic cluster opportunities</li><li>Featured snippet targets</li></ul><p>This guide walks through how to scrape Google People Also Ask programmatically using the <a href="https://crawlbase.com/crawling-api-avoid-captchas-blocks">Crawlbase Crawling API</a>. You’ll extract questions, answers, and nested expansions, then use that data for content gap analysis, FAQ generation, and topic clustering across different markets.</p><p>The full working code is available in the <a href="https://github.com/ScraperHub/how-to-scrape-google-people-also-ask">ScraperHub repository</a>.</p><h3 id="Definition"><a href="#Definition" class="headerlink" title="Definition"></a>Definition</h3><p>PAA expansion tree: When a user clicks a question, Google loads 2-4 additional related questions. This creates a cascading structure. 
Most scraping setups capture only the first 3 to 4 visible items and miss everything beyond that initial layer.</p><div class="callout-banner">  <div class="banner-header">    <img      src="/blog/images/flashlight-icon-blue.png"      srcset="/blog/images/flashlight-icon-blue.png 1x, /blog/images/flashlight-icon-blue@2x.png 2x"      alt="Flashlight Icon"    />    <h2 class="banner-header-label">Scrape Google People Also Ask</h2>  </div>  <p class="banner-body">    In recent benchmarks, Crawlbase maintained consistent response times even as request volume quintupled. Whether    you're running 2 or 10 req/s, we provide the steady performance your data pipeline needs.  </p>  <div class="banner-footer">    <a href="https://crawlbase.com/signup?signup=blog-callout-cta" title="Build a Google PAA Scraper"      >Build a Google PAA Scraper</a    >    <img      src="/blog/images/arrow-right-double-green.png"      srcset="/blog/images/arrow-right-double-green.png 1x, /blog/images/arrow-right-double-green@2x.png 2x"      alt="Arrow right double Icon"    />  </div></div><h2 id="How-Do-You-Scrape-Google’s-People-Also-Ask"><a href="#How-Do-You-Scrape-Google’s-People-Also-Ask" class="headerlink" title="How Do You Scrape Google’s People Also Ask"></a>How Do You Scrape Google’s People Also Ask</h2><p>At a high level, scraping PAA requires rendering the page, not just requesting it.</p><p>A simple HTTP request is not enough because PAA content loads after the page initializes and updates dynamically on interaction.</p><p>To extract it reliably:</p><ol><li>Send a Google search URL with gl and hl parameters to a rendering API</li><li>Wait for JavaScript execution, typically around 2000 ms</li><li>Parse the returned HTML using fallback selectors</li><li>Structure the output into JSON</li></ol><p>If you skip the rendering step, the PAA section will either be incomplete or missing entirely.</p><img 
src="/blog/how-to-scrape-google-people-also-ask/how-to-scrape-google-people-also-ask-workflow.jpg" class="" title="High-level workflow for scraping Google People Also Ask"><h2 id="What-to-Extract-Google’s-PAA-Data-Structure"><a href="#What-to-Extract-Google’s-PAA-Data-Structure" class="headerlink" title="What to Extract: Google’s PAA Data Structure"></a>What to Extract: Google’s PAA Data Structure</h2><p>Once you have the rendered HTML, the next step is structuring the data in a way that is actually usable.</p><p>A complete PAA record typically looks like this:</p><figure class="highlight json"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"><span class="punctuation">&#123;</span></span><br><span class="line">  <span class="attr">&quot;question&quot;</span><span class="punctuation">:</span> <span class="string">&quot;...&quot;</span><span class="punctuation">,</span></span><br><span class="line">  <span class="attr">&quot;answer&quot;</span><span class="punctuation">:</span> <span class="string">&quot;...&quot;</span><span class="punctuation">,</span></span><br><span class="line">  <span class="attr">&quot;source_url&quot;</span><span class="punctuation">:</span> <span class="string">&quot;...&quot;</span><span class="punctuation">,</span></span><br><span class="line">  <span class="attr">&quot;children&quot;</span><span class="punctuation">:</span> <span class="punctuation">[</span><span class="punctuation">]</span></span><br><span class="line"><span class="punctuation">&#125;</span></span><br></pre></td></tr></table></figure><p>Each field serves a specific purpose:</p><ul><li>question: expands keyword coverage and topic discovery</li><li>answer: helps with featured snippet optimization</li><li>source_url: supports competitor 
analysis</li><li>children: captures deeper levels of the expansion tree</li></ul><p>Another way to think about it is that each question becomes a node, and each expansion adds more nodes beneath it.</p><p>Most scrapers stop at the first layer. That leaves a large portion of available data untouched.</p><h2 id="Why-Use-Crawlbase-for-Google’s-PAA-Extraction"><a href="#Why-Use-Crawlbase-for-Google’s-PAA-Extraction" class="headerlink" title="Why Use Crawlbase for Google’s PAA Extraction"></a>Why Use Crawlbase for Google’s PAA Extraction</h2><p>At this point, the main challenge is not parsing. It’s getting reliable, fully rendered HTML from Google.</p><p>Crawlbase simplifies that entire process. Instead of managing headless browsers, proxies, and retry logic, you work with a single API endpoint that handles those layers for you.</p><p>The <a href="https://crawlbase.com/docs/crawling-api/">Crawling API</a> uses one base URL:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">https://api.crawlbase.com</span><br></pre></td></tr></table></figure><p>You only need two required parameters:</p><ul><li><code>token</code></li><li><code>url</code></li></ul><p>For Google SERPs, you should use your <a href="https://crawlbase.com/docs/crawling-api/headless-browsers/">JavaScript token</a> and include page_wait so the PAA section has time to load. 
A timeout of at least 90 seconds is recommended for stability.</p><p>Here is a sample request:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">import</span> requests</span><br><span class="line"></span><br><span class="line"><span class="comment"># Replace with your Crawlbase JS token</span></span><br><span class="line">token = <span class="string">&quot;YOUR_JS_TOKEN&quot;</span></span><br><span class="line"></span><br><span class="line">url = <span class="string">&quot;https://www.google.com/search?q=web+scraping&amp;gl=us&amp;hl=en&quot;</span></span><br><span class="line"></span><br><span class="line">params = &#123;</span><br><span class="line">    <span class="string">&quot;token&quot;</span>: token,</span><br><span class="line">    <span class="string">&quot;url&quot;</span>: url,</span><br><span class="line">    <span class="string">&quot;page_wait&quot;</span>: <span class="number">2000</span></span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">response = requests.get(<span class="string">&quot;https://api.crawlbase.com/&quot;</span>, params=params, timeout=<span class="number">90</span>)</span><br><span class="line">html = response.text</span><br></pre></td></tr></table></figure><p>This single request already returns fully rendered HTML, including the PAA section. 
From there, you can pass the response directly into your parser.</p><p>This replaces an entire stack that would otherwise include browser automation tools, proxy rotation systems, and custom anti-block handling. That simplicity is what makes it practical to scale PAA extraction beyond a handful of queries.</p><h2 id="How-Do-You-Run-a-Complete-Google’s-PAA-Scraper"><a href="#How-Do-You-Run-a-Complete-Google’s-PAA-Scraper" class="headerlink" title="How Do You Run a Complete Google’s PAA Scraper"></a>How Do You Run a Complete Google’s PAA Scraper</h2><p>Now that the pieces are clear, the fastest way to get started is not to build everything manually, but to use a complete implementation.</p><p>The ScraperHub repository already includes a working pipeline for fetching, parsing, and exporting PAA data. You can clone it and run it locally in a few minutes.</p><h3 id="Step-1-Clone-the-Scraper"><a href="#Step-1-Clone-the-Scraper" class="headerlink" title="Step 1: Clone the Scraper"></a>Step 1: Clone the Scraper</h3><p>Go to the repository: <a href="https://github.com/ScraperHub/how-to-scrape-google-people-also-ask">ScraperHub&#x2F;How-to-scrape-google-PAA</a></p><p>Clone it:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">git <span class="built_in">clone</span> https://github.com/ScraperHub/how-to-scrape-google-people-also-ask.git</span><br><span class="line"><span class="built_in">cd</span> how-to-scrape-google-people-also-ask</span><br></pre></td></tr></table></figure><h3 id="Step-2-Understand-How-the-Scraper-Works"><a href="#Step-2-Understand-How-the-Scraper-Works" class="headerlink" title="Step 2: Understand How the Scraper Works"></a>Step 2: Understand How the Scraper Works</h3><p>Before running it, it helps to know how the pieces fit together.</p><ul><li><a 
href="https://github.com/ScraperHub/how-to-scrape-google-people-also-ask/blob/main/main.py">main.py</a> builds the search URL, runs the pipeline, and writes JSON</li><li><a href="https://github.com/ScraperHub/how-to-scrape-google-people-also-ask/blob/main/config.py">config.py</a> manages tokens, retries, and timeouts</li><li><a href="https://github.com/ScraperHub/how-to-scrape-google-people-also-ask/blob/main/fetcher.py">fetcher.py</a> handles requests to Crawlbase</li><li><a href="https://github.com/ScraperHub/how-to-scrape-google-people-also-ask/blob/main/parser.py">parser.py</a> extracts PAA data using fallback selectors</li></ul><img src="/blog/how-to-scrape-google-people-also-ask/how-to-scrape-google-people-also-ask-how-it-works.jpg" class="" title="An image of how the Google PAA Scraper works, which comprises main.py, parser.py, fetcher.py and config.py"><p>Each file does one job. Together, they form a complete scraping pipeline.</p><h3 id="Step-3-Set-Up-the-Environment"><a href="#Step-3-Set-Up-the-Environment" class="headerlink" title="Step 3: Set Up the Environment"></a>Step 3: Set Up the Environment</h3><p>Make sure the <a href="https://www.python.org/downloads/">latest Python version</a> is installed, then set up your environment:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">python3 -m venv .venv</span><br><span class="line"><span class="built_in">source</span> .venv/bin/activate   <span class="comment"># Windows: .venv\Scripts\activate</span></span><br><span class="line"></span><br><span class="line">pip install -r requirements.txt</span><br></pre></td></tr></table></figure><p>Set your <a href="https://crawlbase.com/dashboard/account/docs">Crawlbase tokens</a>:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span 
class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="built_in">export</span> CRAWLBASE_TOKEN=your_normal_token</span><br><span class="line"><span class="built_in">export</span> CRAWLBASE_JS_TOKEN=your_js_token</span><br></pre></td></tr></table></figure><p>The <strong>JavaScript token</strong> or <strong>Browser Enabled API Key</strong> is required for Google SERPs.</p><h3 id="Step-4-Run-the-Scraper"><a href="#Step-4-Run-the-Scraper" class="headerlink" title="Step 4: Run the Scraper"></a>Step 4: Run the Scraper</h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">python main.py <span class="string">&quot;how to scrape google&quot;</span></span><br></pre></td></tr></table></figure><p>This runs the full flow:</p><ul><li>Builds the Google SERP URL</li><li>Fetches rendered HTML</li><li>Parses PAA questions and answers</li><li>Outputs structured JSON</li></ul><h3 id="Step-5-Customize-Your-Runs"><a href="#Step-5-Customize-Your-Runs" class="headerlink" title="Step 5: Customize Your Runs"></a>Step 5: Customize Your Runs</h3><p>You can adjust parameters directly from the CLI.</p><p>Change country:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">python main.py <span class="string">&quot;content gap analysis&quot;</span> --country uk -o paa_uk.json</span><br></pre></td></tr></table></figure><p>Adjust rendering time:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">python main.py <span class="string">&quot;web scraping best practices&quot;</span> --page-wait 3000</span><br></pre></td></tr></table></figure><p>If results look incomplete, increasing <code>page_wait</code> (value in milliseconds) is usually the 
first fix.</p><h3 id="Step-6-Test-the-Scraper"><a href="#Step-6-Test-the-Scraper" class="headerlink" title="Step 6: Test the Scraper"></a>Step 6: Test the Scraper</h3><p>Run the <a href="https://github.com/ScraperHub/how-to-scrape-google-people-also-ask/blob/main/run_tests.py">test suite</a>:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">python3 run_tests.py</span><br></pre></td></tr></table></figure><p>Or, if you’re using pytest:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">python3 -m pytest tests/ -v</span><br></pre></td></tr></table></figure><p>These tests use saved Google SERP HTML to verify that your parser still extracts questions, answers, and source URLs correctly. It’s a quick way to catch breakages when Google changes its page structure before running large scraping jobs.</p><h2 id="Capturing-Nested-Expansions-in-Google-PAA"><a href="#Capturing-Nested-Expansions-in-Google-PAA" class="headerlink" title="Capturing Nested Expansions in Google PAA"></a>Capturing Nested Expansions in Google PAA</h2><p>Up to this point, you are extracting the initial set of PAA questions. That alone gives you a useful dataset, but it’s still incomplete, as the real value comes from going deeper into the expansion tree.</p><p>When you expand a PAA question, Google dynamically loads additional related questions. Each of those can trigger further expansions, creating a layered structure of queries.</p><p>To capture this behavior, you use the <a href="https://crawlbase.com/docs/crawling-api/parameters/#css-click-selector"><code>css_click_selector</code></a> parameter in the Crawling API. 
This allows you to simulate clicks on PAA elements so the additional questions load before parsing.</p><p>The flow works like this:</p><ul><li>Build the SERP URL with your query and geo parameters</li><li>Fetch the rendered HTML using the Crawling API</li><li>Parse the initial PAA set</li><li>Trigger expansions using <code>css_click_selector</code></li><li>Re-fetch or re-parse the updated DOM</li><li>Output the full dataset</li></ul><p>Each expansion adds another layer to your data. In practice, a single query can grow from 3 to 4 visible questions to 12 to 20 total questions after a few expansion levels.</p><p>This step is optional from an implementation standpoint, but it’s where most of the missing value lives.</p><h2 id="How-Do-You-Compare-Google-PAA-Across-Countries"><a href="#How-Do-You-Compare-Google-PAA-Across-Countries" class="headerlink" title="How Do You Compare Google PAA Across Countries"></a>How Do You Compare Google PAA Across Countries</h2><p>PAA results are not universal. 
They vary by location and language.</p><p>To compare them, run the same query with different gl values:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">queries = [</span><br><span class="line">    build_serp_url(<span class="string">&quot;best running shoes&quot;</span>, <span class="string">&quot;us&quot;</span>),</span><br><span class="line">    build_serp_url(<span class="string">&quot;best running shoes&quot;</span>, <span class="string">&quot;uk&quot;</span>),</span><br><span class="line">    build_serp_url(<span class="string">&quot;best running shoes&quot;</span>, <span class="string">&quot;de&quot;</span>)</span><br><span class="line">]</span><br></pre></td></tr></table></figure><p>Compare:</p><ul><li>Unique questions</li><li>Overlapping topics</li><li>Differences in answers</li></ul><p>This is particularly useful when expanding into new regions or localizing content.</p><h2 id="When-Should-You-Use-the-Enterprise-Crawler"><a href="#When-Should-You-Use-the-Enterprise-Crawler" class="headerlink" title="When Should You Use the Enterprise Crawler?"></a>When Should You Use the Enterprise Crawler?</h2><p>The standard Crawling API works well for small batches where you fetch results immediately. Once you scale to thousands or even millions of queries, it becomes harder to manage.</p><p>The <a href="https://crawlbase.com/anonymous-crawler-asynchronous-scraping">Enterprise Crawler</a> is built for that scale. It runs asynchronously, so you can push URLs in bulk and receive results later via a webhook.</p><p>You don’t need to rewrite your scraper. 
Just update the request in <a href="https://github.com/ScraperHub/how-to-scrape-google-people-also-ask/blob/main/fetcher.py">fetcher.py</a>:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">params[<span class="string">&quot;callback&quot;</span>] = <span class="literal">True</span></span><br><span class="line">params[<span class="string">&quot;crawler&quot;</span>] = <span class="string">&quot;MyPAACrawler&quot;</span></span><br></pre></td></tr></table></figure><p>To receive results, you’ll need a webhook.</p><p>You can either use the <a href="https://crawlbase.com/cloud-storage-for-crawling-and-scraping">Crawlbase Cloud Storage</a> for a quick setup or create your own endpoint if you want full control.</p><p>If you build your own, it just needs to accept POST requests, be publicly accessible, and return a quick 200–204 response. For local testing, tools like <a href="https://ngrok.com/">ngrok</a> work well.</p><p>Use the Enterprise Crawler when you are building large datasets or running recurring jobs. Check the <a href="https://crawlbase.com/docs/crawler/">Crawler documentation</a> to learn more.</p><div class="secondary-cta-banner">  <div class="gradient-bg">    <h3 class="banner-title">Get Started with 1,000 Free Requests</h3>    <p class="banner-desc">      Try our <strong class="text-underline">Crawling API</strong> to automate your data collection — used by 70k+ dev      teams    </p>    <div class="banner-features">      <ul class="features-list">        <li>Handles JS heavy websites</li>        <li>Built-in proxy rotation</li>        <li>No credit card needed</li>      </ul>      <a        class="banner-btn"        href="/signup?signup=blog-smart-cta"        title="Get Started Now!"        
onclick="gtag('event', 'smart_cta_click', { 'blog_group': 'crawling_api', 'blog_slug': 'how-to-scrape-google-people-also-ask', 'cta_type': 'try_crawling_api', 'cta_position': 'bottom','cta_version': 'crawling_api_v2', 'page_location': 'https://crawlbase.com/blog/how-to-scrape-google-people-also-ask/', 'page_title': 'How to Scrape Google People Also Ask (Full PAA Extraction Guide)' });"        >Get Started Now!</a      >    </div>  </div></div><h2 id="Real-World-Applications-of-Google-PAA-Data"><a href="#Real-World-Applications-of-Google-PAA-Data" class="headerlink" title="Real-World Applications of Google PAA Data"></a>Real-World Applications of Google PAA Data</h2><p>PAA data is directly usable in production workflows because it reflects how users actually phrase their questions.</p><p>You can use it to:</p><ul><li><strong>Build FAQ sections</strong> based on real queries instead of guessing what users ask</li><li><strong>Identify content gaps</strong> by spotting questions your competitors have not answered</li><li><strong>Create topic clusters</strong> by grouping related questions into supporting articles</li><li><strong>Improve featured snippet targeting</strong> by aligning your answers with how Google already structures responses</li></ul><p>What makes this valuable is that it removes guesswork from content planning. You are working with questions that already surface in search, not assumptions.</p><p>For example, a SaaS team targeting “web scraping tools” might extract 15 to 20 PAA questions from a single query. 
Instead of treating those as raw data, they can turn each question into a dedicated FAQ section, a supporting blog post, or even a subsection within a larger guide.</p><p>Over time, these questions naturally form a content cluster around the main topic, making it easier to cover the space comprehensively and compete for both rankings and featured snippets.</p><h2 id="Conclusion"><a href="#Conclusion" class="headerlink" title="Conclusion"></a>Conclusion</h2><p>PAA is one of the most underutilized datasets in search. If you only capture the initial questions, you are missing most of the available insights.</p><p>With Crawlbase and the ScraperHub implementation, you can extract the full expansion tree, structure it into usable data, and scale it across different markets without managing browsers, proxies, or infrastructure.</p><p>Try it yourself: <a href="https://crawlbase.com/signup?signup=blog">create a Crawlbase account</a> and use the 1,000 free requests to run the scraper on your own queries. It’s a quick way to see how much additional data you can unlock from a single search.</p><h2 id="Frequently-Asked-Questions"><a href="#Frequently-Asked-Questions" class="headerlink" title="Frequently Asked Questions"></a>Frequently Asked Questions</h2><h3 id="What-is-a-People-Also-Ask-box"><a href="#What-is-a-People-Also-Ask-box" class="headerlink" title="What is a People Also Ask box?"></a>What is a People Also Ask box?</h3><p>A PAA box is a Google SERP feature showing 3-4 expandable question-answer pairs related to the search query. It appears in roughly 43% of searches and expands dynamically when clicked.</p><h3 id="Is-scraping-Google-PAA-legal"><a href="#Is-scraping-Google-PAA-legal" class="headerlink" title="Is scraping Google PAA legal?"></a>Is scraping Google PAA legal?</h3><p>Scraping publicly available search results exists in a legal grey area. We recommend reviewing Google’s Terms of Service before using scraped data in any application. 
Crawlbase provides the tools to crawl and extract publicly accessible data, but how that data is used is ultimately your responsibility.</p><h3 id="How-many-PAA-questions-can-one-query-return"><a href="#How-many-PAA-questions-can-one-query-return" class="headerlink" title="How many PAA questions can one query return?"></a>How many PAA questions can one query return?</h3><p>The initial PAA box shows 3-4 questions. Each expansion adds 2-4 more. A 3-level deep expansion tree typically yields 12-20 total questions per query.</p><h3 id="Why-does-PAA-vary-by-location"><a href="#Why-does-PAA-vary-by-location" class="headerlink" title="Why does PAA vary by location?"></a>Why does PAA vary by location?</h3><p>Google personalises PAA results based on the searcher’s country and language settings. The same query in the US and UK often returns different questions because user behaviour, language patterns, and available content differ by market.</p><h3 id="What-happens-when-Google-changes-its-HTML-selectors"><a href="#What-happens-when-Google-changes-its-HTML-selectors" class="headerlink" title="What happens when Google changes its HTML selectors?"></a>What happens when Google changes its HTML selectors?</h3><p>Your parser will silently return empty results. Use layered fallback selectors, log which selector fires on each run, and set up a monitoring alert if the results count drops below a threshold.</p><h3 id="How-often-does-Google-update-PAA-for-a-given-keyword"><a href="#How-often-does-Google-update-PAA-for-a-given-keyword" class="headerlink" title="How often does Google update PAA for a given keyword?"></a>How often does Google update PAA for a given keyword?</h3><p>PAA sets are relatively stable for informational queries (weeks to months) but can shift within hours for trending or news-adjacent topics. For monitoring use cases, a weekly crawl cadence is sufficient for most evergreen keywords.</p>]]></content>
    
    
    <summary type="html">&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Direct Answer:&lt;/strong&gt; Google’s People Also Ask (PAA) feature is a dynamic SERP box showing expandable question-and-answer pairs related to a search query. Scraping it requires JavaScript rendering, HTML parsing, and structured extraction. Using the Crawlbase Crawling API (a web crawling solution that handles headless browsing, proxy rotation, and anti-bot logic), you can reliably collect PAA questions, answers, and nested expansions, then output clean JSON for SEO analysis, content gap discovery, and topic clustering across different markets.&lt;/p&gt;
&lt;/blockquote&gt;</summary>
    
    
    
    <category term="crawling and scraping learning" scheme="https://crawlbase.com/blog/categories/crawling-and-scraping-learning/"/>
    
    
    <category term="Scrape Google People Also Ask" scheme="https://crawlbase.com/blog/tags/Scrape-Google-People-Also-Ask/"/>
    
    <category term="Scrape People Also Search For" scheme="https://crawlbase.com/blog/tags/Scrape-People-Also-Search-For/"/>
    
    <category term="Google People Also Ask Scraper" scheme="https://crawlbase.com/blog/tags/Google-People-Also-Ask-Scraper/"/>
    
    <category term="Scrape google people also ask answers" scheme="https://crawlbase.com/blog/tags/Scrape-google-people-also-ask-answers/"/>
    
  </entry>
  
  <entry>
    <title>Web Scraping API for Enterprise - What CTOs Look For</title>
    <link href="https://crawlbase.com/blog/web-scraping-api-for-enterprise/"/>
    <id>https://crawlbase.com/blog/web-scraping-api-for-enterprise/</id>
    <published>2026-04-02T15:53:12.000Z</published>
    <updated>2026-04-24T11:53:23.979Z</updated>
    
    <content type="html"><![CDATA[<p>A web scraping API for enterprise should give you three things: predictable scaling, reliable data delivery close to 100% completion, and a system your security and finance teams can approve without friction. Anything less turns into engineering overhead.</p><span id="more"></span><p>Choosing a web scraping API for an enterprise is not about features. It’s a decision that affects delivery speed, data pipeline reliability, and whether your security and finance teams approve deployment. Most vendors claim enterprise readiness, but very few hold up under real production load.</p><p>This guide breaks down what CTOs actually evaluate: scalability, integration complexity, reliability, and compliance. You’ll also see how <a href="https://crawlbase.com/?signup=blog">Crawlbase</a> maps to those requirements with practical examples and real implementation patterns.</p><div class="callout-banner">  <div class="banner-header">    <img      src="/blog/images/flashlight-icon-blue.png"      srcset="/blog/images/flashlight-icon-blue.png 1x, /blog/images/flashlight-icon-blue@2x.png 2x"      alt="Flashlight Icon"    />    <h2 class="banner-header-label">Scale Without the Speed Wobbles</h2>  </div>  <p class="banner-body">    In recent benchmarks, Crawlbase maintained consistent response times even as request volume quintupled. Whether    you're running 2 or 10 req/s, we provide the steady performance your data pipeline needs.  
</p>  <div class="banner-footer">    <a      href="https://crawlbase.com/signup?signup=blog-callout-cta"      title="Build a Scalable Scraper"    >Build a Scalable Scraper</a>    <img      src="/blog/images/arrow-right-double-green.png"      srcset="/blog/images/arrow-right-double-green.png 1x, /blog/images/arrow-right-double-green@2x.png 2x"      alt="Arrow right double Icon"    />  </div></div><h2 id="What-Should-a-CTO-Demand-from-a-Web-Scraping-API-for-Enterprise"><a href="#What-Should-a-CTO-Demand-from-a-Web-Scraping-API-for-Enterprise" class="headerlink" title="What Should a CTO Demand from a Web Scraping API for Enterprise?"></a>What Should a CTO Demand from a Web Scraping API for Enterprise?</h2><p>At the enterprise level, scraping is infrastructure. You are not testing a tool. You are committing to a system that will process millions of requests and feed business-critical pipelines.</p><p>A useful way to evaluate vendors is a requirements checklist:</p><h3 id="TL-DR-Web-Scraping-API-for-Enterprise-Requirements-Checklist"><a href="#TL-DR-Web-Scraping-API-for-Enterprise-Requirements-Checklist" class="headerlink" title="TL;DR: Web Scraping API for Enterprise Requirements Checklist"></a>TL;DR: Web Scraping API for Enterprise Requirements Checklist</h3><table><thead><tr><th>Requirement</th><th>What to Validate</th><th>Why It Matters</th></tr></thead><tbody><tr><td>Scalability</td><td>Requests per second, concurrency limits, scaling model</td><td>Determines if your pipeline grows without re-architecture</td></tr><tr><td>SLA &#x2F; Reliability</td><td>Published uptime, retry expectations</td><td>Prevents silent data loss in production</td></tr><tr><td>Security</td><td>Auth model, HTTPS, IP handling</td><td>Required for internal security reviews</td></tr><tr><td>Compliance</td><td>GDPR, DPA, sub-processors</td><td>Legal approval blocker in most orgs</td></tr><tr><td>Cost Model</td><td>Pay-per-success vs per-attempt</td><td>Impacts forecasting and budget 
control</td></tr></tbody></table><h3 id="With-Crawlbase"><a href="#With-Crawlbase" class="headerlink" title="With Crawlbase:"></a>With Crawlbase:</h3><ul><li>Up to 20 requests per second per token (can be increased for enterprise workloads)</li><li>Scaling handled through higher rate limits and Enterprise Crawler concurrency</li><li>Built-in IP rotation and anti-bot handling</li><li>Pay-per-success billing model</li></ul><p>At sustained usage, this translates to millions of requests per month, depending on workload characteristics.</p><p>More importantly, scaling does not require architectural changes on your side. You do not need to manage multiple tokens, distribute load manually, or redesign your system as demand grows. Capacity is provisioned based on your workload, which keeps both engineering and operational overhead low.</p><h2 id="How-Does-Crawlbase-Handle-Enterprise-Scale-Workloads"><a href="#How-Does-Crawlbase-Handle-Enterprise-Scale-Workloads" class="headerlink" title="How Does Crawlbase Handle Enterprise-Scale Workloads?"></a>How Does Crawlbase Handle Enterprise-Scale Workloads?</h2><p>When you’re operating at enterprise scale, raw throughput is only part of the equation. What actually matters is how the system behaves under pressure. Can it maintain consistent success rates when traffic spikes? Can your team rely on it without constantly dealing with failures?</p><p>This is where most in-house scraping setups start to struggle. As demand increases, teams often end up managing a mix of proxy pools, CAPTCHA solvers, and headless browsers to keep things running. Over time, that setup becomes harder to maintain than the data pipeline itself.</p><p>Crawlbase simplifies this by putting everything behind a single API layer. 
Instead of managing multiple moving parts, your team interacts with one consistent interface while the complexity stays behind the scenes.</p><p>In practical terms, that means:</p><ul><li>No proxy infrastructure to maintain</li><li>No rotation logic to build or debug</li><li>No ongoing effort to keep up with anti-bot changes</li></ul><p>Operational behavior is also clearly defined, which makes a big difference when you’re designing production systems:</p><ul><li>Typical response time: 4 to 10 seconds</li><li>Recommended client timeout: 90 seconds</li><li>Rate limits enforced through HTTP 429 responses</li></ul><p>That consistency is what allows teams to plan properly. You can design retry logic with confidence, estimate throughput more accurately, and forecast costs without relying on guesswork. In most enterprise environments, that level of predictability is more valuable than chasing peak performance.</p><h2 id="How-Fast-Can-a-Junior-Developer-Ship-a-Web-Scraping-Integration"><a href="#How-Fast-Can-a-Junior-Developer-Ship-a-Web-Scraping-Integration" class="headerlink" title="How Fast Can a Junior Developer Ship a Web Scraping Integration?"></a>How Fast Can a Junior Developer Ship a Web Scraping Integration?</h2><p>Integration speed is easy to underestimate, but it directly affects how quickly your team can ship anything that depends on external data.</p><p>In a typical in-house setup, even a simple scraper becomes a multi-step process. You’re not just fetching pages. You’re setting up infrastructure, handling edge cases, and making sure it doesn’t break after a few hours in production.</p><p>That usually looks like:</p><ul><li>1–2 weeks to get proxy infrastructure working reliably</li><li>Additional time spent on retries, CAPTCHA handling, and rendering</li><li>Ongoing debugging when targets change or start blocking requests</li></ul><p>By contrast, Crawlbase reduces that initial effort to something much smaller. 
Once the basics are in place, most teams can get a working integration running in hours or a few days.</p><p>You’re basically going from building the plumbing yourself to calling an API that already handles it. That difference shows up quickly in how fast a junior developer can go from zero to a working data pipeline.</p><h2 id="Example-Working-Setup"><a href="#Example-Working-Setup" class="headerlink" title="Example Working Setup"></a>Example Working Setup</h2><p>Requirements:</p><ul><li>Python or Node.js runtime</li><li>Crawlbase token</li><li>Network access</li></ul><p>Below is a simplified version of the request. You can find the complete, production-ready implementation with retries and logging in the <a href="https://github.com/ScraperHub/web-scraping-api-for-enterprise-what-cto-look-for">ScraperHub GitHub repository</a>.</p><h3 id="Python-Example"><a href="#Python-Example" class="headerlink" title="Python Example"></a>Python Example</h3><p>See full implementation: <a href="https://github.com/ScraperHub/web-scraping-api-for-enterprise-what-cto-look-for/blob/main/fetcher.py">Crawlbase fetcher.py</a></p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">token = token <span class="keyword">or</span> get_token(use_js=use_js)</span><br><span class="line">params = &#123;<span class="string">&quot;token&quot;</span>: token, <span class="string">&quot;url&quot;</span>: url&#125;</span><br><span class="line"><span class="keyword">if</span> page_wait <span class="keyword">is</span> <span class="keyword">not</span> <span class="literal">None</span>:</span><br><span class="line">    params[<span class="string">&quot;page_wait&quot;</span>] = page_wait</span><br><span class="line">resp = requests.get(CRAWLBASE_API_BASE, 
params=params, timeout=timeout)</span><br><span class="line">html = resp.text</span><br></pre></td></tr></table></figure><h3 id="Node-js-Example"><a href="#Node-js-Example" class="headerlink" title="Node.js Example"></a>Node.js Example</h3><p>See full implementation: <a href="https://github.com/ScraperHub/web-scraping-api-for-enterprise-what-cto-look-for/blob/main/fetcher.js">Crawlbase fetcher.js</a></p><figure class="highlight javascript"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">const</span> params = &#123; <span class="attr">token</span>: apiToken, url &#125;;</span><br><span class="line"><span class="keyword">if</span> (pageWait != <span class="literal">null</span>) params.<span class="property">page_wait</span> = pageWait;</span><br><span class="line"><span class="keyword">const</span> response = <span class="keyword">await</span> client.<span class="title function_">get</span>(<span class="string">&#x27;&#x27;</span>, &#123; params, <span class="attr">responseType</span>: <span class="string">&#x27;text&#x27;</span> &#125;);</span><br><span class="line"><span class="keyword">const</span> html = response.<span class="property">data</span>;</span><br></pre></td></tr></table></figure><p>The important part is not the code itself. It’s what’s missing:</p><ul><li>No proxy logic</li><li>No retry system (yet)</li><li>No rendering setup</li></ul><p>That complexity is abstracted behind the API. 
Your team spends time building features, not maintaining scraping infrastructure.</p><h2 id="How-Do-You-Prevent-Data-Loss-in-Production-Pipelines"><a href="#How-Do-You-Prevent-Data-Loss-in-Production-Pipelines" class="headerlink" title="How Do You Prevent Data Loss in Production Pipelines?"></a>How Do You Prevent Data Loss in Production Pipelines?</h2><p>At scale, failures are not edge cases. They are expected behavior.</p><p>You will encounter:</p><ul><li>HTTP 429 (rate limits)</li><li>HTTP 503 (temporary blocks)</li><li>Timeouts</li><li>Connection errors</li></ul><p>The difference between a stable pipeline and a broken one is the retry strategy.</p><h3 id="Recommended-Approach-Exponential-Backoff"><a href="#Recommended-Approach-Exponential-Backoff" class="headerlink" title="Recommended Approach: Exponential Backoff"></a>Recommended Approach: Exponential Backoff</h3><p>The Crawling API does not retry requests automatically. This is intentional: it gives you control over retry behavior.</p><p>The ScraperHub example repository shows a working implementation using <a href="https://tenacity.readthedocs.io/en/latest/">tenacity</a> in Python and <a href="https://www.npmjs.com/package/axios-retry">axios-retry</a> in Node.js. 
Both wrap the same request to the Crawlbase API, but add structured retry logic on top.</p><p>Below is a simplified version of our <a href="https://github.com/ScraperHub/web-scraping-api-for-enterprise-what-cto-look-for/blob/main/fetcher.py">Python implementation</a> example.</p><h3 id="Retry-logic"><a href="#Retry-logic" class="headerlink" title="Retry logic"></a>Retry logic</h3><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">@retry(<span class="params"></span></span></span><br><span class="line"><span class="params"><span class="meta">    stop=stop_after_attempt(<span class="params">RETRY_ATTEMPTS</span>),</span></span></span><br><span class="line"><span class="params"><span class="meta">    wait=wait_exponential(<span class="params"><span class="built_in">min</span>=RETRY_MIN_WAIT_SECONDS, <span class="built_in">max</span>=RETRY_MAX_WAIT_SECONDS</span>),</span></span></span><br><span class="line"><span class="params"><span class="meta">    retry=retry_if_exception_type(<span class="params">(<span class="params">ConnectionError, requests.Timeout</span>)</span>)</span></span></span><br><span class="line"><span class="params"><span class="meta">    | retry_if_exception(<span class="params">_should_retry_http</span>),</span></span></span><br><span class="line"><span class="params"><span class="meta">    reraise=<span class="literal">True</span>,</span></span></span><br><span class="line"><span class="params"><span class="meta"></span>)</span></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">fetch_page</span>(<span class="params">url, *, token=<span 
class="literal">None</span>, page_wait=<span class="literal">None</span>, country=<span class="literal">None</span>, ...</span>):</span><br><span class="line">    <span class="comment"># ... params, requests.get, response validation</span></span><br></pre></td></tr></table></figure><p>This setup retries on:</p><ul><li>HTTP 429 and 503 responses</li><li>ConnectionError and Timeout exceptions</li></ul><p>At the same time, <code>_should_retry_http</code> ensures you don’t retry requests that are unlikely to succeed, such as 401 or 404 responses.</p><p>Without a retry layer like this, data gaps don’t always show up immediately. They tend to surface later in analytics dashboards, reports, or downstream systems, where they are much harder to trace back and fix.</p><h2 id="Does-Multi-Language-SDK-Support-Reduce-Maintenance-Cost"><a href="#Does-Multi-Language-SDK-Support-Reduce-Maintenance-Cost" class="headerlink" title="Does Multi-Language SDK Support Reduce Maintenance Cost?"></a>Does Multi-Language SDK Support Reduce Maintenance Cost?</h2><p>Enterprise systems are rarely built on a single language. Most teams end up with a mix of services, each optimized for a different part of the pipeline.</p><p>You might have:</p><ul><li>Python handling data pipelines</li><li>Node.js powering services or APIs</li><li>Java running core backend systems</li></ul><p>In that kind of environment, consistency matters more than anything else. 
The same <a href="https://crawlbase.com/docs/crawling-api/parameters/">API parameters</a>, like <code>token</code>, <code>url</code>, <code>page_wait</code>, and <code>country</code>, should behave the same no matter which language you’re using.</p><p>Crawlbase addresses this by providing official SDKs across multiple languages, so teams don’t have to reimplement the same HTTP logic in every service.</p><h3 id="Crawlbase-SDK-Coverage"><a href="#Crawlbase-SDK-Coverage" class="headerlink" title="Crawlbase SDK Coverage"></a>Crawlbase SDK Coverage</h3><table><thead><tr><th>Language&#x2F;Framework</th><th>SDK</th><th>GitHub</th></tr></thead><tbody><tr><td>Python</td><td>crawlbase-python</td><td><code>https://github.com/crawlbase/crawlbase-python</code></td></tr><tr><td>Node.js</td><td>crawlbase-node</td><td><code>https://github.com/crawlbase/crawlbase-node</code></td></tr><tr><td>PHP</td><td>crawlbase-php</td><td><code>https://github.com/crawlbase/crawlbase-php</code></td></tr><tr><td>Ruby</td><td>crawlbase-ruby</td><td><code>https://github.com/crawlbase/crawlbase-ruby</code></td></tr><tr><td>Java</td><td>crawlbase-java</td><td><code>https://github.com/crawlbase/crawlbase-java</code></td></tr><tr><td>Scrapy (Python)</td><td>scrapy-crawlbase-middleware</td><td><code>https://github.com/crawlbase/scrapy-crawlbase-middleware</code></td></tr></tbody></table><p>This lets teams choose what fits their stack without changing how the API behaves.</p><ul><li>JVM-based services can use crawlbase-java</li><li>PHP applications like Laravel or WordPress can use crawlbase-php</li><li>Rails apps can use crawlbase-ruby</li><li>Existing Scrapy pipelines can plug in scrapy-crawlbase-middleware</li><li>Node.js projects can use crawlbase-node or stick with a raw axios setup</li></ul><p>The <a href="https://github.com/ScraperHub/web-scraping-api-for-enterprise-what-cto-look-for">ScraperHub example repository</a> takes the raw approach using requests and axios, which gives you full control 
over retries and logging. That’s useful when you want end-to-end visibility.</p><p>On the other hand, if you prefer a thinner integration layer, the official SDKs handle the API contract for you and reduce the amount of boilerplate code you need to maintain.</p><p>This consistency has a direct impact on maintenance:</p><ul><li>You avoid duplicating logic across teams</li><li>Debugging becomes more predictable</li><li>Behavior stays aligned across services</li></ul><p>If each service implements scraping differently, small inconsistencies start to add up. Standardized SDKs remove that problem before it shows up in production.</p><h2 id="How-Do-Security-IP-Rotation-and-Compliance-Work"><a href="#How-Do-Security-IP-Rotation-and-Compliance-Work" class="headerlink" title="How Do Security, IP Rotation, and Compliance Work?"></a>How Do Security, IP Rotation, and Compliance Work?</h2><p>Security reviews are often the biggest blocker for scraping projects.</p><p>Crawlbase simplifies the conversation by reducing the number of components involved.</p><h3 id="Security-Model"><a href="#Security-Model" class="headerlink" title="Security Model"></a>Security Model</h3><ul><li>Token-based authentication</li><li>HTTPS-only communication</li><li>Built-in IP rotation</li></ul><p>This replaces:</p><ul><li>Custom proxy infrastructure</li><li>IP reputation management</li><li>Manual rotation logic</li></ul><p>Instead of presenting multiple moving parts to your security team, you present a single, controlled integration point.</p><h3 id="Compliance-Considerations"><a href="#Compliance-Considerations" class="headerlink" title="Compliance Considerations"></a>Compliance Considerations</h3><p>Crawlbase provides infrastructure. 
You remain responsible for data usage.</p><p>That includes:</p><ul><li>GDPR compliance</li><li>Terms of service adherence</li><li>Internal data policies</li></ul><p>Legal teams will typically ask about:</p><ul><li>Data Processing Agreements (DPA)</li><li>Subprocessors</li><li>Data residency</li></ul><p>These are standard vendor discussions, but they directly influence whether a solution gets approved.</p><div class="secondary-cta-banner">  <div class="gradient-bg">    <h3 class="banner-title">Get Started with 1,000 Free Requests</h3>    <p class="banner-desc">      Try our <strong class="text-underline">Crawling API</strong> to automate your data collection — used by 70k+ dev      teams    </p>    <div class="banner-features">      <ul class="features-list">        <li>Handles JS heavy websites</li>        <li>Built-in proxy rotation</li>        <li>No credit card needed</li>      </ul>      <a        class="banner-btn"        href="/signup?signup=blog-smart-cta"        title="Get Started Now!"        
onclick="gtag('event', 'smart_cta_click', { 'blog_group': 'crawling_api', 'blog_slug': 'web-scraping-api-for-enterprise', 'cta_type': 'try_crawling_api', 'cta_position': 'bottom','cta_version': 'crawling_api_v2', 'page_location': 'https://crawlbase.com/blog/web-scraping-api-for-enterprise/', 'page_title': 'Web Scraping API for Enterprise - What CTOs Look For' });"        >Get Started Now!</a      >    </div>  </div></div><h2 id="Crawling-API-vs-Enterprise-Crawler-Which-One-Fits-Your-Architecture"><a href="#Crawling-API-vs-Enterprise-Crawler-Which-One-Fits-Your-Architecture" class="headerlink" title="Crawling API vs Enterprise Crawler: Which One Fits Your Architecture?"></a>Crawling API vs Enterprise Crawler: Which One Fits Your Architecture?</h2><p>Choosing between synchronous and asynchronous models depends on the workload.</p><table><thead><tr><th>Feature</th><th>Crawling API (Sync)</th><th>Enterprise Crawler (Async)</th></tr></thead><tbody><tr><td>Model</td><td>Request → Response</td><td>Push → Webhook</td></tr><tr><td>Use Case</td><td>Real-time pipelines</td><td>High-volume batch jobs</td></tr><tr><td>Scaling</td><td>Limited by request cycle</td><td>Queue-based scaling</td></tr><tr><td>Setup</td><td>Simple</td><td>Requires webhook</td></tr></tbody></table><h3 id="When-to-Switch"><a href="#When-to-Switch" class="headerlink" title="When to Switch"></a>When to Switch</h3><p>If you are processing 10,000+ URLs per day, synchronous requests can become inefficient.</p><p>The Enterprise Crawler solves this by offloading execution and managing large-scale job distribution.</p><h3 id="How-Does-Enterprise-Crawler-Improve-Success-Rates"><a href="#How-Does-Enterprise-Crawler-Improve-Success-Rates" class="headerlink" title="How Does Enterprise Crawler Improve Success Rates?"></a>How Does Enterprise Crawler Improve Success Rates?</h3><p>Enterprise Crawler handles retries within the Crawlbase infrastructure:</p><ul><li>Automatic retry handling for transient 
failures</li><li>Queue-based execution reduces collisions</li><li>Built-in handling for rate limits and temporary blocks</li></ul><p>This results in near 100% success rates in most jobs, especially for large-scale workloads where retry coordination becomes difficult to manage on the client side.</p><p>This is a key architectural shift:</p><ul><li><a href="https://crawlbase.com/docs/crawling-api/">Crawling API</a> → you manage retries (real-time model)</li><li><a href="https://crawlbase.com/docs/crawler/">Enterprise Crawler</a> → retries are handled for you (async model)</li></ul><p>If your pipeline requires complete datasets with minimal gaps, the async model is usually the safer option.</p><h4 id="Example-Request"><a href="#Example-Request" class="headerlink" title="Example Request"></a>Example Request</h4><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line">params = &#123;</span><br><span class="line">    <span class="string">&quot;token&quot;</span>: token,</span><br><span class="line">    <span class="string">&quot;url&quot;</span>: url,</span><br><span class="line">    <span class="string">&quot;callback&quot;</span>: <span class="literal">True</span>,</span><br><span class="line">    <span class="string">&quot;crawler&quot;</span>: crawler_name,</span><br><span class="line">&#125;</span><br><span class="line"><span class="keyword">if</span> page_wait <span class="keyword">is</span> <span class="keyword">not</span> <span class="literal">None</span>:</span><br><span class="line">    params[<span class="string">&quot;page_wait&quot;</span>] = page_wait</span><br><span class="line">resp = 
requests.get(CRAWLBASE_API_BASE, params=params, timeout=timeout)</span><br><span class="line"><span class="keyword">return</span> resp.json()</span><br></pre></td></tr></table></figure><p>Instead of waiting for each response, you receive a request ID immediately. Results are delivered asynchronously via <a href="https://crawlbase.com/docs/crawler/receiving/">webhook</a>.</p><h2 id="How-Does-Crawlbase-Compare-to-Traditional-Scraping-Setups"><a href="#How-Does-Crawlbase-Compare-to-Traditional-Scraping-Setups" class="headerlink" title="How Does Crawlbase Compare to Traditional Scraping Setups?"></a>How Does Crawlbase Compare to Traditional Scraping Setups?</h2><table><thead><tr><th>Capability</th><th>Crawlbase</th><th>DIY Setup</th></tr></thead><tbody><tr><td>Proxy Management</td><td>Built-in</td><td>Manual</td></tr><tr><td>CAPTCHA Handling</td><td>Automated</td><td>External tools</td></tr><tr><td>Retry Logic</td><td>Client-controlled or infrastructure-handled</td><td>Must build</td></tr><tr><td>Scaling</td><td>Token-based</td><td>Infrastructure scaling</td></tr><tr><td>Maintenance</td><td>Low</td><td>High</td></tr><tr><td>Time to First Success</td><td>Hours</td><td>Weeks</td></tr></tbody></table><p>This is the core trade-off:</p><ul><li>Crawlbase: pay for abstraction</li><li>DIY: pay with engineering time</li></ul><p>Most teams move away from DIY once scraping becomes critical to the business.</p><h2 id="What-Questions-Should-You-Ask-in-a-Vendor-Evaluation-Call"><a href="#What-Questions-Should-You-Ask-in-a-Vendor-Evaluation-Call" class="headerlink" title="What Questions Should You Ask in a Vendor Evaluation Call?"></a>What Questions Should You Ask in a Vendor Evaluation Call?</h2><p>Use this as a practical scorecard:</p><ul><li>Throughput: What are the real limits per token or account?</li><li>Billing: What qualifies as a successful request?</li><li>Reliability: Are failure modes documented?</li><li>Retry Strategy: Who is responsible for retries?</li><li>Compliance: 
Who handles legal requirements like DPA?</li><li>Scaling Model: What options exist for high-volume workloads?</li></ul><p>For Crawlbase specifically:</p><ul><li>How does pay-per-success scale with volume?</li><li>When should you move to Enterprise Crawler?</li></ul><h2 id="What-This-Means-for-Your-Team"><a href="#What-This-Means-for-Your-Team" class="headerlink" title="What This Means for Your Team"></a>What This Means for Your Team</h2><p>A web scraping API for enterprise should reduce operational burden, not shift it onto your engineers.</p><p>If your team is still managing proxies, tuning retries, and maintaining rendering infrastructure, you are effectively running a scraping platform internally. That might work early on, but it does not scale without increasing complexity, cost, and risk.</p><p>At some point, the question shifts from “Can we build this?” to “Should we keep maintaining it?”</p><p>The next step is not another comparison spreadsheet. It’s validating your actual workload against a system that can handle it consistently, without requiring your team to own the underlying infrastructure.</p><p><a href="https://crawlbase.com/signup?signup=blog">Schedule an enterprise demo</a> with Crawlbase and see how it fits your workflow.</p><h2 id="Frequently-Asked-Questions"><a href="#Frequently-Asked-Questions" class="headerlink" title="Frequently Asked Questions"></a>Frequently Asked Questions</h2><h3 id="What-is-a-web-scraping-API-for-enterprise"><a href="#What-is-a-web-scraping-API-for-enterprise" class="headerlink" title="What is a web scraping API for enterprise?"></a>What is a web scraping API for enterprise?</h3><p>An enterprise web scraping API is a managed service that handles large-scale data collection from websites, including proxy rotation, CAPTCHA solving, and anti-bot handling, via a single API, so engineering teams don’t need to build or maintain scraping infrastructure themselves.</p><h3 id="How-does-Crawlbase-handle-enterprise-scale-traffic"><a 
href="#How-does-Crawlbase-handle-enterprise-scale-traffic" class="headerlink" title="How does Crawlbase handle enterprise-scale traffic?"></a>How does Crawlbase handle enterprise-scale traffic?</h3><p>Crawlbase supports up to 20 requests per second per token (extendable for enterprise workloads), built-in IP rotation, and pay-per-success billing. For high-volume jobs (10,000+ URLs&#x2F;day), the async Enterprise Crawler model manages retries and queue-based execution automatically.</p><h3 id="What’s-the-difference-between-the-Crawling-API-and-Enterprise-Crawler"><a href="#What’s-the-difference-between-the-Crawling-API-and-Enterprise-Crawler" class="headerlink" title="What’s the difference between the Crawling API and Enterprise Crawler?"></a>What’s the difference between the Crawling API and Enterprise Crawler?</h3><p>The Crawling API is synchronous; you send a request and wait for a response, suitable for real-time pipelines. The Enterprise Crawler is asynchronous; you submit URLs and receive results via webhook, designed for high-volume batch jobs where near-100% completion rates are required.</p><h3 id="Which-is-the-best-web-scraping-API-for-enterprise"><a href="#Which-is-the-best-web-scraping-API-for-enterprise" class="headerlink" title="Which is the best web scraping API for enterprise?"></a>Which is the best web scraping API for enterprise?</h3><p>The best enterprise web scraping API depends on your team’s priorities. Crawlbase stands out for enterprise use due to its pay-per-success billing model, built-in anti-bot and proxy management, and multi-language SDK support.</p>]]></content>
    
    
    <summary type="html">&lt;p&gt;A web scraping API for enterprise should give you three things: predictable scaling, reliable data delivery close to 100% completion, and a system your security and finance teams can approve without friction. Anything less turns into engineering overhead.&lt;/p&gt;</summary>
    
    
    
    <category term="crawling and scraping learning" scheme="https://crawlbase.com/blog/categories/crawling-and-scraping-learning/"/>
    
    
    <category term="enterprise web scraping" scheme="https://crawlbase.com/blog/tags/enterprise-web-scraping/"/>
    
    <category term="web scraping API for enterprise" scheme="https://crawlbase.com/blog/tags/web-scraping-API-for-enterprise/"/>
    
    <category term="scraping API scalability" scheme="https://crawlbase.com/blog/tags/scraping-API-scalability/"/>
    
    <category term="enterprise data pipeline" scheme="https://crawlbase.com/blog/tags/enterprise-data-pipeline/"/>
    
    <category term="best web scraping API for CTOs" scheme="https://crawlbase.com/blog/tags/best-web-scraping-API-for-CTOs/"/>
    
    <category term="enterprise scraping compliance GDPR" scheme="https://crawlbase.com/blog/tags/enterprise-scraping-compliance-GDPR/"/>
    
    <category term="crawlbase enterprise review" scheme="https://crawlbase.com/blog/tags/crawlbase-enterprise-review/"/>
    
  </entry>
  
  <entry>
    <title>How to Scrape Local Business Listings with Python and Crawlbase</title>
    <link href="https://crawlbase.com/blog/scrape-local-business-listings-guide/"/>
    <id>https://crawlbase.com/blog/scrape-local-business-listings-guide/</id>
    <published>2026-03-29T22:36:36.000Z</published>
    <updated>2026-04-24T11:53:23.787Z</updated>
    
    <content type="html"><![CDATA[<p>Scraping local business listings from Google Maps, Yelp, and Yellow Pages gives sales, marketing, and research teams structured data at a scale that manual collection cannot match. This guide shows you how to <a href="https://crawlbase.com/google-serp-scraper">build a Python pipeline using Crawlbase</a> that retrieves fully rendered listing pages and extracts structured fields, including business name, address, phone number, hours, and ratings, across hundreds of cities in a single run.</p><span id="more"></span><h2 id="TL-DR-Scrape-Local-Business-Listings"><a href="#TL-DR-Scrape-Local-Business-Listings" class="headerlink" title="TL;DR: Scrape Local Business Listings"></a>TL;DR: Scrape Local Business Listings</h2><p>Scraping local business listings (from platforms like Google Maps, Yelp, and Yellow Pages) lets you collect structured data—such as names, addresses, phone numbers, hours, and ratings—at scale for lead generation, CRM enrichment, and market research.</p><p>However, doing this manually or with basic scripts breaks down due to geo-dependent results, JavaScript-rendered content, and anti-bot protections like IP blocking and CAPTCHA.</p><p>The solution is to separate data retrieval from parsing: use a tool like <a href="https://crawlbase.com/signup?signup=blog">Crawlbase</a> to handle rendering, proxy rotation, and geo-targeting, then extract structured data into JSON using your own parser.</p><p>In practice, you build a Python pipeline that:</p><ul><li>Sends location-based queries (e.g., “plumbers in Austin”)</li><li>Retrieves fully rendered pages</li><li>Extracts key business data</li><li>Scales across multiple cities in one run</li></ul><p>The result is a clean, scalable dataset you can directly use for sales, marketing, or analysis—without managing scraping infrastructure yourself. 
For the complete production-ready implementation, refer to the <a href="https://github.com/ScraperHub/how-to-scrape-local-business-listings">project repository on ScraperHub</a>.</p><div class="callout-banner">  <div class="banner-header">    <img src="/blog/images/flashlight-icon-blue.png" srcset="/blog/images/flashlight-icon-blue.png 1x, /blog/images/flashlight-icon-blue@2x.png 2x" alt="Flashlight Icon"/>    <h2 class="banner-header-label">Scale Without the Speed Wobbles</h2>  </div>  <p class="banner-body">In recent benchmarks, Crawlbase maintained consistent response times even as request volume quintupled. Whether you're running 2 or 10 req/s, we provide the steady performance your data pipeline needs.</p>  <div class="banner-footer">    <a href="https://crawlbase.com/signup?signup=blog-callout-cta" title="Build a Scalable Scraper">Build a Scalable Scraper</a>    <img src="/blog/images/arrow-right-double-green.png" srcset="/blog/images/arrow-right-double-green.png 1x, /blog/images/arrow-right-double-green@2x.png 2x" alt="Arrow right double Icon"/>  </div></div><h2 id="What-Is-Local-Business-Listing-Data"><a href="#What-Is-Local-Business-Listing-Data" class="headerlink" title="What Is Local Business Listing Data?"></a>What Is Local Business Listing Data?</h2><p>Local business listing data is the structured information you see when searching for services in a specific area. 
When you look up something like “plumbers in Austin” or “restaurants in Denver,” the results are built from standardized fields that describe each business.</p><p>At a minimum, this typically includes:</p><p>• Business name<br>• Address<br>• Phone number<br>• Opening hours<br>• Rating and reviews</p><p>Most platforms, including <a href="https://crawlbase.com/blog/scrape-data-from-google-maps/">Google Maps</a>, <a href="https://crawlbase.com/blog/scrape-yelp/">Yelp</a>, and <a href="https://crawlbase.com/blog/scrape-yellow-pages/">Yellow Pages</a>, present this information in a consistent format because it needs to be searchable and comparable across locations.</p><p>This data is used in practice for:</p><ul><li><strong>Lead generation:</strong> building targeted prospect lists by city or category</li><li><strong>CRM enrichment:</strong> keeping sales records up to date with verified contact info</li><li><strong>Competitive research:</strong> mapping competitor density and ratings across markets</li></ul><p>The value comes from having that information structured and consistent across many cities.</p><h2 id="Why-Scraping-Local-Listings-Is-Difficult"><a href="#Why-Scraping-Local-Listings-Is-Difficult" class="headerlink" title="Why Scraping Local Listings Is Difficult"></a>Why Scraping Local Listings Is Difficult</h2><p>Collecting this data at scale is not as straightforward as sending requests and parsing HTML. The difficulty comes from how local platforms generate and protect their results.</p><h3 id="Geo-Dependent-Results"><a href="#Geo-Dependent-Results" class="headerlink" title="Geo-Dependent Results"></a>Geo-Dependent Results</h3><p>Local search results are tied directly to location. 
A simple query like “plumbers” will return completely different businesses depending on whether the request comes from Austin, Denver, or Phoenix.</p><p>To get reliable data, you need to control both:</p><p>• the query itself (include the city)<br>• the request location (geo-targeting)</p><p>Without that, results shift unpredictably, and datasets become inconsistent.</p><h3 id="JavaScript-Rendering"><a href="#JavaScript-Rendering" class="headerlink" title="JavaScript Rendering"></a>JavaScript Rendering</h3><p>Most modern listing platforms do not return complete content in the initial response.</p><p>Instead, the server returns a basic HTML structure, and listings are injected later through JavaScript.</p><p>This means a standard HTTP request often misses the actual business data entirely. Without rendering the page like a browser, you end up with incomplete results.</p><h3 id="Blocking-and-Rate-Limits"><a href="#Blocking-and-Rate-Limits" class="headerlink" title="Blocking and Rate Limits"></a>Blocking and Rate Limits</h3><p>Once you scale beyond a few requests, platforms start applying restrictions.</p><p>Common issues include:</p><p>• IP blocking<br>• CAPTCHA challenges<br>• Request throttling</p><p>These protections make large-scale scraping unreliable unless you handle them properly.</p><h2 id="Why-Use-Crawlbase-for-Local-Listing-Scraping"><a href="#Why-Use-Crawlbase-for-Local-Listing-Scraping" class="headerlink" title="Why Use Crawlbase for Local Listing Scraping"></a>Why Use Crawlbase for Local Listing Scraping</h2><p>This is where <a href="https://crawlbase.com/signup?signup=blog">Crawlbase fits in</a>. 
Instead of building and maintaining your own scraping infrastructure, you use it as the retrieval layer.</p><p>Crawlbase supports both Standard and JavaScript-based requests, depending on the type of page you’re scraping:</p><p>• Use the <strong>Normal token</strong> for simple, static pages<br>• Use the <strong>JavaScript token</strong> for dynamic pages like Google Maps and Yelp</p><p>When using the <a href="https://crawlbase.com/docs/crawling-api/headless-browsers/#how-it-works">JavaScript token</a>, the page is rendered the same way a real browser would load it. That means the HTML you receive already includes dynamically loaded listings.</p><p>At the same time, Crawlbase handles:</p><p>• Proxy rotation and IP management<br>• Anti-bot protections<br>• Geo-targeted requests</p><p>This combination solves the core issues outlined earlier.</p><p>The main advantage is consistency. You’re working with:</p><p>• complete, rendered pages when needed<br>• fewer blocked requests<br>• stable responses across locations</p><p>Instead of debugging infrastructure issues, you can focus on extracting and structuring the data.</p><p>This becomes especially important when running the same queries across hundreds of cities, where small inconsistencies quickly affect data quality.</p><h2 id="What-You’re-Building"><a href="#What-You’re-Building" class="headerlink" title="What You’re Building"></a>What You’re Building</h2><p>At a high level, the pipeline starts with a query and a city. For example, “plumbers in Austin.” This input is sent to the <strong><a href="https://crawlbase.com/crawling-api-avoid-captchas-blocks">Crawlbase Crawling API</a></strong>, which retrieves the page on your behalf.</p><p>Instead of returning partial HTML, Crawlbase can load the page like a real browser, so the response already includes all dynamically rendered listings. 
This is important for platforms like Google Maps and Yelp, where most of the content is loaded after the initial request.</p><p>Once the rendered HTML is returned, your parser extracts the fields you care about, such as name, address, phone number, hours, and ratings. Each listing is then converted into a structured format.</p><p>The result is a clean JSON dataset that can be used directly for lead generation, CRM systems, or analysis.</p><img src="/blog/scrape-local-business-listings-guide/scrape-local-business-listings-guide-workflow.jpg" class="" title="Workflow diagram for scraping local business listings, a city and search query are sent to the Crawlbase Crawling API, which returns fully rendered HTML " alt="Workflow diagram for scraping local business listings, a city and search query are sent to the Crawlbase Crawling API, which returns fully rendered HTML"><p>This separation between retrieval and parsing is what makes the system scalable across multiple cities and sources.</p><h2 id="Step-by-Step-Building-the-Scraper"><a href="#Step-by-Step-Building-the-Scraper" class="headerlink" title="Step-by-Step: Building the Scraper"></a>Step-by-Step: Building the Scraper</h2><p>To implement this pipeline, you can use the complete working scraper available in the <a href="https://github.com/ScraperHub/how-to-scrape-local-business-listings">ScraperHub repository</a>.</p><p>The steps below show how to set it up locally and run it end-to-end.</p><h3 id="Step-1-Get-the-Code-from-ScraperHub"><a href="#Step-1-Get-the-Code-from-ScraperHub" class="headerlink" title="Step 1: Get the Code from ScraperHub"></a>Step 1: Get the Code from ScraperHub</h3><p>Start by downloading the project.</p><p>From the ScraperHub repository:</p><p>• <a href="https://docs.github.com/en/repositories/creating-and-managing-repositories/cloning-a-repository">clone the repository</a><br>• or download it as a ZIP and extract it</p><p>Example (using git):</p><figure class="highlight bash"><table><tr><td 
class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">git <span class="built_in">clone</span> https://github.com/ScraperHub/how-to-scrape-local-business-listings.git</span><br><span class="line"><span class="built_in">cd</span> how-to-scrape-local-business-listings</span><br></pre></td></tr></table></figure><p>After cloning, your project structure should look like this:</p><p>• <a href="https://github.com/ScraperHub/how-to-scrape-local-business-listings/blob/main/config.py">config.py</a> → handles tokens, retries, settings<br>• <a href="https://github.com/ScraperHub/how-to-scrape-local-business-listings/blob/main/fetcher.py">fetcher.py</a> → Crawlbase API requests<br>• <a href="https://github.com/ScraperHub/how-to-scrape-local-business-listings/blob/main/url_builder.py">url_builder.py</a> → builds URLs for Google Maps, Yelp, Yellow Pages<br>• <a href="https://github.com/ScraperHub/how-to-scrape-local-business-listings/blob/main/parser.py">parser.py</a> → extracts structured data<br>• <a href="https://github.com/ScraperHub/how-to-scrape-local-business-listings/blob/main/main.py">main.py</a> → entry point (CLI)</p><h3 id="Step-2-Set-Up-Your-Environment"><a href="#Step-2-Set-Up-Your-Environment" class="headerlink" title="Step 2: Set Up Your Environment"></a>Step 2: Set Up Your Environment</h3><p>Create a virtual environment and install dependencies:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">python3 -m venv .venv</span><br><span class="line"><span class="built_in">source</span> .venv/bin/activate</span><br><span class="line"><span class="comment"># Windows: .venv\Scripts\activate</span></span><br><span class="line"></span><br><span class="line">pip install -r 
requirements.txt</span><br></pre></td></tr></table></figure><p>Next, set your Crawlbase tokens as environment variables:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="built_in">export</span> CRAWLBASE_TOKEN=your_normal_token</span><br><span class="line"><span class="built_in">export</span> CRAWLBASE_JS_TOKEN=your_js_token</span><br></pre></td></tr></table></figure><p>Notes:</p><p>• both tokens are required<br>• the JS token is needed for Google Maps and Yelp<br>• token loading is handled in <a href="https://github.com/ScraperHub/how-to-scrape-local-business-listings/blob/main/config.py">config.py</a></p><h3 id="Step-3-Run-the-Scraper-Single-City"><a href="#Step-3-Run-the-Scraper-Single-City" class="headerlink" title="Step 3: Run the Scraper (Single City)"></a>Step 3: Run the Scraper (Single City)</h3><p>Run your first test:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">python main.py <span class="string">&quot;plumbers&quot;</span> --cities <span class="string">&quot;Austin&quot;</span> -o output.json</span><br></pre></td></tr></table></figure><p>What happens here:</p><p>• builds the query URL<br>• fetches the page via Crawlbase<br>• parses listings<br>• writes structured JSON output</p><h3 id="Step-4-Scale-Across-Multiple-Cities"><a href="#Step-4-Scale-Across-Multiple-Cities" class="headerlink" title="Step 4: Scale Across Multiple Cities"></a>Step 4: Scale Across Multiple Cities</h3><p>Scale the same query across cities:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">python main.py <span class="string">&quot;restaurants&quot;</span> --cities <span class="string">&quot;Austin&quot;</span> <span 
class="string">&quot;Denver&quot;</span> <span class="string">&quot;Phoenix&quot;</span> -o listings.json</span><br></pre></td></tr></table></figure><p>This is the core of multi-city scraping.</p><p>Instead of one dataset, you now collect listings across multiple markets in a single run.</p><h3 id="Step-5-Use-Geo-Targeting"><a href="#Step-5-Use-Geo-Targeting" class="headerlink" title="Step 5: Use Geo-Targeting"></a>Step 5: Use Geo-Targeting</h3><p>For international or location-specific accuracy:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">python main.py <span class="string">&quot;electricians&quot;</span> --cities <span class="string">&quot;London&quot;</span> --country UK</span><br></pre></td></tr></table></figure><p>This ensures results match the target market.</p><h3 id="Step-6-Switch-Data-Sources"><a href="#Step-6-Switch-Data-Sources" class="headerlink" title="Step 6: Switch Data Sources"></a>Step 6: Switch Data Sources</h3><p>You can switch between platforms.</p><p>Example using Yelp:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">python main.py <span class="string">&quot;plumbers&quot;</span> --cities <span class="string">&quot;Austin&quot;</span> --<span class="built_in">source</span> yelp</span><br></pre></td></tr></table></figure><p>Supported sources include:</p><p>• Google Maps<br>• Yelp<br>• Yellow Pages</p><p>URL generation is handled in <a href="https://github.com/ScraperHub/how-to-scrape-local-business-listings/blob/main/url_builder.py">url_builder.py</a>.</p><h3 id="Step-7-Scale-With-Enterprise-Crawler"><a href="#Step-7-Scale-With-Enterprise-Crawler" class="headerlink" title="Step 7: Scale With Enterprise Crawler"></a>Step 7: Scale With Enterprise Crawler</h3><p>For larger workloads, such as running hundreds of cities across multiple 
queries, the Crawling API still works for on-demand requests. For large batches of URLs, however, the <a href="https://crawlbase.com/anonymous-crawler-asynchronous-scraping">Enterprise Crawler</a> is a better fit.</p><p>It’s designed for bulk processing using an asynchronous, push-based model.</p><p>The transition is simple. You can reuse your existing setup and add:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">params[<span class="string">&quot;callback&quot;</span>] = <span class="literal">True</span></span><br><span class="line">params[<span class="string">&quot;crawler&quot;</span>] = <span class="string">&quot;LocalListingsCrawler&quot;</span> <span class="comment"># your custom Crawler name</span></span><br></pre></td></tr></table></figure><p>Instead of sending a request and waiting for each response, you simply push your URLs to the Crawler. 
It processes everything in the background, and the results are delivered once they’re ready.</p><p>You can choose how to receive the results:</p><p>• <a href="https://crawlbase.com/cloud-storage-for-crawling-and-scraping">Crawlbase Cloud Storage</a>: Crawlbase stores the data for you, and you can retrieve it later<br>• <a href="https://zapier.com/blog/what-are-webhooks/">Webhook</a>: results are sent directly to your endpoint as soon as they’re ready</p><p>If you go with a webhook, you’ll need to set up an endpoint to receive the data and store it in your system.</p><p>Check the <a href="https://crawlbase.com/docs/crawler">Enterprise Crawler documentation</a> for the complete details.</p><div class="secondary-cta-banner">  <div class="gradient-bg">    <h3 class="banner-title">Get Started with 1,000 Free Requests</h3>    <p class="banner-desc">Try our <strong class="text-underline">Crawling API</strong>  to automate your data collection, used by 70k+ dev teams</p>    <div class="banner-features">      <ul class="features-list">        <li>Handles JS heavy websites</li>        <li>Built-in proxy rotation</li>        <li>No credit card needed</li>      </ul>      <a class="banner-btn" href="/signup?signup=blog-smart-cta" title="Get Started Now!" onclick="gtag('event', 'smart_cta_click', { 'blog_group': 'crawling_api', 'blog_slug': 'scrape-local-business-listings-guide', 'cta_type': 'try_crawling_api', 'cta_position': 'top','cta_version': 'crawling_api_v2', 'page_location': 'https://crawlbase.com/blog/scrape-local-business-listings-guide/', 'page_title': 'How to Scrape Local Business Listings with Python and Crawlbase' });">Get Started Now!</a>    </div>  </div>  </div><h2 id="Conclusion"><a href="#Conclusion" class="headerlink" title="Conclusion"></a>Conclusion</h2><p>Scraping local business listings only becomes useful when it scales across locations. Collecting a few results manually is easy. 
Building a system that can consistently extract structured data across hundreds of cities is where the real value comes in.</p><p>That comes down to three things working together:</p><p>• Geo-targeted queries for accurate local results<br>• Reliable page retrieval for complete data<br>• Structured parsing for usable output</p><p>Crawlbase handles the retrieval layer, so you don’t have to deal with rendering issues, blocking, or proxy management. That lets you focus on building datasets that are actually usable for lead generation, sales, or analysis.</p><p>To get started, <a href="https://crawlbase.com/signup?signup=blog">create a Crawlbase account</a> and get your API tokens. Use your 1,000 free requests to test the full pipeline before scaling.</p><h2 id="Frequently-Asked-Questions"><a href="#Frequently-Asked-Questions" class="headerlink" title="Frequently Asked Questions"></a>Frequently Asked Questions</h2><h3 id="Can-I-scrape-multiple-cities-at-once"><a href="#Can-I-scrape-multiple-cities-at-once" class="headerlink" title="Can I scrape multiple cities at once?"></a>Can I scrape multiple cities at once?</h3><p>Yes. Pass multiple cities in a single command using the <code>--cities</code> flag. 
The scraper runs the same query across each location and combines results into one structured JSON file.</p><h3 id="What-Crawlbase-tokens-do-I-need-to-run-this-project"><a href="#What-Crawlbase-tokens-do-I-need-to-run-this-project" class="headerlink" title="What Crawlbase tokens do I need to run this project?"></a>What Crawlbase tokens do I need to run this project?</h3><p>You need both the normal token and the JavaScript token.</p><p>• <code>CRAWLBASE_TOKEN</code> is used for standard requests<br>• <code>CRAWLBASE_JS_TOKEN</code> is required for JavaScript-heavy pages like Google Maps and Yelp</p><p>Without the JavaScript token, most listing data will not load properly because the content is rendered dynamically.</p><p>You can find your tokens in your <a href="https://crawlbase.com/dashboard/account/docs">Crawlbase account dashboard</a>.</p><h3 id="Can-I-switch-data-sources"><a href="#Can-I-switch-data-sources" class="headerlink" title="Can I switch data sources?"></a>Can I switch data sources?</h3><p>Yes. The scraper supports multiple platforms. You can switch sources using the <code>--source</code> flag, for example:</p><p>• <code>--source yelp</code><br>• (other sources supported in <a href="https://github.com/ScraperHub/how-to-scrape-local-business-listings/blob/main/url_builder.py">url_builder.py</a>)</p><p>Each source has a slightly different HTML structure, but the parser normalizes the output into a consistent format.</p><h3 id="How-do-I-handle-large-scale-scraping-hundreds-of-cities"><a href="#How-do-I-handle-large-scale-scraping-hundreds-of-cities" class="headerlink" title="How do I handle large-scale scraping (hundreds of cities)?"></a>How do I handle large-scale scraping (hundreds of cities)?</h3><p>For larger workloads, you can use the Crawlbase Enterprise Crawler. Instead of making synchronous requests, you push URLs and receive results via webhook or Crawlbase Cloud Storage. This improves throughput and avoids bottlenecks when processing thousands of queries.</p>]]></content>
    
    
    <summary type="html">&lt;p&gt;Scraping local business listings from Google Maps, Yelp, and Yellow Pages gives sales, marketing, and research teams structured data at a scale that manual collection cannot match. This guide shows you how to &lt;a href=&quot;https://crawlbase.com/google-serp-scraper&quot;&gt;build a Python pipeline using Crawlbase&lt;/a&gt; that retrieves fully rendered listing pages and extracts structured fields, including business name, address, phone number, hours, and ratings, across hundreds of cities in a single run.&lt;/p&gt;</summary>
    
    
    
    <category term="crawling and scraping learning" scheme="https://crawlbase.com/blog/categories/crawling-and-scraping-learning/"/>
    
    
    <category term="scrape local business listings" scheme="https://crawlbase.com/blog/tags/scrape-local-business-listings/"/>
    
    <category term="scrape local business leads" scheme="https://crawlbase.com/blog/tags/scrape-local-business-leads/"/>
    
    <category term="scrape business leads" scheme="https://crawlbase.com/blog/tags/scrape-business-leads/"/>
    
  </entry>
  
  <entry>
    <title>How to Scrape Customer Reviews | Full Python Pipeline Guide</title>
    <link href="https://crawlbase.com/blog/how-to-scrape-customer-reviews/"/>
    <id>https://crawlbase.com/blog/how-to-scrape-customer-reviews/</id>
    <published>2026-03-29T22:29:51.000Z</published>
    <updated>2026-04-24T11:53:23.319Z</updated>
    
    <content type="html"><![CDATA[<p>To scrape customer reviews at scale, you need to render JavaScript-heavy pages, systematically collect all review pages or scroll-loaded content, and extract key fields like rating, text, and date into a structured format.</p><span id="more"></span><p>Most review platforms do not return full content through simple requests. Reviews are loaded dynamically, paginated across dozens or hundreds of pages, and protected by rate limits or bot detection. That’s why a browser-based crawling layer, combined with a consistent parsing approach, is required if you want reliable results beyond a few pages.</p><p>This guide demonstrates how to build a production-grade review scraping system that processes thousands of reviews daily with 95%+ extraction accuracy, using browser-based rendering to handle JavaScript-heavy platforms and structured parsing for <a href="https://crawlbase.com/crawling-api-avoid-captchas-blocks">reliable data collection</a>.</p><h2 id="TL-DR-Scrape-Customer-Reviews"><a href="#TL-DR-Scrape-Customer-Reviews" class="headerlink" title="TL;DR Scrape Customer Reviews"></a>TL;DR Scrape Customer Reviews</h2><p>Scraping customer reviews at scale is difficult because most platforms load content dynamically, paginate across hundreds of pages, and actively block automated requests, making basic HTTP scripts unreliable.</p><p>To build a reliable pipeline, you need browser-based rendering for JavaScript content, structured pagination to capture all reviews, and consistent parsing to extract fields like ratings, text, and dates into usable data.</p><p>Managing this infrastructure (headless browsers, proxies, retries, and anti-bot handling) quickly becomes complex. 
Crawlbase simplifies this by handling rendering, blocking, and scaling through a single API, so you can focus on extracting and analyzing review data instead of maintaining scraping systems.</p><div class="callout-banner">  <div class="banner-header">    <img src="/blog/images/flashlight-icon-blue.png" srcset="/blog/images/flashlight-icon-blue.png 1x, /blog/images/flashlight-icon-blue@2x.png 2x" alt="Flashlight Icon"/>    <h2 class="banner-header-label">Scale Without the Speed Wobbles</h2>  </div>  <p class="banner-body">In recent benchmarks, Crawlbase maintained consistent response times even as request volume quintupled. Whether you're running 2 or 10 req/s, we provide the steady performance your data pipeline needs.</p>  <div class="banner-footer">    <a href="https://crawlbase.com/signup?signup=blog-callout-cta" title="Build a Scalable Scraper">Build a Scalable Scraper</a>    <img src="/blog/images/arrow-right-double-green.png" srcset="/blog/images/arrow-right-double-green.png 1x, /blog/images/arrow-right-double-green@2x.png 2x" alt="Arrow right double Icon"/>  </div></div><h2 id="What-Is-Review-Scraping-and-Why-Does-It-Matter"><a href="#What-Is-Review-Scraping-and-Why-Does-It-Matter" class="headerlink" title="What Is Review Scraping and Why Does It Matter?"></a>What Is Review Scraping and Why Does It Matter?</h2><p>Review scraping is the automated process of extracting customer feedback from e-commerce sites, review platforms, and business directories at scale. To scrape customer reviews at scale effectively, you need three core components: JavaScript rendering to handle dynamic content, systematic pagination management to capture all available reviews, and structured data extraction that transforms raw HTML into analyzable datasets.</p><p>Most businesses rely on review data for competitive intelligence, with companies analyzing an average of 2,500-10,000 reviews monthly to inform product decisions. 
A structured review scraping pipeline typically achieves 92-98% data accuracy when properly configured, compared to 60-75% accuracy from basic HTTP requests that miss JavaScript-loaded content.</p><p>The challenge isn’t just collecting reviews but maintaining data quality as you scale. Review platforms actively update their anti-bot defenses, with major sites changing their HTML structure every 45-90 days on average. This means your scraping infrastructure must balance reliability, maintainability, and adaptability.</p><h3 id="What-Do-You-Extract-From-Customer-Reviews"><a href="#What-Do-You-Extract-From-Customer-Reviews" class="headerlink" title="What Do You Extract From Customer Reviews?"></a>What Do You Extract From Customer Reviews?</h3><p>The most actionable review data comes from five core fields that appear consistently across platforms:</p><ul><li><strong>Rating:</strong> This provides quantitative sentiment on a standardized scale. Most platforms use 1-5 stars, though some, like G2, use 1-10 scales that require normalization.</li><li><strong>Review text:</strong> It contains qualitative insights that reveal specific product strengths and pain points. Text fields typically range from 50 to 500 words, with longer reviews correlating to 40% higher helpful vote counts.</li><li><strong>Publication date:</strong> It enables time-series analysis to track sentiment changes after product updates or competitive launches.</li><li><strong>Verified purchase status:</strong> It helps filter out potentially biased reviews. Verified reviews carry 3.2x more weight in consumer purchase decisions according to 2024 trust metrics.</li><li><strong>Helpful votes:</strong> These surface the most informative reviews, with top-voted reviews receiving 8-12x more views than the average review.</li></ul><p>The critical factor is structural consistency. 
When Amazon represents ratings as integers (1-5), Trustpilot uses decimals (4.5), and G2 uses a different scale entirely, inconsistent data structures make cross-platform analysis impossible without normalization.</p><p>A simple unified format works well:</p><figure class="highlight json"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line"><span class="punctuation">&#123;</span></span><br><span class="line">  <span class="attr">&quot;rating&quot;</span><span class="punctuation">:</span> <span class="number">4.5</span><span class="punctuation">,</span></span><br><span class="line">  <span class="attr">&quot;text&quot;</span><span class="punctuation">:</span> <span class="string">&quot;Great product&quot;</span><span class="punctuation">,</span></span><br><span class="line">  <span class="attr">&quot;date&quot;</span><span class="punctuation">:</span> <span class="string">&quot;2025-01-10&quot;</span><span class="punctuation">,</span></span><br><span class="line">  <span class="attr">&quot;verified&quot;</span><span class="punctuation">:</span> <span class="literal"><span class="keyword">true</span></span><span class="punctuation">,</span></span><br><span class="line">  <span class="attr">&quot;helpful_votes&quot;</span><span class="punctuation">:</span> <span class="number">12</span><span class="punctuation">,</span></span><br><span class="line">  <span class="attr">&quot;url&quot;</span><span class="punctuation">:</span> <span class="string">&quot;source_url&quot;</span></span><br><span class="line"><span class="punctuation">&#125;</span></span><br></pre></td></tr></table></figure><p>Once everything fits this structure, you can compare across platforms without extra cleanup.</p><h2 
id="How-Do-You-Handle-JavaScript-Heavy-Review-Pages"><a href="#How-Do-You-Handle-JavaScript-Heavy-Review-Pages" class="headerlink" title="How Do You Handle JavaScript-Heavy Review Pages?"></a>How Do You Handle JavaScript-Heavy Review Pages?</h2><p>Modern review platforms render 80-95% of their content through client-side <a href="https://developer.mozilla.org/en-US/docs/Learn_web_development/Core/Frameworks_libraries">JavaScript frameworks</a> like React, Vue, or Angular. A standard HTTP request to these sites returns incomplete HTML because the actual review content loads after the initial page response.</p><p>Consider what happens with a basic request: the server returns a minimal HTML skeleton containing mostly div placeholders and script tags. The reviews themselves load through subsequent API calls triggered by JavaScript execution, often 2-4 seconds after the initial page load. Some platforms implement infinite scroll, loading additional reviews only when users scroll down, making the content completely inaccessible to traditional scrapers.</p><p>Browser-based rendering solves this by executing JavaScript exactly as a real browser would. The scraper waits for dynamic content to load, captures scroll-triggered elements, and returns fully populated HTML ready for parsing. This approach achieves 95%+ content capture rates compared to 40-60% from direct HTTP requests.</p><p>Building this infrastructure yourself requires managing headless browsers like Puppeteer or Playwright, maintaining proxy rotation to avoid IP blocks (typically 10-50 proxies for moderate-scale scraping), implementing retry logic for failed requests, and handling CAPTCHA challenges that appear on 15-30% of requests at scale.</p><h3 id="Why-Use-Crawlbase-for-Review-Scraping"><a href="#Why-Use-Crawlbase-for-Review-Scraping" class="headerlink" title="Why Use Crawlbase for Review Scraping?"></a>Why Use Crawlbase for Review Scraping?</h3><p>Crawlbase removes that layer entirely. 
Instead of building your own infrastructure, you send a <a href="https://crawlbase.com/docs/crawling-api/headless-browsers/">JavaScript request</a> and get back fully rendered HTML. The page is loaded the same way a real browser would load it.</p><p>What you get:</p><ul><li>JavaScript execution out of the box</li><li>automatic IP rotation</li><li>built-in handling for blocking and rate limits</li><li>consistent HTML output</li></ul><p>There are two ways to implement Crawlbase:</p><ul><li><a href="https://crawlbase.com/docs/crawling-api">Crawling API</a> for on-demand requests</li><li><a href="https://crawlbase.com/docs/crawler">Enterprise Crawler</a> for large batches with <a href="https://en.wikipedia.org/wiki/Webhook">webhook</a> delivery</li></ul><h2 id="What-Are-the-Core-Components-of-a-Review-Scraping-Pipeline"><a href="#What-Are-the-Core-Components-of-a-Review-Scraping-Pipeline" class="headerlink" title="What Are the Core Components of a Review Scraping Pipeline?"></a>What Are the Core Components of a Review Scraping Pipeline?</h2><p>At a high level, review scraping is just a pipeline. Each step takes raw input and makes it more usable.</p><p>Here’s what that looks like in practice:</p><img src="/blog/how-to-scrape-customer-reviews/how-to-scrape-customer-reviews-workflow.jpg" class="" title="Workflow diagram of a review scraping pipeline: review sites connect to the Crawlbase API, then a fetcher layer, platform-specific parsers, JSONL storage, and finally sentiment analysis or dashboards. " alt="Workflow diagram of a review scraping pipeline: review sites connect to the Crawlbase API, then a fetcher layer, platform-specific parsers, JSONL storage, and finally sentiment analysis or dashboards."><ol><li><strong>Review sites</strong><br>These are your data sources. 
<a href="https://www.amazon.com/">Amazon</a>, <a href="https://www.trustpilot.com/">Trustpilot</a>, <a href="https://www.g2.com/">G2</a>, <a href="https://www.yelp.com/">Yelp</a>, <a href="https://customerreviews.google.com/">Google Reviews</a>. Each has its own structure and quirks.</li><li><strong>Crawlbase API</strong><br>This is the retrieval layer. Instead of dealing with proxies, blocks, or JavaScript rendering yourself, the API returns fully rendered HTML for each page.</li><li><strong>Fetcher</strong><br>A small layer in your code that sends requests, handles parameters like page_wait, and manages retries if needed.</li><li><strong>Parsers (extension point)</strong><br>This is where platform-specific logic lives. Trustpilot, Yelp, Amazon, and G2 all need different selectors. The rest of the pipeline stays the same.</li><li><strong>JSONL storage</strong><br>Parsed reviews are stored in a structured format. JSONL works well because it’s simple and easy to stream into other systems.</li><li><strong>Sentiment&#x2F;dashboards</strong><br>Once the data is structured, you can analyze it. Sentiment models, trend dashboards, competitor comparisons. This is where the value actually comes from.</li></ol><p>A couple of practical notes:</p><ul><li>The parser layer is the only part that changes frequently</li><li>Everything else should stay stable once set up</li><li>Adding a new platform usually means adding a new parser, not rewriting the pipeline</li></ul><p>That separation is what makes the system scalable. You’re not rebuilding everything every time a site changes its layout.</p><p>Before implementing this pipeline, you just need a minimal setup. 
Nothing complex, just enough to fetch pages and parse the results.</p><h2 id="Getting-Started-Required-Setup-and-Configuration"><a href="#Getting-Started-Required-Setup-and-Configuration" class="headerlink" title="Getting Started: Required Setup and Configuration"></a>Getting Started: Required Setup and Configuration</h2><p>You’ll need:</p><ul><li><a href="https://www.python.org/downloads/release/python-3143/">Python</a> installed</li><li>A Crawlbase <a href="https://crawlbase.com/dashboard/account/docs">API token</a></li><li>Basic familiarity with Python</li></ul><p>You’ll also need a small set of libraries for fetching pages and parsing HTML:</p><ul><li><a href="https://pypi.org/project/requests/">requests</a></li><li><a href="https://pypi.org/project/beautifulsoup4/">beautifulsoup4</a></li><li><a href="https://docs.python.org/3/library/re.html">re</a></li><li><a href="https://docs.python.org/3/library/json.html">json</a></li><li><a href="https://docs.python.org/3/library/os.html">os</a></li></ul><p>If you’ve done any scraping before, this should feel familiar. 
The main difference here is that Crawlbase handles rendering and blocking, so you can focus on extraction instead of infrastructure.</p><h2 id="Step-1-How-Do-I-Fetch-a-Review-Page-Without-Getting-Blocked"><a href="#Step-1-How-Do-I-Fetch-a-Review-Page-Without-Getting-Blocked" class="headerlink" title="Step 1: How Do I Fetch a Review Page Without Getting Blocked?"></a>Step 1: How Do I Fetch a Review Page Without Getting Blocked?</h2><p>Now that the environment is set, start by pulling the fully rendered HTML.</p><figure class="highlight python"><table><tr><td class="code"><pre><span class="line"><span class="keyword">import</span> os</span><br><span class="line"><span class="keyword">import</span> requests</span><br><span class="line"></span><br><span class="line"><span class="comment"># Use your JavaScript token: rendering options such as page_wait rely on it</span></span><br><span class="line">token = os.environ[<span class="string">&quot;CRAWLBASE_TOKEN&quot;</span>]  <span class="comment"># fails fast if the variable is unset</span></span><br><span class="line">url = <span class="string">&quot;https://www.trustpilot.com/review/example.com&quot;</span></span><br><span class="line">params = &#123;</span><br><span class="line">    <span class="string">&quot;token&quot;</span>: token,</span><br><span class="line">    <span class="string">&quot;url&quot;</span>: url,</span><br><span class="line">    <span class="string">&quot;page_wait&quot;</span>: <span class="number">2000</span>,  <span class="comment"># milliseconds to wait for dynamic content</span></span><br><span class="line">&#125;</span><br><span class="line">response = requests.get(</span><br><span class="line">    <span class="string">&quot;https://api.crawlbase.com/&quot;</span>,</span><br><span class="line">    params=params,</span><br><span class="line">    timeout=<span class="number">90</span></span><br><span class="line">)</span><br><span class="line">response.raise_for_status()  <span class="comment"># surface HTTP errors instead of parsing an error page</span></span><br><span class="line">html = response.text</span><br></pre></td></tr></table></figure><p>The Crawling API keeps this simple. You send a GET request with your account token and the target URL, and it returns the fully rendered page.</p><p><strong>Quick reference:</strong></p><p>Base URL: <a href="https://api.crawlbase.com/">https://api.crawlbase.com</a></p><p>Required parameters: <code>token</code>, <code>url</code></p><p><a href="https://crawlbase.com/docs/crawling-api/parameters/">Optional parameters</a>:</p><ul><li><code>page_wait</code> for dynamic content</li><li><code>scroll=true</code> and <code>scroll_interval</code> for infinite scroll pages</li></ul><p>Recommended timeout: at least 90 seconds</p><p>Typical response time: 4 to 10 seconds</p><p>See the <a href="https://github.com/ScraperHub/how-to-scrape-customer-reviews/blob/main/fetcher.py">complete fetcher script</a> for an implementation with retries and error handling.</p><h2 id="Step-2-What’s-the-Easiest-Way-to-Handle-Pagination"><a href="#Step-2-What’s-the-Easiest-Way-to-Handle-Pagination" class="headerlink" title="Step 2: What’s the Easiest Way to Handle Pagination?"></a>Step 2: What’s the Easiest Way to Handle Pagination?</h2><p>Fetching a single page is rarely enough. 
Most review platforms split content across dozens or even hundreds of pages.</p><p>For example:</p><ul><li>Trustpilot uses <code>?page=2</code>, <code>?page=3</code>, and so on</li><li>Some platforms use <code>offset</code> instead of page numbers</li><li>Others rely on infinite scroll</li></ul><p>If you only request the first page, you’re missing the majority of reviews.</p><p>The typical approach is to generate page URLs and loop through them until you reach your limit or no more reviews are returned.</p><p>Grab the complete <a href="https://github.com/ScraperHub/how-to-scrape-customer-reviews/blob/main/pagination.py">pagination script</a> in ScraperHub. This includes a helper that builds paginated URLs from a base review page and handles query parameters cleanly. It also provides a utility for updating page numbers dynamically during iteration.</p><p>A few practical notes:</p><ul><li>Set a reasonable page limit to avoid unnecessary requests</li><li>Stop when a page returns no reviews</li><li>For infinite scroll pages, use <code>scroll=true</code> instead of pagination</li></ul><p>The goal is to make sure you’re collecting all available reviews, not just the first page.</p><h2 id="Step-3-How-Do-You-Parse-Review-Data-Accurately"><a href="#Step-3-How-Do-You-Parse-Review-Data-Accurately" class="headerlink" title="Step 3: How Do You Parse Review Data Accurately?"></a>Step 3: How Do You Parse Review Data Accurately?</h2><p>This is where things get platform-specific. Each site structures its HTML differently, so the parser needs to be flexible. 
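</p><p>The parser shown in this step imports <code>Review</code> and <code>ReviewParser</code> from the project repository. Their definitions are not reproduced in this guide, so the following is only a minimal sketch of what such types might look like. It assumes <code>Review</code> is a dictionary-style record (the parser builds it from a dict literal and calls <code>.get()</code> on it) and that <code>ReviewParser</code> is an abstract base class with a single <code>parse</code> method. The field list is inferred from the keys the parser sets, and <code>EmptyParser</code> is a made-up subclass; the repository's actual code may differ.</p>

```python
from abc import ABC, abstractmethod
from typing import TypedDict


class Review(TypedDict, total=False):
    """Hypothetical review record; keys mirror those set by the Trustpilot parser."""
    source: str
    url: str
    rating: float
    text: str
    date: str
    verified: bool
    helpful_votes: int


class ReviewParser(ABC):
    """Hypothetical base class: one parser per platform, same interface."""

    @abstractmethod
    def parse(self, html: str, source_url: str = "") -> list[Review]:
        """Return every review found in the rendered HTML."""


class EmptyParser(ReviewParser):
    """Trivial subclass used only to illustrate the contract."""

    def parse(self, html: str, source_url: str = "") -> list[Review]:
        # A real parser would select review cards here and build Review dicts.
        return []
```

<p>With a shared base like this, each platform-specific parser only has to subclass it and implement <code>parse</code>. 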
The Trustpilot parser below is a good example of what that looks like in practice.</p><figure class="highlight python"><table><tr><td class="code"><pre><span class="line"><span class="keyword">import</span> re</span><br><span class="line"><span class="keyword">from</span> bs4 <span class="keyword">import</span> BeautifulSoup</span><br><span class="line"><span class="keyword">from</span> models <span class="keyword">import</span> Review</span><br><span class="line"><span class="keyword">from</span> parsers.base <span class="keyword">import</span> ReviewParser</span><br><span class="line"></span><br><span class="line"><span class="keyword">class</span> <span class="title class_">TrustpilotParser</span>(<span class="title class_ inherited__">ReviewParser</span>):</span><br><span class="line">    <span class="string">&quot;&quot;&quot;</span></span><br><span class="line"><span class="string">    Parse Trustpilot review cards from HTML.</span></span><br><span class="line"><span class="string">  
  Trustpilot structure: review cards in article/section containers.</span></span><br><span class="line"><span class="string">    Selectors may change; adapt for your target page structure.</span></span><br><span class="line"><span class="string">    &quot;&quot;&quot;</span></span><br><span class="line">    <span class="keyword">def</span> <span class="title function_">parse</span>(<span class="params">self, html: <span class="built_in">str</span>, source_url: <span class="built_in">str</span> = <span class="string">&quot;&quot;</span></span>) -&gt; <span class="built_in">list</span>[Review]:</span><br><span class="line">        reviews: <span class="built_in">list</span>[Review] = []</span><br><span class="line">        soup = BeautifulSoup(html, <span class="string">&quot;html.parser&quot;</span>)</span><br><span class="line">        <span class="comment"># Trustpilot: review cards use .review-card, [class*=&quot;reviewCard&quot;], etc.</span></span><br><span class="line">        cards = soup.select(</span><br><span class="line">            <span class="string">&quot;article[data-review-id], &quot;</span></span><br><span class="line">            <span class="string">&quot;[data-review-id], &quot;</span></span><br><span class="line">            <span class="string">&quot;.review-card, &quot;</span></span><br><span class="line">            <span class="string">&quot;[class*=&#x27;reviewCard&#x27;], &quot;</span></span><br><span class="line">            <span class="string">&quot;[class*=&#x27;review-card&#x27;], &quot;</span></span><br><span class="line">            <span class="string">&quot;section[data-review-id]&quot;</span></span><br><span class="line">        )</span><br><span class="line">        <span class="keyword">if</span> <span class="keyword">not</span> cards:</span><br><span class="line">            cards = soup.select(<span class="string">&#x27;[class*=&quot;review&quot;]&#x27;</span>)</span><br><span class="line">        <span 
class="keyword">for</span> card <span class="keyword">in</span> cards:</span><br><span class="line">            <span class="keyword">try</span>:</span><br><span class="line">                review = <span class="variable language_">self</span>._extract_review(card, source_url)</span><br><span class="line">                <span class="keyword">if</span> review <span class="keyword">and</span> review.get(<span class="string">&quot;text&quot;</span>):</span><br><span class="line">                    reviews.append(review)</span><br><span class="line">            <span class="keyword">except</span> Exception:</span><br><span class="line">                <span class="keyword">continue</span></span><br><span class="line">        <span class="keyword">return</span> reviews</span><br><span class="line">    <span class="keyword">def</span> <span class="title function_">_extract_review</span>(<span class="params">self, card, source_url: <span class="built_in">str</span></span>) -&gt; Review:</span><br><span class="line">        review: Review = &#123;<span class="string">&quot;source&quot;</span>: <span class="string">&quot;trustpilot&quot;</span>, <span class="string">&quot;url&quot;</span>: source_url&#125;</span><br><span class="line">        <span class="comment"># Rating: e.g. 
span with data-rating or class containing &quot;star&quot;</span></span><br><span class="line">        rating_el = card.select_one(</span><br><span class="line">            <span class="string">&quot;[data-rating], &quot;</span></span><br><span class="line">            <span class="string">&quot;[data-review-rating], &quot;</span></span><br><span class="line">            <span class="string">&quot;.star-rating, &quot;</span></span><br><span class="line">            <span class="string">&#x27;[class*=&quot;star&quot;][class*=&quot;rating&quot;]&#x27;</span></span><br><span class="line">        )</span><br><span class="line">        <span class="keyword">if</span> rating_el:</span><br><span class="line">            rating_str = (</span><br><span class="line">                rating_el.get(<span class="string">&quot;data-rating&quot;</span>)</span><br><span class="line">                <span class="keyword">or</span> rating_el.get(<span class="string">&quot;data-review-rating&quot;</span>)</span><br><span class="line">                <span class="keyword">or</span> rating_el.get_text(strip=<span class="literal">True</span>)</span><br><span class="line">            )</span><br><span class="line">            <span class="keyword">if</span> rating_str:</span><br><span class="line">                <span class="keyword">match</span> = re.search(<span class="string">r&quot;(\d+(?:\.\d+)?)&quot;</span>, <span class="built_in">str</span>(rating_str))</span><br><span class="line">                <span class="keyword">if</span> <span class="keyword">match</span>:</span><br><span class="line">                    review[<span class="string">&quot;rating&quot;</span>] = <span class="built_in">float</span>(<span class="keyword">match</span>.group(<span class="number">1</span>))</span><br><span class="line">        <span class="comment"># Text: review body (Trustpilot uses reviewContent, review-content__body)</span></span><br><span class="line">        text_el = 
card.select_one(</span><br><span class="line">            <span class="string">&quot;[data-review-body], &quot;</span></span><br><span class="line">            <span class="string">&quot;.review-content__body, &quot;</span></span><br><span class="line">            <span class="string">&quot;.review-body, &quot;</span></span><br><span class="line">            <span class="string">&quot;p.review-content, &quot;</span></span><br><span class="line">            <span class="string">&quot;[class*=&#x27;reviewContent&#x27;], &quot;</span></span><br><span class="line">            <span class="string">&quot;[class*=&#x27;review-content&#x27;]&quot;</span></span><br><span class="line">        )</span><br><span class="line">        <span class="keyword">if</span> text_el:</span><br><span class="line">            review[<span class="string">&quot;text&quot;</span>] = text_el.get_text(separator=<span class="string">&quot; &quot;</span>, strip=<span class="literal">True</span>)</span><br><span class="line">        <span class="comment"># Date</span></span><br><span class="line">        date_el = card.select_one(</span><br><span class="line">            <span class="string">&quot;[data-review-created-at], &quot;</span></span><br><span class="line">            <span class="string">&quot;time[datetime], &quot;</span></span><br><span class="line">            <span class="string">&quot;.review-date&quot;</span></span><br><span class="line">        )</span><br><span class="line">        <span class="keyword">if</span> date_el:</span><br><span class="line">            review[<span class="string">&quot;date&quot;</span>] = (</span><br><span class="line">                date_el.get(<span class="string">&quot;datetime&quot;</span>)</span><br><span class="line">                <span class="keyword">or</span> date_el.get(<span class="string">&quot;data-review-created-at&quot;</span>)</span><br><span class="line">                <span class="keyword">or</span> date_el.get_text(strip=<span 
class="literal">True</span>)</span><br><span class="line">            )</span><br><span class="line">        <span class="comment"># Verified</span></span><br><span class="line">        verified_el = card.select_one(</span><br><span class="line">            <span class="string">&#x27;[class*=&quot;verified&quot;], &#x27;</span></span><br><span class="line">            <span class="string">&#x27;[data-consumer-name]&#x27;</span></span><br><span class="line">        )</span><br><span class="line">        review[<span class="string">&quot;verified&quot;</span>] = verified_el <span class="keyword">is</span> <span class="keyword">not</span> <span class="literal">None</span></span><br><span class="line">        <span class="comment"># Helpful votes</span></span><br><span class="line">        helpful_el = card.select_one(</span><br><span class="line">            <span class="string">&quot;[data-review-helpful-count], &quot;</span></span><br><span class="line">            <span class="string">&#x27;[class*=&quot;helpful&quot;]&#x27;</span></span><br><span class="line">        )</span><br><span class="line">        <span class="keyword">if</span> helpful_el:</span><br><span class="line">            txt = helpful_el.get_text(strip=<span class="literal">True</span>) <span class="keyword">or</span> helpful_el.get(<span class="string">&quot;data-review-helpful-count&quot;</span>, <span class="string">&quot;&quot;</span>)</span><br><span class="line">            <span class="keyword">match</span> = re.search(<span class="string">r&quot;(\d+)&quot;</span>, <span class="built_in">str</span>(txt))</span><br><span class="line">            <span class="keyword">if</span> <span class="keyword">match</span>:</span><br><span class="line">                review[<span class="string">&quot;helpful_votes&quot;</span>] = <span class="built_in">int</span>(<span class="keyword">match</span>.group(<span class="number">1</span>))</span><br><span class="line">        <span 
class="keyword">return</span> review</span><br></pre></td></tr></table></figure><p>Full code: ScraperHub → <a href="https://github.com/ScraperHub/how-to-scrape-customer-reviews/blob/main/parsers/trustpilot.py">parsers&#x2F;trustpilot.py</a></p><h2 id="Step-4-How-Do-You-Normalize-the-Review-Scraping-Data"><a href="#Step-4-How-Do-You-Normalize-the-Review-Scraping-Data" class="headerlink" title="Step 4: How Do You Normalize the Review Scraping Data?"></a>Step 4: How Do You Normalize the Review Scraping Data?</h2><p>By the time reviews are parsed, most fields are already structured. The parser extracts ratings, text, dates, and other attributes into a consistent format.</p><p>However, if you’re working across multiple platforms, you may need an additional normalization step.</p><p>Typical adjustments include:</p><ul><li>Converting ratings to a common scale (e.g., 1–10 → 1–5)</li><li>Parsing relative dates into a standard format</li><li>Aligning field names across sources</li></ul><p>For a single platform, the parser is usually enough. For multi-platform analysis, this step ensures your data stays comparable.</p><h2 id="Step-5-How-to-Store-Reviews-for-Data-Analysis"><a href="#Step-5-How-to-Store-Reviews-for-Data-Analysis" class="headerlink" title="Step 5: How to Store Reviews for Data Analysis"></a>Step 5: How to Store Reviews for Data Analysis</h2><p>Once reviews are parsed and normalized, you need to store them in a format that’s easy to process later.</p><p>A simple and practical choice is JSONL (JSON Lines). Each review is written as a single line, which makes it easy to stream into analytics tools or data pipelines.</p><p>The full implementation, including how this connects to the pipeline, is available on ScraperHub → <a href="https://github.com/ScraperHub/how-to-scrape-customer-reviews/blob/main/storage.py">storage.py</a></p><p>If you plan to scale further, you can replace JSONL with a database or data warehouse later. 
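</p><p>Steps 4 and 5 can be sketched together: rescale ratings to a common range, then append each review as one JSON line. The helper names below (<code>normalize_rating</code>, <code>append_jsonl</code>, <code>read_jsonl</code>) are hypothetical and are not taken from the repository's <code>storage.py</code>. The linear rescaling is just one possible way to map a 1–10 scale onto 1–5.</p>

```python
import json
from pathlib import Path


def normalize_rating(rating: float, scale_max: float, target_max: float = 5.0) -> float:
    """Linearly rescale a rating from a 1..scale_max range to 1..target_max."""
    if scale_max <= 1:
        raise ValueError("scale_max must be greater than 1")
    fraction = (rating - 1) / (scale_max - 1)  # position within the source range
    return round(1 + fraction * (target_max - 1), 2)


def append_jsonl(reviews: list[dict], path: str) -> int:
    """Append each review as one JSON object per line; return the count written."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("a", encoding="utf-8") as fh:
        for review in reviews:
            # ensure_ascii=False keeps non-English review text readable on disk
            fh.write(json.dumps(review, ensure_ascii=False) + "\n")
    return len(reviews)


def read_jsonl(path: str) -> list[dict]:
    """Load every non-empty line of a JSONL file back into a list of dicts."""
    with Path(path).open("r", encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

<p>The write step is deliberately small so that it can be swapped out. 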
The rest of the pipeline doesn’t need to change.</p><h2 id="Step-6-Scale-across-many-products"><a href="#Step-6-Scale-across-many-products" class="headerlink" title="Step 6: Scale Across Many Products"></a>Step 6: Scale Across Many Products</h2><p>The challenge starts when you need to collect reviews across dozens or hundreds of URLs.</p><p>At this point, simple loops and local scripts become harder to manage. You need to deal with concurrency, retries, and request scheduling. This is where the <a href="https://crawlbase.com/docs/crawler/">Crawlbase Enterprise Crawler</a> comes in.</p><p>Instead of sending requests one by one, you push a list of URLs to Crawlbase, and the crawler processes them in the cloud. Each page is fetched, rendered, and delivered back to your system through a webhook.</p><p>Switching to this mode only requires a couple of parameters:</p><figure class="highlight python"><table><tr><td class="code"><pre><span class="line">params[<span class="string">&quot;callback&quot;</span>] = <span class="string">&quot;true&quot;</span>  <span class="comment"># sent as callback=true in the query string</span></span><br><span class="line">params[<span class="string">&quot;crawler&quot;</span>] = <span class="string">&quot;MyReviewCrawler&quot;</span>  <span class="comment"># custom name for your crawler</span></span><br></pre></td></tr></table></figure><p>From there:</p><ul><li>URLs are processed in parallel</li><li>Failed requests are retried automatically</li><li>You receive results asynchronously via webhook</li></ul><p>You no longer need to manage queues or scaling logic in your own code.</p><p>This setup works well for:</p><ul><li>monitoring multiple products or brands</li><li>collecting large datasets for sentiment analysis</li><li>running scheduled review scrapes over time</li></ul><p>If you’re only working with a handful of pages, the Crawling API is enough. 
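</p><p>For a list of product pages, that switch might look like the sketch below. It only builds the query parameters for each URL and sends nothing. The function name, crawler name, and URLs are placeholders, and the string <code>"true"</code> is used for <code>callback</code> on the assumption that the API expects a lowercase query value.</p>

```python
def build_crawler_requests(token: str, urls: list[str], crawler_name: str) -> list[dict]:
    """Build one parameter set per URL for asynchronous (webhook) delivery."""
    seen: set[str] = set()
    request_params: list[dict] = []
    for url in urls:
        url = url.strip()
        if not url or url in seen:
            continue  # skip blanks and duplicates so each page is pushed once
        seen.add(url)
        request_params.append({
            "token": token,
            "url": url,
            "callback": "true",       # ask for asynchronous delivery
            "crawler": crawler_name,  # hypothetical crawler name from your dashboard
        })
    return request_params
```

<p>Each dictionary can then be passed as <code>params</code> to the same GET request used in Step 1. 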
Once you move into larger datasets, the Enterprise Crawler removes most of the operational overhead.</p><h2 id="Full-Working-Implementation-ScraperHub"><a href="#Full-Working-Implementation-ScraperHub" class="headerlink" title="Full Working Implementation (ScraperHub)"></a>Full Working Implementation (ScraperHub)</h2><p>The examples in this guide focus on individual parts of the pipeline. If you want the complete project with simplified setup instructions, you can go straight to <a href="https://github.com/ScraperHub/how-to-scrape-customer-reviews/blob/main/README.md">Review Scraper README</a>.</p><p>The repository includes the same pipeline covered in this guide, so it’s easy to map each step:</p><table><thead><tr><th>File</th><th>What it does</th></tr></thead><tbody><tr><td><a href="https://github.com/ScraperHub/how-to-scrape-customer-reviews/blob/main/config.py">config.py</a></td><td>Handles configuration like API token, base URL, timeouts, and retries</td></tr><tr><td><a href="https://github.com/ScraperHub/how-to-scrape-customer-reviews/blob/main/models.py">models.py</a></td><td>Defines the structure of a review object (schema)</td></tr><tr><td><a href="https://github.com/ScraperHub/how-to-scrape-customer-reviews/blob/main/fetcher.py">fetcher.py</a></td><td>Sends requests through Crawlbase and retrieves rendered HTML</td></tr><tr><td><a href="https://github.com/ScraperHub/how-to-scrape-customer-reviews/blob/main/pagination.py">pagination.py</a></td><td>Generates and manages paginated URLs</td></tr><tr><td><a href="https://github.com/ScraperHub/how-to-scrape-customer-reviews/blob/main/storage.py">storage.py</a></td><td>Saves extracted reviews into JSONL format</td></tr><tr><td><a href="https://github.com/ScraperHub/how-to-scrape-customer-reviews/blob/main/parsers/base.py">parsers&#x2F;base.py</a></td><td>Base class for building custom review parsers</td></tr><tr><td><a 
href="https://github.com/ScraperHub/how-to-scrape-customer-reviews/blob/main/parsers/trustpilot.py">parsers&#x2F;trustpilot.py</a></td><td>Parser for extracting Trustpilot review data</td></tr><tr><td><a href="https://github.com/ScraperHub/how-to-scrape-customer-reviews/blob/main/main.py">main.py</a></td><td>Runs the full pipeline from fetch to parse to storage</td></tr></tbody></table><h2 id="What-are-the-Business-Applications-of-Review-Scraping"><a href="#What-are-the-Business-Applications-of-Review-Scraping" class="headerlink" title="What are the Business Applications of Review Scraping?"></a>What are the Business Applications of Review Scraping?</h2><p>Once reviews are structured, they become more than just text. You can actually use them to make decisions.</p><h3 id="Competitive-analysis"><a href="#Competitive-analysis" class="headerlink" title="Competitive analysis"></a>Competitive analysis</h3><p>Compare ratings and sentiment across competitors. Look beyond averages and identify what users consistently complain about or praise.</p><h3 id="Product-improvement"><a href="#Product-improvement" class="headerlink" title="Product improvement"></a>Product improvement</h3><p>Group negative feedback by topic. Patterns like shipping issues or product defects become obvious once you have enough data.</p><h3 id="Brand-monitoring"><a href="#Brand-monitoring" class="headerlink" title="Brand monitoring"></a>Brand monitoring</h3><p>Track rating trends over time. A sudden drop in reviews or spike in negative feedback usually points to a real issue.</p><h3 id="Sentiment-over-time"><a href="#Sentiment-over-time" class="headerlink" title="Sentiment over time"></a>Sentiment over time</h3><p>Run the scraper regularly and measure how perception changes. 
This helps you see if updates or fixes are actually improving user experience.</p><h2 id="Conclusion"><a href="#Conclusion" class="headerlink" title="Conclusion"></a>Conclusion</h2><p>Scraping customer reviews is not just about collecting text. It’s about building a structured pipeline that turns raw feedback into measurable insights.</p><p>Once you normalize review data across platforms, you can track sentiment, identify product gaps, and respond to market changes faster than competitors. The challenge is not extraction alone. It’s handling rendering, pagination, blocking, and scaling without constant maintenance.</p><p>Start simple:</p><ol><li>Create a free Crawlbase account</li><li>Fetch a review page using the Crawling API</li><li>Parse structured data using a platform-specific parser</li><li>Store results for analysis</li></ol><p>As your dataset grows, move to the Enterprise Crawler to handle thousands of URLs without managing infrastructure.</p><p><a href="https://crawlbase.com/signup?signup=blog">Create a free account</a> now to run your first review extraction.</p><h2 id="Frequently-Asked-Questions"><a href="#Frequently-Asked-Questions" class="headerlink" title="Frequently Asked Questions"></a>Frequently Asked Questions</h2><h3 id="How-do-you-scrape-reviews-from-JavaScript-heavy-sites"><a href="#How-do-you-scrape-reviews-from-JavaScript-heavy-sites" class="headerlink" title="How do you scrape reviews from JavaScript-heavy sites?"></a>How do you scrape reviews from JavaScript-heavy sites?</h3><p>These platforms render content in the browser using frameworks like React. A standard HTTP request returns incomplete HTML.</p><p>You need a browser-based crawler. Crawlbase provides a <a href="https://crawlbase.com/docs/crawling-api/headless-browsers/">JavaScript request</a> (made with the JavaScript token) that loads the page in a real browser session, so all review content is fully loaded before extraction.</p><h3 id="Can-I-use-this-data-for-sentiment-analysis-and-machine-learning"><a href="#Can-I-use-this-data-for-sentiment-analysis-and-machine-learning" class="headerlink" title="Can I use this data for sentiment analysis and machine learning?"></a>Can I use this data for sentiment analysis and machine learning?</h3><p>Yes. 
Structured review data is commonly used for:</p><ul><li>Sentiment classification</li><li>Topic modeling</li><li>Feature extraction</li><li>Trend analysis over time</li></ul><p>Once reviews are normalized into a consistent schema, they can be fed directly into NLP pipelines or BI tools.</p><h3 id="Do-I-need-proxies-to-scrape-review-sites"><a href="#Do-I-need-proxies-to-scrape-review-sites" class="headerlink" title="Do I need proxies to scrape review sites?"></a>Do I need proxies to scrape review sites?</h3><p>If you build your own scraper, yes.</p><p>You will need:</p><ul><li>Rotating proxies</li><li>CAPTCHA handling</li><li>Browser automation</li></ul><p>Crawlbase removes this requirement by handling IP rotation, anti-bot mitigation, and rendering automatically, so you can focus on parsing and analysis instead of infrastructure.</p>]]></content>
    
    
    <summary type="html">&lt;p&gt;To scrape customer reviews at scale, you need to render JavaScript-heavy pages, systematically collect all review pages or scroll-loaded content, and extract key fields like rating, text, and date into a structured format.&lt;/p&gt;</summary>
    
    
    
    <category term="crawling and scraping learning" scheme="https://crawlbase.com/blog/categories/crawling-and-scraping-learning/"/>
    
    
    <category term="scrape amazon reviews" scheme="https://crawlbase.com/blog/tags/scrape-amazon-reviews/"/>
    
    <category term="scrape customer reviews" scheme="https://crawlbase.com/blog/tags/scrape-customer-reviews/"/>
    
    <category term="customer review scraping" scheme="https://crawlbase.com/blog/tags/customer-review-scraping/"/>
    
  </entry>
  
  <entry>
    <title>Smart Proxy vs AI Proxy: What Changed in Crawlbase&#39;s Upgrade</title>
    <link href="https://crawlbase.com/blog/smart-proxy-vs-ai-proxy/"/>
    <id>https://crawlbase.com/blog/smart-proxy-vs-ai-proxy/</id>
    <published>2026-03-26T09:57:10.000Z</published>
    <updated>2026-04-24T11:53:23.927Z</updated>
    
    <content type="html"><![CDATA[<p>The difference between AI Proxy and Smart Proxy comes down to how the system responds to modern anti-bot detection. Smart Proxy was built to solve IP-based blocking through managed IP pools and automatic rotation. For years, that approach worked reliably. But as anti-bot platforms evolved to analyze request fingerprints, session behavior, and timing patterns, IP rotation alone stopped being sufficient.</p><span id="more"></span><p><strong>Smart AI Proxy is the direct evolution of Smart Proxy</strong>. It keeps the same foundation (managed IP pools, automatic rotation, residential and datacenter coverage) and replaces the static rule layer with adaptive machine learning. The result is a proxy that responds to how targets actually behave, not just how they were configured to behave at setup.</p><p>This guide explains what specifically changed from Smart Proxy to <a href="https://crawlbase.com/smart-proxy">Smart AI Proxy</a>, and why those changes matter for production data collection, including why Smart AI Proxy outperforms competing tools that still rely on rule-based proxy architectures.</p><div class="callout-banner">  <div class="banner-header">    <img src="/blog/images/flashlight-icon-blue.png" srcset="/blog/images/flashlight-icon-blue.png 1x, /blog/images/flashlight-icon-blue@2x.png 2x" alt="Flashlight Icon"/>    <h2 class="banner-header-label">Stop Wrestling with Proxy Lists</h2>  </div>  <p class="banner-body">Our Smart AI Proxy uses machine learning to automatically rotate 1M+ residential and datacenter IPs, bypassing CAPTCHA and blocks for you.</p>  <div class="banner-footer">    <a href="https://crawlbase.com/signup?signup=blog-callout-cta" title="Claim 5,000 Free Credits">Claim 5,000 Free Credits</a>    <img src="/blog/images/arrow-right-double-green.png" srcset="/blog/images/arrow-right-double-green.png 1x, /blog/images/arrow-right-double-green@2x.png 2x" alt="Arrow right double Icon"/>  </div></div><h2 id="What-Smart-Proxy-Did-Well"><a 
href="#What-Smart-Proxy-Did-Well" class="headerlink" title="What Smart Proxy Did Well"></a>What Smart Proxy Did Well</h2><p>Smart Proxy solved the core problem of IP-based blocking reliably. By routing requests through a large rotating pool of residential and datacenter IPs, it made it difficult for targets to block traffic based on IP reputation alone.</p><p>For targets with basic defenses (static IP blocklists, simple rate limiting by IP, or minimal bot detection), Smart Proxy produced strong success rates with straightforward configuration. It was operationally simple: define your rotation rules, set retry logic, select geography, and run.</p><p>That simplicity was a genuine advantage. Teams could get high-volume data collection running quickly without deep proxy expertise, and it performed well against a broad range of targets.</p><h2 id="What-Smart-AI-Proxy-Changes"><a href="#What-Smart-AI-Proxy-Changes" class="headerlink" title="What Smart AI Proxy Changes"></a>What Smart AI Proxy Changes</h2><p><a href="https://crawlbase.com/blog/what-is-an-ai-proxy/">Smart AI Proxy</a> addresses the failure modes of static, rule-based proxying directly by replacing the static rule layer with three adaptive capabilities. For a full technical breakdown, see how <a href="https://crawlbase.com/blog/how-ai-proxies-work/">AI proxy technology works</a>.</p><h3 id="Adaptive-Request-Fingerprinting"><a href="#Adaptive-Request-Fingerprinting" class="headerlink" title="Adaptive Request Fingerprinting"></a>Adaptive Request Fingerprinting</h3><p>Instead of sending requests with a fixed fingerprint profile, Smart AI Proxy generates browser-realistic fingerprints and adapts them based on target feedback. 
If a fingerprint configuration starts triggering blocks, the system detects the pattern and rotates to a different profile automatically — without any manual intervention.</p><h3 id="Intelligent-Block-Handling"><a href="#Intelligent-Block-Handling" class="headerlink" title="Intelligent Block Handling"></a>Intelligent Block Handling</h3><p>Smart Proxy retried blocked requests by rotating the IP. Smart AI Proxy classifies the block type first. A CAPTCHA, a soft redirect, a honeypot response, and a rate limit are different failure signals — and they require different responses. Smart AI Proxy identifies what triggered the block and selects the appropriate counter-configuration: adjusting the fingerprint, cycling the session, changing the IP type, or modifying request timing.</p><h3 id="Automated-Session-Management"><a href="#Automated-Session-Management" class="headerlink" title="Automated Session Management"></a>Automated Session Management</h3><p>Smart AI Proxy manages session-level behavior to replicate realistic human browsing patterns — variable request timing, cookie state continuity, natural navigation sequences. This directly addresses behavioral fingerprinting, which Smart Proxy’s rule-based session handling wasn’t designed to counter.</p><h2 id="The-Operational-Difference-Between-Smart-Proxy-and-AI-Proxy"><a href="#The-Operational-Difference-Between-Smart-Proxy-and-AI-Proxy" class="headerlink" title="The Operational Difference Between Smart Proxy and AI Proxy"></a>The Operational Difference Between Smart Proxy and AI Proxy</h2><p>Beyond the technical capabilities, the upgrade changes the operational model significantly.</p><p>With Smart Proxy, operational overhead scaled with target complexity. More targets, more rules to maintain. A target updating its anti-bot stack meant a manual tuning cycle to restore performance. 
That burden fell on your engineering team: time spent on proxy configuration rather than on the data pipeline itself.</p><p>With Smart AI Proxy, the adaptive layer absorbs that work. The system builds per-target models continuously and updates them as targets change. Your team doesn’t need to identify when a target has updated its defenses; the proxy detects it from the shift in success rates and adjusts automatically.</p><p>The result: operational overhead stays low regardless of how many targets you’re running or how frequently they change.</p><h3 id="What-Stays-the-Same"><a href="#What-Stays-the-Same" class="headerlink" title="What Stays the Same"></a>What Stays the Same</h3><p>The upgrade to Smart AI Proxy doesn’t change the fundamentals of how you integrate with Crawlbase infrastructure. The same endpoint structure, the same IP pool coverage, and the same geo-selection capabilities all carry over. The intelligence layer sits behind the interface, not on top of it.</p><p>If you were running data collection workflows on Smart Proxy, the transition to Smart AI Proxy is an upgrade in capability, not a rebuild of your integration.</p><h2 id="Side-by-Side-Before-and-After-the-Upgrade"><a href="#Side-by-Side-Before-and-After-the-Upgrade" class="headerlink" title="Side-by-Side: Before and After the Upgrade"></a>Side-by-Side: Before and After the Upgrade</h2><table><thead><tr><th>Capability</th><th>Smart Proxy</th><th>Smart AI Proxy</th></tr></thead><tbody><tr><td>IP rotation</td><td>Rule-based</td><td>Adaptive</td></tr><tr><td>Request fingerprinting</td><td>Fixed</td><td>Dynamic, ML-driven</td></tr><tr><td>Block handling</td><td>Retry on IP rotation</td><td>Type-aware, intelligent response</td></tr><tr><td>Session management</td><td>Rule-based</td><td>Behavioral, human-realistic</td></tr><tr><td>Per-target optimization</td><td>Manual configuration</td><td>Automated model learning</td></tr><tr><td>Operational overhead at scale</td><td>Grows with target 
complexity</td><td>Stays low</td></tr><tr><td>Anti-bot platform performance</td><td>Degrades on hardened targets</td><td>Maintains high success rates</td></tr></tbody></table><div class="secondary-cta-banner">  <div class="gradient-bg">    <h3 class="banner-title">Get a <span class="text-underline">Free Smart AI Proxy Trial</span></h3>    <p class="banner-desc">Leverage 5,000 free credits, 140M rotating proxies, and AI to bypass CAPTCHAs and avoid blocks.</p>    <div class="banner-features">      <ul class="features-list">        <li>Unlimited Bandwidth</li>        <li>Custom Geolocalization</li>        <li>100% Network Uptime</li>      </ul>      <a class="banner-btn" href="/signup?signup=blog-smart-cta" title="Get 5,000 Free Credits" onclick="gtag('event', 'smart_cta_click', { 'blog_group': 'smart_proxy', 'blog_slug': 'smart-proxy-vs-ai-proxy', 'cta_type': 'try_smart_proxy', 'cta_position': 'top','cta_version': 'smart_proxy_v2'});">Get 5,000 Free Credits</a>    </div>  </div>  </div><h2 id="Most-Web-Scrapers-Are-Still-at-the-Smart-Proxy-Level"><a href="#Most-Web-Scrapers-Are-Still-at-the-Smart-Proxy-Level" class="headerlink" title="Most Web Scrapers Are Still at the Smart Proxy Level"></a>Most Web Scrapers Are Still at the Smart Proxy Level</h2><p>While Crawlbase has moved to Smart AI Proxy, most web scraping tools and proxy providers in the market are still offering what is effectively smart proxy technology — rule-based rotation, static fingerprinting, and manual configuration logic. The terminology varies: some call it “smart proxy,” others use labels like “premium residential proxy” or “managed rotating proxy”, but the underlying architecture is the same.</p><p>This matters practically. If you’re evaluating proxy providers based on feature lists, the gap isn’t always visible on the surface. 
The difference shows up in production: declining success rates against hardened targets, increasing manual configuration overhead, and failure modes that require engineering time to diagnose and fix.</p><p>Choosing <a href="https://crawlbase.com/signup?signup=blog">Crawlbase Smart AI Proxy</a> means choosing infrastructure that has already solved the problems that rule-based proxies consistently hit, not infrastructure that will require you to solve them manually as your target set grows.</p><h3 id="Why-the-AI-Proxy-Upgrade-Matters"><a href="#Why-the-AI-Proxy-Upgrade-Matters" class="headerlink" title="Why the AI Proxy Upgrade Matters"></a>Why the AI Proxy Upgrade Matters</h3><p>Smart Proxy was the right tool for the proxy landscape it was built for. Smart AI Proxy is built for the landscape that exists now, where <a href="https://www.forbes.com/councils/forbestechcouncil/2026/02/24/what-is-an-ai-proxy-how-it-works-and-key-use-cases/">IP rotation</a> alone isn’t sufficient, where behavioral analysis is standard, and where maintaining high success rates requires intelligence, not just infrastructure.</p><p>The upgrade isn’t a feature addition. It’s a fundamental change in how the proxy responds to the web, from following rules to learning from outcomes.</p><p>Explore <a href="https://crawlbase.com/smart-proxy">Crawlbase Smart AI Proxy</a> to see how it fits your data collection infrastructure.</p><h3 id="Frequently-Asked-Questions"><a href="#Frequently-Asked-Questions" class="headerlink" title="Frequently Asked Questions"></a>Frequently Asked Questions</h3><h3 id="Is-Smart-AI-Proxy-a-completely-different-product-from-Smart-Proxy"><a href="#Is-Smart-AI-Proxy-a-completely-different-product-from-Smart-Proxy" class="headerlink" title="Is Smart AI Proxy a completely different product from Smart Proxy?"></a>Is Smart AI Proxy a completely different product from Smart Proxy?</h3><p>No, it’s a direct upgrade. 
Smart AI Proxy is built on the same infrastructure as Smart Proxy, with an adaptive intelligence layer replacing the static rule-based logic. The integration interface and IP pool coverage carry over.</p><h3 id="Do-I-need-to-reconfigure-my-existing-Smart-Proxy-setup"><a href="#Do-I-need-to-reconfigure-my-existing-Smart-Proxy-setup" class="headerlink" title="Do I need to reconfigure my existing Smart Proxy setup?"></a>Do I need to reconfigure my existing Smart Proxy setup?</h3><p>The transition is designed to be minimal on the integration side. The adaptive layer operates behind the endpoint, so the changes in how the proxy behaves don’t require rebuilding your data pipeline.</p><h3 id="Does-Smart-AI-Proxy-cost-more"><a href="#Does-Smart-AI-Proxy-cost-more" class="headerlink" title="Does Smart AI Proxy cost more?"></a>Does Smart AI Proxy cost more?</h3><p>The pricing reflects the additional capability. The more relevant comparison for production workloads is total cost: Smart AI Proxy’s automated block handling and reduced configuration overhead typically save significant engineering time compared to manually maintaining Smart Proxy configurations against hardened targets.</p><h3 id="How-quickly-does-Smart-AI-Proxy-adapt-when-a-target-changes-its-anti-bot-stack"><a href="#How-quickly-does-Smart-AI-Proxy-adapt-when-a-target-changes-its-anti-bot-stack" class="headerlink" title="How quickly does Smart AI Proxy adapt when a target changes its anti-bot stack?"></a>How quickly does Smart AI Proxy adapt when a target changes its anti-bot stack?</h3><p>Adaptation is continuous. 
When a target updates its detection logic, the system identifies the change from the shift in success rates and begins adjusting automatically within the same session cycle, without manual intervention.</p><h3 id="Is-Smart-AI-Proxy-suitable-for-simpler-targets-too"><a href="#Is-Smart-AI-Proxy-suitable-for-simpler-targets-too" class="headerlink" title="Is Smart AI Proxy suitable for simpler targets, too?"></a>Is Smart AI Proxy suitable for simpler targets, too?</h3><p>Yes. The adaptive layer adds capability without adding complexity on the user side. Whether your targets are simple or hardened, Smart AI Proxy handles the configuration layer automatically.</p>]]></content>
    
    
    <summary type="html">&lt;p&gt;The difference between AI Proxy and Smart Proxy comes down to how the system responds to modern anti-bot detection. Smart Proxy was built to solve IP-based blocking through managed IP pools and automatic rotation. For years, that approach worked reliably. But as anti-bot platforms evolved to analyze request fingerprints, session behavior, and timing patterns, IP rotation alone stopped being sufficient.&lt;/p&gt;</summary>
    
    
    
    <category term="crawling and scraping learning" scheme="https://crawlbase.com/blog/categories/crawling-and-scraping-learning/"/>
    
    
    <category term="smart proxy vs ai proxy" scheme="https://crawlbase.com/blog/tags/smart-proxy-vs-ai-proxy/"/>
    
    <category term="crawlbase smart ai proxy" scheme="https://crawlbase.com/blog/tags/crawlbase-smart-ai-proxy/"/>
    
    <category term="ai-powered proxy" scheme="https://crawlbase.com/blog/tags/ai-powered-proxy/"/>
    
    <category term="ai proxies" scheme="https://crawlbase.com/blog/tags/ai-proxies/"/>
    
  </entry>
  
  <entry>
    <title>Build a Website Change Tracker With Crawlbase and Python</title>
    <link href="https://crawlbase.com/blog/how-to-build-website-change-tracker-with-python/"/>
    <id>https://crawlbase.com/blog/how-to-build-website-change-tracker-with-python/</id>
    <published>2026-03-11T20:15:50.000Z</published>
    <updated>2026-04-24T11:53:23.231Z</updated>
    
    <content type="html"><![CDATA[<p><strong>In one sentence:</strong> This tutorial shows you how to build a Python website monitoring script that fetches pages via Crawlbase, generates SHA-256 content fingerprints, and alerts you when anything changes; no proxy infrastructure required.</p><p>To build a website change tracker, the easiest approach is to compare it with a previous version. A script fetches the page, extracts the relevant text, and generates a fingerprint from that content. The next time it runs, it performs the same steps again and checks whether the fingerprint still matches. If it does not, something on the page changed.</p><span id="more"></span><p>We will build a Python script that handles this workflow in this tutorial. It retrieves page HTML through the <a href="https://crawlbase.com/crawling-api-avoid-captchas-blocks">Crawlbase Crawling API</a>, extracts the readable text from the page, and generates a SHA-256 hash of the cleaned content. That hash is stored locally so the script can compare it the next time the page is checked.</p><p>By the end, you’ll have a working change tracker that can monitor one or several URLs, store snapshots, output structured results, and run automatically on a schedule.</p><h2 id="How-Website-Change-Tracking-Works"><a href="#How-Website-Change-Tracking-Works" class="headerlink" title="How Website Change Tracking Works"></a>How Website Change Tracking Works</h2><p>Website change tracking follows a repeatable six-step pipeline that converts raw page content into a comparable signal.</p><ul><li><strong>Step 1 — Fetch the page content.</strong> Retrieve the full HTML of the target URL. Using a reliable API like Crawlbase avoids blocks and ensures JavaScript-rendered content is included.</li><li><strong>Step 2 — Extract the part of the page you want to monitor.</strong> Strip out navigation, scripts, footers, and ads. 
You want only the meaningful body text.</li><li><strong>Step 3 — Normalize the text.</strong> Collapse whitespace, remove formatting artifacts, and standardize encoding so that cosmetic changes don’t trigger false positives.</li><li><strong>Step 4 — Generate a content fingerprint.</strong> A content fingerprint is a fixed-length cryptographic hash (SHA-256 in this tutorial) derived from the cleaned page text. Even a single word change produces a completely different hash, making fingerprints a fast and storage-efficient way to detect updates.</li><li><strong>Step 5 — Compare with the stored fingerprint.</strong> Load the fingerprint saved from the last run and compare it to the one you just generated. If they differ, the page has changed.</li><li><strong>Step 6 — Record or report the result.</strong> Save the new fingerprint for the next run and optionally emit a diff showing exactly what changed.</li></ul><p>The main challenge is avoiding false positives. Raw HTML often includes elements that change frequently, such as scripts, advertisements, timestamps, or dynamic widgets. Comparing cleaned text instead of raw HTML produces more accurate results.</p><h2 id="Why-Use-Crawlbase-for-Page-Tracking"><a href="#Why-Use-Crawlbase-for-Page-Tracking" class="headerlink" title="Why Use Crawlbase for Page Tracking"></a>Why Use Crawlbase for Page Tracking</h2><p>You could build a tracking script using <a href="https://www.codecademy.com/article/what-is-http">direct HTTP requests</a>, but many websites block or throttle automated requests. 
Some pages also rely heavily on JavaScript, meaning the raw HTML returned by a standard request may not contain the actual content.</p><p>Crawlbase solves these problems by handling page retrieval for you.</p><p>Key advantages include:</p><ul><li><strong>Reliable page retrieval</strong> across a wide range of websites</li><li><strong>Built-in handling</strong> for blocking, throttling, and CAPTCHAs</li><li><strong>JavaScript rendering</strong> via the JS token</li><li><strong>No proxy infrastructure</strong> to manage or maintain</li><li><strong>Consistent HTML output</strong> that’s suitable for repeatable comparison</li></ul><p>Your monitoring script focuses only on extracting and comparing content while Crawlbase acts as the retrieval layer.</p><h2 id="Prerequisites-and-Technical-Requirements"><a href="#Prerequisites-and-Technical-Requirements" class="headerlink" title="Prerequisites and Technical Requirements"></a>Prerequisites and Technical Requirements</h2><p>Before starting, make sure your environment includes the following.</p><p><strong>Environment requirements:</strong></p><table><thead><tr><th>Requirement</th><th>Detail</th></tr></thead><tbody><tr><td><a href="https://www.python.org/">Python</a></td><td>Version 3.10 or later</td></tr><tr><td><a href="https://crawlbase.com/dashboard/account/docs">Crawlbase API token</a></td><td>Free tier includes 1,000 requests</td></tr><tr><td>Operating system</td><td>Linux, macOS, or Windows</td></tr></tbody></table><p>The tutorial uses these Python packages and standard-library modules:</p><table><thead><tr><th>Package</th><th>Purpose</th></tr></thead><tbody><tr><td><a href="https://pypi.org/project/requests/">requests</a></td><td>HTTP requests to the Crawlbase API</td></tr><tr><td><a href="https://beautiful-soup-4.readthedocs.io/en/latest/">beautifulsoup4</a></td><td>HTML parsing and text extraction</td></tr><tr><td><a href="https://www.w3schools.com/python/ref_module_hashlib.asp">hashlib</a></td><td>SHA-256 fingerprint 
generation</td></tr><tr><td><a href="https://www.w3schools.com/python/python_json.asp">json</a></td><td>Local snapshot storage</td></tr><tr><td><a href="https://www.w3schools.com/python/ref_module_difflib.asp">difflib</a></td><td>Generating human-readable diffs</td></tr></tbody></table><h3 id="Step-1-Install-Dependencies"><a href="#Step-1-Install-Dependencies" class="headerlink" title="Step 1: Install Dependencies"></a>Step 1: Install Dependencies</h3><p>From the project directory, download <a href="https://github.com/ScraperHub/website-change-monitoring/blob/main/requirements.txt">requirements.txt</a>, and run:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">pip install -r requirements.txt</span><br></pre></td></tr></table></figure><p>This will install dependencies such as requests (v2.28.0) and beautifulsoup4 (v4.11.0).</p><h3 id="Step-2-Fetch-a-Web-Page-Using-Crawlbase"><a href="#Step-2-Fetch-a-Web-Page-Using-Crawlbase" class="headerlink" title="Step 2: Fetch a Web Page Using Crawlbase"></a>Step 2: Fetch a Web Page Using Crawlbase</h3><p>The next step is verifying that you can retrieve the page HTML successfully.</p><p>The script sends a request to the Crawlbase Crawling API and returns the response content.</p><p>Get the complete code example on ScraperHub - <a href="https://github.com/ScraperHub/website-change-monitoring/blob/main/fetch.py">fetch.py</a></p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">def</span> <span class="title function_">fetch_page</span>(<span class="params">url: <span class="built_in">str</span>, 
token: <span class="built_in">str</span> | <span class="literal">None</span> = <span class="literal">None</span></span>) -&gt; <span class="built_in">str</span>:</span><br><span class="line">    api_token = token <span class="keyword">or</span> os.environ.get(<span class="string">&quot;CRAWLBASE_TOKEN&quot;</span>, <span class="string">&quot;&quot;</span>)</span><br><span class="line">    <span class="keyword">if</span> <span class="keyword">not</span> api_token:</span><br><span class="line">        <span class="keyword">raise</span> ValueError(<span class="string">&quot;Crawlbase token required: set CRAWLBASE_TOKEN or pass token=&quot;</span>)</span><br><span class="line">    api_url = <span class="string">f&quot;<span class="subst">&#123;CRAWLBASE_API_URL&#125;</span>/?token=<span class="subst">&#123;api_token&#125;</span>&amp;url=<span class="subst">&#123;quote(url)&#125;</span>&quot;</span></span><br><span class="line">    response = requests.get(api_url, timeout=<span class="number">30</span>)</span><br><span class="line">    response.raise_for_status()</span><br><span class="line">    <span class="keyword">return</span> response.text</span><br></pre></td></tr></table></figure><p>This function:</p><p>• Reads the Crawlbase token<br>• Sends the target URL to the Crawling API<br>• Retrieves the page HTML<br>• Returns the content for processing</p><p>Using Crawlbase ensures the monitoring and tracking tool receives reliable HTML output.</p><h3 id="Step-3-Extract-the-Content-to-Track"><a href="#Step-3-Extract-the-Content-to-Track" class="headerlink" title="Step 3: Extract the Content to Track"></a>Step 3: Extract the Content to Track</h3><p>Comparing raw HTML is unreliable because pages contain many elements that change frequently.</p><p>To reduce noise, the script extracts readable page text and removes unnecessary elements.</p><p>Code example on ScraperHub - <a 
href="https://github.com/ScraperHub/website-change-monitoring/blob/main/extract.py">extract.py</a></p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">def</span> <span class="title function_">extract_monitorable_text</span>(<span class="params">html: <span class="built_in">str</span></span>) -&gt; <span class="built_in">str</span>:</span><br><span class="line">    soup = BeautifulSoup(html, <span class="string">&quot;html.parser&quot;</span>)</span><br><span class="line">    <span class="keyword">for</span> tag <span class="keyword">in</span> soup([<span class="string">&quot;script&quot;</span>, <span class="string">&quot;style&quot;</span>, <span class="string">&quot;nav&quot;</span>, <span class="string">&quot;footer&quot;</span>]):</span><br><span class="line">        tag.decompose()</span><br><span class="line">    text = soup.get_text(separator=<span class="string">&quot; &quot;</span>, strip=<span class="literal">True</span>)</span><br><span class="line">    <span class="keyword">return</span> <span class="string">&quot; &quot;</span>.join(text.split())</span><br></pre></td></tr></table></figure><p>This function performs several steps:</p><p>• Removes scripts and styles<br>• Removes navigation and footer elements<br>• Extracts readable text<br>• Normalizes whitespace</p><p>The result is a consistent text representation of the page content.</p><h3 id="Step-4-Generate-a-Content-Fingerprint"><a href="#Step-4-Generate-a-Content-Fingerprint" class="headerlink" title="Step 4: Generate a Content Fingerprint"></a>Step 4: Generate a Content Fingerprint</h3><p>Instead of storing entire page snapshots, the tool generates a fingerprint using a cryptographic hash.</p><p>A hash converts text into a 
fixed-length string. If the content changes, the hash changes as well.</p><p>Example (<a href="https://github.com/ScraperHub/website-change-monitoring/blob/main/fingerprint.py">fingerprint.py</a>):</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">def</span> <span class="title function_">content_fingerprint</span>(<span class="params">text: <span class="built_in">str</span></span>) -&gt; <span class="built_in">str</span>:</span><br><span class="line">    <span class="keyword">return</span> hashlib.sha256(text.encode(<span class="string">&quot;utf-8&quot;</span>)).hexdigest()</span><br></pre></td></tr></table></figure><p>This creates a <strong>SHA-256 fingerprint</strong> of the cleaned text.</p><p>Benefits of using hashes:</p><p>• Fast comparison<br>• Minimal storage requirements<br>• Reliable detection of small changes</p><p>Even a small change to the text will produce a different hash.</p><h3 id="Step-5-Store-Previous-Snapshots"><a href="#Step-5-Store-Previous-Snapshots" class="headerlink" title="Step 5: Store Previous Snapshots"></a>Step 5: Store Previous Snapshots</h3><p>To detect updates, the tool must remember the fingerprints from previous runs.</p><p>This will store two snapshot files:</p><ul><li><p><strong>snapshots.json -</strong> Stores URL → fingerprint mappings.</p></li><li><p><strong>snapshots_text.json -</strong> Stores the normalized text for each page so differences can be shown when content changes.</p></li></ul><p>Example (<a href="https://github.com/ScraperHub/website-change-monitoring/blob/main/storage.py">storage.py</a>):</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span 
class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">def</span> <span class="title function_">load_snapshots</span>(<span class="params">path: <span class="built_in">str</span> | Path</span>) -&gt; <span class="built_in">dict</span>[<span class="built_in">str</span>, <span class="built_in">str</span>]:</span><br><span class="line">    p = Path(path)</span><br><span class="line">    <span class="keyword">if</span> <span class="keyword">not</span> p.exists():</span><br><span class="line">        <span class="keyword">return</span> &#123;&#125;</span><br><span class="line">    <span class="keyword">with</span> <span class="built_in">open</span>(p, encoding=<span class="string">&quot;utf-8&quot;</span>) <span class="keyword">as</span> f:</span><br><span class="line">        <span class="keyword">return</span> json.load(f)</span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">save_snapshots</span>(<span class="params">snapshots: <span class="built_in">dict</span>[<span class="built_in">str</span>, <span class="built_in">str</span>], path: <span class="built_in">str</span> | Path</span>) -&gt; <span class="literal">None</span>:</span><br><span class="line">    <span class="keyword">with</span> <span class="built_in">open</span>(path, <span class="string">&quot;w&quot;</span>, encoding=<span class="string">&quot;utf-8&quot;</span>) <span class="keyword">as</span> f:</span><br><span class="line">        json.dump(snapshots, f, indent=<span class="number">2</span>)</span><br></pre></td></tr></table></figure><p>When the monitor runs again, it loads the stored fingerprints and compares them with the newly generated ones.</p><h3 id="Step-6-Compare-Current-vs-Previous-Version"><a href="#Step-6-Compare-Current-vs-Previous-Version" class="headerlink" title="Step 6: Compare Current vs Previous Version"></a>Step 6: Compare Current vs 
Previous Version</h3><p>Once the current fingerprint is generated, the script compares it with the stored fingerprint.</p><p>Example (<a href="https://github.com/ScraperHub/website-change-monitoring/blob/main/monitor.py">monitor.py</a>):</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">def</span> <span class="title function_">check_for_change</span>(<span class="params">url: <span class="built_in">str</span>, current_hash: <span class="built_in">str</span>, snapshots: <span class="built_in">dict</span>[<span class="built_in">str</span>, <span class="built_in">str</span>]</span>) -&gt; <span class="built_in">bool</span>:</span><br><span class="line">    previous = snapshots.get(url)</span><br><span class="line">    <span class="keyword">if</span> previous <span class="keyword">is</span> <span class="literal">None</span>:</span><br><span class="line">        <span class="keyword">return</span> <span class="literal">True</span></span><br><span class="line">    <span class="keyword">return</span> previous != current_hash</span><br></pre></td></tr></table></figure><p>If the fingerprints are different, the script reports a change.</p><p>Possible results:</p><p><strong>Changed</strong><br><strong>No change</strong></p><p>The first time a URL is checked, the script always reports <strong>Changed</strong> because no previous snapshot exists yet. The current fingerprint and page text are then stored for future comparisons.</p><p>When the page content changes, the tool also generates a <strong>unified diff</strong> showing what changed. 
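</p><p>A minimal sketch of how such a diff could be produced, assuming the stored and current texts are the whitespace-normalized strings from Step 3. The <code>text_diff</code> helper and its sentence-splitting step are illustrative assumptions, not code taken from the repository; only <code>difflib.unified_diff</code> is a standard-library call:</p><figure class="highlight python"><table><tr><td class="code"><pre><span class="line">import difflib</span><br><span class="line"></span><br><span class="line"># Hypothetical helper for illustration; not taken from the repository.</span><br><span class="line">def text_diff(previous: str, current: str) -&gt; str:</span><br><span class="line">    # Assumption: each text is a single normalized line, so break it after</span><br><span class="line">    # sentence-ending periods to give difflib line-sized units to compare.</span><br><span class="line">    prev_lines = previous.replace(&quot;. &quot;, &quot;.\n&quot;).splitlines()</span><br><span class="line">    curr_lines = current.replace(&quot;. &quot;, &quot;.\n&quot;).splitlines()</span><br><span class="line">    # lineterm=&quot;&quot; keeps the generated lines free of trailing newlines,</span><br><span class="line">    # so joining them with &quot;\n&quot; does not double the line breaks.</span><br><span class="line">    diff = difflib.unified_diff(prev_lines, curr_lines, fromfile=&quot;previous&quot;, tofile=&quot;current&quot;, lineterm=&quot;&quot;)</span><br><span class="line">    return &quot;\n&quot;.join(diff)</span><br></pre></td></tr></table></figure><p>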
Example output might look like this:</p><figure class="highlight diff"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">--- previous</span></span><br><span class="line"><span class="comment">+++ current</span></span><br><span class="line"><span class="deletion">- Old sentence</span></span><br><span class="line"><span class="addition">+ New sentence</span></span><br></pre></td></tr></table></figure><p>This diff is generated using Python’s <code>difflib</code> module and helps identify exactly what changed between page versions.</p><h3 id="Step-7-Save-Updated-Snapshot"><a href="#Step-7-Save-Updated-Snapshot" class="headerlink" title="Step 7: Save Updated Snapshot"></a>Step 7: Save Updated Snapshot</h3><p>After checking for changes, the script updates the stored snapshot so future runs can detect new updates.</p><p>In <a href="https://github.com/ScraperHub/website-change-monitoring/blob/main/monitor.py">monitor.py</a>, the script stores both the fingerprint and the extracted text.</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">snapshots[url] = fingerprint</span><br><span class="line">snapshot_texts[url] = text</span><br><span class="line">save_snapshots(snapshots, path)</span><br><span class="line">save_snapshot_texts(snapshot_texts, path)</span><br></pre></td></tr></table></figure><p>Saving both values allows the tool to detect future changes and generate readable diffs.</p><h3 id="Step-8-Run-the-Monitor-on-a-Schedule"><a href="#Step-8-Run-the-Monitor-on-a-Schedule" class="headerlink" title="Step 8: Run the Monitor on a Schedule"></a>Step 8: Run the Monitor on a Schedule</h3><p>Monitoring 
tools are most useful when they run automatically.</p><p>Common scheduling options include:</p><ul><li>Cron jobs on Linux or macOS</li><li>Windows Task Scheduler</li><li>Cloud-based job schedulers</li></ul><p>This tool also supports built-in interval scheduling.</p><p>Example CLI configuration in <a href="https://github.com/ScraperHub/website-change-monitoring/blob/main/main.py">main.py</a>:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line">parser.add_argument(<span class="string">&quot;--interval&quot;</span>, <span class="built_in">type</span>=<span class="built_in">float</span>, metavar=<span class="string">&quot;SECONDS&quot;</span>,</span><br><span class="line">    <span class="built_in">help</span>=<span class="string">&quot;Run continuously: re-check all URLs every SECONDS (e.g. 3600 for hourly). 
Ctrl+C to stop.&quot;</span>)</span><br><span class="line"><span class="comment"># ...</span></span><br><span class="line"><span class="keyword">while</span> <span class="literal">True</span>:</span><br><span class="line">    results = run_once(args.url, args.snapshots, args.json)</span><br><span class="line">    <span class="keyword">if</span> args.interval <span class="keyword">is</span> <span class="literal">None</span>:</span><br><span class="line">        <span class="keyword">break</span></span><br><span class="line">    time.sleep(args.interval)</span><br></pre></td></tr></table></figure><p>Example usage:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">python main.py https://example.com --interval 3600</span><br></pre></td></tr></table></figure><h2 id="Full-Working-Script"><a href="#Full-Working-Script" class="headerlink" title="Full Working Script"></a>Full Working Script</h2><p>The complete implementation combines all components into a modular monitoring and tracking tool.</p><p><a href="https://github.com/ScraperHub/website-change-monitoring">ScraperHub Repository</a> layout:</p><table><thead><tr><th>File</th><th>Role</th></tr></thead><tbody><tr><td><a href="https://github.com/ScraperHub/website-change-monitoring/blob/main/fetch.py">fetch.py</a></td><td>Fetch HTML using Crawlbase</td></tr><tr><td><a href="https://github.com/ScraperHub/website-change-monitoring/blob/main/extract.py">extract.py</a></td><td>Clean HTML and normalize text</td></tr><tr><td><a href="https://github.com/ScraperHub/website-change-monitoring/blob/main/fingerprint.py">fingerprint.py</a></td><td>Generate SHA-256 fingerprint</td></tr><tr><td><a href="https://github.com/ScraperHub/website-change-monitoring/blob/main/storage.py">storage.py</a></td><td>Load and store snapshot data</td></tr><tr><td><a 
href="https://github.com/ScraperHub/website-change-monitoring/blob/main/monitor.py">monitor.py</a></td><td>Compare snapshots and detect changes</td></tr><tr><td><a href="https://github.com/ScraperHub/website-change-monitoring/blob/main/main.py">main.py</a></td><td>CLI entry point and scheduler</td></tr></tbody></table><h3 id="How-to-Run-the-Script"><a href="#How-to-Run-the-Script" class="headerlink" title="How to Run the Script"></a>How to Run the Script</h3><p>Set your Crawlbase token first.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="built_in">export</span> CRAWLBASE_TOKEN=<span class="string">&quot;your_token&quot;</span></span><br></pre></td></tr></table></figure><p>Then run the script.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">python main.py https://targeturl.com</span><br></pre></td></tr></table></figure><p>To monitor multiple pages:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">python main.py https://targeturl1.com https://targeturl2.com ...</span><br></pre></td></tr></table></figure><p>The first run always reports <strong>Changed</strong>, since no snapshot exists yet.</p><h2 id="Error-Handling-Strategies"><a href="#Error-Handling-Strategies" class="headerlink" title="Error Handling Strategies"></a>Error Handling Strategies</h2><p>A production Python website monitoring script needs to handle three common failure modes gracefully.</p><ul><li><strong>Network timeouts:</strong> The <code>requests.get(timeout=30)</code> call raises <code>requests.exceptions.Timeout</code> if the Crawlbase API does not respond within 30 seconds. 
Wrap fetch calls in a try&#x2F;except and implement exponential backoff for retries:</li></ul><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">import</span> time</span><br><span class="line"><span class="keyword">import</span> requests</span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">fetch_with_retry</span>(<span class="params">url: <span class="built_in">str</span>, token: <span class="built_in">str</span>, retries: <span class="built_in">int</span> = <span class="number">3</span>, backoff: <span class="built_in">float</span> = <span class="number">2.0</span></span>) -&gt; <span class="built_in">str</span>:</span><br><span class="line">    <span class="keyword">for</span> attempt <span class="keyword">in</span> <span class="built_in">range</span>(retries):</span><br><span class="line">        <span class="keyword">try</span>:</span><br><span class="line">            <span class="keyword">return</span> fetch_page(url, token)</span><br><span class="line">        <span class="keyword">except</span> requests.exceptions.Timeout:</span><br><span class="line">            <span class="keyword">if</span> attempt &lt; retries - <span class="number">1</span>:</span><br><span class="line">                <span class="comment"># Wait backoff ** attempt seconds (1s, then 2s, ...) before the next try</span></span><br><span class="line">                time.sleep(backoff ** attempt)</span><br><span class="line">            <span class="keyword">else</span>:</span><br><span class="line">                <span class="keyword">raise</span></span><br></pre></td></tr></table></figure><ul><li><strong>HTTP errors:</strong> <code>response.raise_for_status()</code> surfaces 4xx&#x2F;5xx responses as exceptions. 
Log the status code and URL, then skip the affected URL rather than halting the entire run.</li><li><strong>Malformed HTML:</strong> BeautifulSoup handles most broken HTML gracefully, but extremely malformed pages can produce empty text. Add a check after extraction: if <code>extract_monitorable_text()</code> returns an empty string, skip the fingerprint comparison and log a warning rather than recording a spurious change.</li></ul><h2 id="Scaling-the-Tool-for-Multiple-URLs"><a href="#Scaling-the-Tool-for-Multiple-URLs" class="headerlink" title="Scaling the Tool for Multiple URLs"></a>Scaling the Tool for Multiple URLs</h2><p>The tutorial focuses on a minimal implementation, but the system can be extended for larger monitoring workloads.</p><p>Possible improvements include:</p><ul><li>Tracking many pages simultaneously</li><li>Parallel request processing</li><li>Using databases instead of JSON storage</li><li>Adding structured logging and retries</li></ul><p>These changes make the system more robust for production environments.</p><h2 id="Limitations-and-Best-Practices"><a href="#Limitations-and-Best-Practices" class="headerlink" title="Limitations and Best Practices"></a>Limitations and Best Practices</h2><p>A simple change tracker works well for many pages, but real websites can introduce a few complications.</p><h3 id="Dynamic-content"><a href="#Dynamic-content" class="headerlink" title="Dynamic content"></a>Dynamic content</h3><p>Some sites load content with JavaScript after the initial page request. If the part of the page you want to track is generated this way, a normal Crawlbase request may not return the full content. 
In that case, switch to the <a href="https://crawlbase.com/docs/crawling-api/headless-browsers/">Crawlbase JavaScript token</a> so the page is rendered before the HTML is returned.</p><h3 id="Authentication"><a href="#Authentication" class="headerlink" title="Authentication"></a>Authentication</h3><p>For pages that require a login, the request must include valid session cookies.</p><p><strong>Fix:</strong> Pass authenticated cookies via the Crawlbase <a href="https://crawlbase.com/docs/crawling-api/parameters/#set-cookies">cookies</a> parameter so the crawler accesses the logged-in version of the page.</p><h3 id="Rate-limits"><a href="#Rate-limits" class="headerlink" title="Rate limits"></a>Rate limits</h3><ul><li>Default limit: <strong>20 requests per second</strong></li><li>For most monitoring workloads, this is sufficient</li><li>Contact <a href="https://crawlbase.com/dashboard/support">Crawlbase support</a> to request a higher limit for large-scale jobs</li></ul><h3 id="Monitoring-intervals"><a href="#Monitoring-intervals" class="headerlink" title="Monitoring intervals"></a>Monitoring intervals</h3><p>Choose check frequency based on how often the page actually changes:</p><ul><li><strong>News sites&#x2F;dashboards:</strong> every 15–60 minutes</li><li><strong>Product listings&#x2F;pricing:</strong> every 1–6 hours</li><li><strong>Policy pages&#x2F;documentation:</strong> daily or weekly</li></ul><p>Running checks too frequently adds request costs without improving detection accuracy.</p><h2 id="What’s-Next"><a href="#What’s-Next" class="headerlink" title="What’s Next"></a>What’s Next</h2><p>With the script from this guide, you already have a working website change tracker built with Python and Crawlbase. From here, you can extend it depending on your needs. 
For example, you could add alert notifications, store results in a database, or monitor a larger list of URLs in parallel.</p><p>If you want to try it yourself, create a <a href="https://crawlbase.com/signup?signup=blog">Crawlbase account and use the 1,000 free requests</a> to test the tracker and start monitoring pages right away.</p><h2 id="Frequently-Asked-Questions"><a href="#Frequently-Asked-Questions" class="headerlink" title="Frequently Asked Questions"></a>Frequently Asked Questions</h2><h3 id="Can-this-monitor-multiple-pages-at-once"><a href="#Can-this-monitor-multiple-pages-at-once" class="headerlink" title="Can this monitor multiple pages at once?"></a>Can this monitor multiple pages at once?</h3><p>Yes. Pass multiple URLs to the CLI: <code>python main.py https://site1.com https://site2.com</code>. The script processes them sequentially; for faster runs across large URL lists, extend it with <code>ThreadPoolExecutor</code> to process URLs in parallel.</p><h3 id="How-often-should-checks-run"><a href="#How-often-should-checks-run" class="headerlink" title="How often should checks run?"></a>How often should checks run?</h3><p>It depends on how frequently the content changes. Hourly is a reasonable default for most monitoring use cases. High-frequency pages (live scores, breaking news) may warrant checks every 10–15 minutes. Static documentation pages are fine with daily checks.</p><h3 id="Does-it-work-on-JavaScript-heavy-websites"><a href="#Does-it-work-on-JavaScript-heavy-websites" class="headerlink" title="Does it work on JavaScript-heavy websites?"></a>Does it work on JavaScript-heavy websites?</h3><p>Yes, with one configuration change. Use the Crawlbase JavaScript token instead of the standard token. 
This renders the full page in a headless browser before returning HTML, ensuring dynamic content is captured.</p><h3 id="Can-it-send-alerts-when-something-changes"><a href="#Can-it-send-alerts-when-something-changes" class="headerlink" title="Can it send alerts when something changes?"></a>Can it send alerts when something changes?</h3><p>The core script outputs change results to stdout and optionally to a JSON file. Integrating alerts requires a small extension: call an email API (e.g., SendGrid), post to a Slack webhook, or trigger any HTTP endpoint when <code>check_for_change()</code> returns <code>True</code>.</p><h3 id="What’s-the-best-storage-option-for-tracking-hundreds-of-URLs"><a href="#What’s-the-best-storage-option-for-tracking-hundreds-of-URLs" class="headerlink" title="What’s the best storage option for tracking hundreds of URLs?"></a>What’s the best storage option for tracking hundreds of URLs?</h3><p>Replace the default JSON files with SQLite using the <code>sqlite3</code> standard library module. It handles concurrent reads, scales to large URL lists, and keeps all state in a single portable file. The Scaling section above lists database storage among the possible improvements.</p>]]></content>
    
    
    <summary type="html">&lt;p&gt;&lt;strong&gt;In one sentence:&lt;/strong&gt; This tutorial shows you how to build a Python website monitoring script that fetches pages via Crawlbase, generates SHA-256 content fingerprints, and alerts you when anything changes; no proxy infrastructure required.&lt;/p&gt;
&lt;p&gt;To build a website change tracker, the easiest approach is to compare the current version of a page with a previous one. A script fetches the page, extracts the relevant text, and generates a fingerprint from that content. The next time it runs, it performs the same steps again and checks whether the fingerprint still matches. If it does not, something on the page changed.&lt;/p&gt;</summary>
    
    
    
    <category term="crawling and scraping learning" scheme="https://crawlbase.com/blog/categories/crawling-and-scraping-learning/"/>
    
    
    <category term="Python website change tracker" scheme="https://crawlbase.com/blog/tags/Python-website-change-tracker/"/>
    
    <category term="how to track website changes with Python" scheme="https://crawlbase.com/blog/tags/how-to-track-website-changes-with-Python/"/>
    
    <category term="monitor website changes Python" scheme="https://crawlbase.com/blog/tags/monitor-website-changes-Python/"/>
    
    <category term="website change tracking Python" scheme="https://crawlbase.com/blog/tags/website-change-tracking-Python/"/>
    
    <category term="web page change detection Python" scheme="https://crawlbase.com/blog/tags/web-page-change-detection-Python/"/>
    
  </entry>
  
  <entry>
    <title>How to Build a Scalable Web Data Pipeline With Crawlbase</title>
    <link href="https://crawlbase.com/blog/how-to-build-scalable-web-data-pipeline/"/>
    <id>https://crawlbase.com/blog/how-to-build-scalable-web-data-pipeline/</id>
    <published>2026-03-06T20:13:39.000Z</published>
    <updated>2026-04-24T11:53:23.227Z</updated>
    
    <content type="html"><![CDATA[<p>Building a scalable web data pipeline with Crawlbase involves using the <a href="https://crawlbase.com/crawling-api-avoid-captchas-blocks">Crawling API</a> for real-time page retrieval and the <a href="https://crawlbase.com/anonymous-crawler-asynchronous-scraping">Enterprise Crawler</a> for automated large-scale collection, then feeding the results into an ETL system for parsing, transformation, and storage. This removes the need to manage proxies, IP rotation, or JavaScript rendering infrastructure internally, so your pipeline can keep collecting web data reliably even as target sites change or deploy anti-bot defenses.</p><span id="more"></span><p>In this setup, Crawlbase serves as the ingestion layer at the front of the pipeline. It deals with blocked requests, JavaScript-heavy pages, and changing site behavior, while your internal systems handle transformation, validation, and analytics.</p><p>This guide walks through how to assemble a production-ready pipeline using Crawlbase for acquisition and standard ETL tools for downstream processing, whether you are monitoring a handful of pages or continuously ingesting data at scale.</p><div class="callout-banner">  <div class="banner-header">    <img src="/blog/images/flashlight-icon-blue.png" srcset="/blog/images/flashlight-icon-blue.png 1x, /blog/images/flashlight-icon-blue@2x.png 2x" alt="Flashlight Icon"/>    <h2 class="banner-header-label">Scale Without the Speed Wobbles</h2>  </div>  <p class="banner-body">In recent benchmarks, Crawlbase maintained consistent response times even as request volume quintupled. 
Whether you're running 2 or 10 req/s, we provide the steady performance your data pipeline needs.</p>  <div class="banner-footer">    <a href="https://crawlbase.com/signup?signup=blog-callout-cta" title="Build a Scalable Scraper">Build a Scalable Scraper</a>    <img src="/blog/images/arrow-right-double-green.png" srcset="/blog/images/arrow-right-double-green.png 1x, /blog/images/arrow-right-double-green@2x.png 2x" alt="Arrow right double Icon"/>  </div></div><h2 id="Why-Web-Data-Acquisition-Breaks-Pipelines"><a href="#Why-Web-Data-Acquisition-Breaks-Pipelines" class="headerlink" title="Why Web Data Acquisition Breaks Pipelines"></a>Why Web Data Acquisition Breaks Pipelines</h2><p>When a data pipeline starts producing gaps or stale numbers, the root cause is often the collection layer, not the analytics stack. If the input is unreliable, no amount of downstream processing can fix it.</p><p>Typical pain points look like this:</p><ul><li>A scraper that worked yesterday stops working after a site redesign.</li><li>Requests start getting throttled or blocked, sometimes with a CAPTCHA.</li><li>When traffic patterns look automated, IP addresses lose reputation over time, something providers like <a href="https://www.cloudflare.com/">Cloudflare</a> actively monitor.</li><li>Pages load fine in a browser but return almost nothing to a basic HTTP request because the content is rendered with JavaScript.</li><li>Jobs succeed technically but store empty or partial data, which is harder to detect than a hard failure.</li></ul><p>The core problem is that external websites are not stable dependencies. They evolve constantly. A small layout tweak, a new experiment, or a backend optimization can change how content is delivered. 
Large platforms such as <a href="https://crawlbase.com/google-serp-scraper">Google</a> and <a href="https://crawlbase.com/amazon-scraper">Amazon</a> push changes frequently, and those changes rarely consider third-party data extraction.</p><p>As the number of target sites increases, so does the maintenance. Each source has its own quirks, failure modes, and update cycle. What starts as a straightforward scraping task can quietly turn into ongoing operational work.</p><p>For pipelines that rely on web data, the safest approach is to treat acquisition as infrastructure that needs to tolerate change, not as a one-off script that will keep working indefinitely.</p><h2 id="Where-Crawlbase-Fits-in-a-Modern-Data-Stack"><a href="#Where-Crawlbase-Fits-in-a-Modern-Data-Stack" class="headerlink" title="Where Crawlbase Fits in a Modern Data Stack"></a>Where Crawlbase Fits in a Modern Data Stack</h2><p>Crawlbase acts as the web data ingestion layer at the very beginning of the pipeline. It retrieves pages reliably while handling the complexities that typically break scrapers.</p><p>Crawlbase manages:</p><ul><li>Page retrieval across diverse websites</li><li>JavaScript rendering for dynamic content</li><li>IP rotation and request routing</li><li>Block mitigation and reliability at scale</li><li>Large-scale crawl execution</li></ul><p>Your data systems handle:</p><ul><li>Parsing and transformation</li><li>Data quality validation</li><li>Storage and analytics</li><li>Business logic and consumption</li></ul><h3 id="A-typical-data-pipeline-architecture-looks-like-this"><a href="#A-typical-data-pipeline-architecture-looks-like-this" class="headerlink" title="A typical data pipeline architecture looks like this:"></a>A typical data pipeline architecture looks like this:</h3><p>Web → Crawlbase → ETL → Data Warehouse → BI &#x2F; ML Systems</p><img src="/blog/how-to-build-scalable-web-data-pipeline/crawlbase-pipeline-architecture.jpg" class="" title="Web Data Pipeline With Crawlbase 
Architecture " alt="Web Data Pipeline With Crawlbase Architecture"><p><em>Crawlbase sits between the web and your ETL layer. The <a href="https://crawlbase.com/docs/crawling-api/">Crawling API</a> handles on-demand extraction; the Enterprise Crawler handles batch collection and page discovery. Both feed into your pipeline, which loads clean data into warehouses, BI, and ML systems.</em></p><p>This separation of concerns is critical. It allows data engineers to focus on processing and modeling rather than network-level scraping challenges.</p><h2 id="What-Are-the-Two-Ways-to-Extract-Web-Data-With-Crawlbase"><a href="#What-Are-the-Two-Ways-to-Extract-Web-Data-With-Crawlbase" class="headerlink" title="What Are the Two Ways to Extract Web Data With Crawlbase?"></a>What Are the Two Ways to Extract Web Data With Crawlbase?</h2><p>Different workloads require different extraction approaches. Crawlbase provides two complementary tools designed for distinct pipeline patterns:</p><ol><li>Crawling API for real-time and on-demand data extraction</li><li>Enterprise Crawler for automated large-scale crawling</li></ol><h3 id="Crawling-API-Real-Time-On-Demand-Extraction"><a href="#Crawling-API-Real-Time-On-Demand-Extraction" class="headerlink" title="Crawling API: Real-Time, On-Demand Extraction"></a>Crawling API: Real-Time, On-Demand Extraction</h3><p>The <a href="https://crawlbase.com/docs/crawling-api/">Crawling API</a> retrieves specific pages whenever your system requests them. 
It is designed for precision, low latency, and integration into backend services.</p><p>You send a simple HTTP GET request and receive the page response a few seconds later.</p><p>Example request format:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">curl <span class="string">&#x27;https://api.crawlbase.com/?token=USER_TOKEN&amp;url=encodedTargetURL&#x27;</span></span><br></pre></td></tr></table></figure><p>This basic request can be implemented in virtually any programming language, making it easy to embed into existing applications, services, or pipelines. Crawlbase also provides official <a href="https://crawlbase.com/dashboard/api/libraries">libraries and SDKs</a> that simplify authentication, request handling, and error management, allowing faster integration without building custom HTTP logic.</p><h4 id="Best-suited-for"><a href="#Best-suited-for" class="headerlink" title="Best suited for:"></a>Best suited for:</h4><ul><li>Microservices and backend applications</li><li>Scheduled monitoring jobs</li><li>Event-driven workflows</li><li>Known lists of URLs</li><li>Real-time data requirements</li></ul><h4 id="Typical-Crawling-API-flow"><a href="#Typical-Crawling-API-flow" class="headerlink" title="Typical Crawling API flow:"></a>Typical Crawling API flow:</h4><p>Trigger → API Request → Parse Response → Store Data</p><p>Example scenarios:</p><ul><li>Checking product prices on demand</li><li>Enriching internal records with external data</li><li>Monitoring competitor pages at regular intervals</li><li>Fetching documents or reports when events occur</li></ul><p>Because the API returns the page within the same request, it fits naturally into synchronous workflows and services that need fresh data.</p><h3 id="Crawlbase-Enterprise-Crawler-Automated-Large-Scale-Crawling"><a href="#Crawlbase-Enterprise-Crawler-Automated-Large-Scale-Crawling" class="headerlink" title="Crawlbase Enterprise 
Crawler: Automated Large-Scale Crawling"></a>Crawlbase Enterprise Crawler: Automated Large-Scale Crawling</h3><p>The <a href="https://crawlbase.com/docs/crawler/">Enterprise Crawler</a> is designed for continuous, site-wide collection. Instead of requesting individual pages, you define crawl rules and schedules. The system discovers pages, executes crawls, and stores results for later retrieval.</p><p>Crawler jobs can be initiated via the API but run as a managed asynchronous crawl system. To enable asynchronous crawling, add two parameters to the request: <code>&amp;callback=true</code> and <code>&amp;crawler=YourCrawlerName</code>.</p><p>Example request:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">curl <span class="string">&#x27;https://api.crawlbase.com/?token=USER_TOKEN&amp;url=encodedTargetURL&amp;callback=true&amp;crawler=YourCrawlerName&#x27;</span></span><br></pre></td></tr></table></figure><p>Instead of returning the page content immediately, the API responds with a Request ID (RID), for example:</p><figure class="highlight json"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="punctuation">&#123;</span> <span class="attr">&quot;rid&quot;</span><span class="punctuation">:</span> <span class="string">&quot;1e92e8bf4618772871c14d4&quot;</span> <span class="punctuation">&#125;</span></span><br></pre></td></tr></table></figure><p>This indicates that the request has been accepted and placed in the processing queue.</p><p>You can retrieve the crawl results in two ways:</p><ul><li>Send results to your own webhook endpoint for full automation</li><li>Use <a href="https://crawlbase.com/cloud-storage-for-crawling-and-scraping">Crawlbase Cloud Storage</a> as a built-in webhook alternative for a simpler setup</li></ul><p>This asynchronous model allows large volumes of 
pages to be processed without blocking your application.</p><h4 id="Best-suited-for-1"><a href="#Best-suited-for-1" class="headerlink" title="Best suited for:"></a>Best suited for:</h4><ul><li>Monitoring entire websites or categories</li><li>Recurring bulk collection</li><li>Unknown or evolving URL sets</li><li>Content discovery pipelines</li><li>Large-scale dataset generation</li></ul><h4 id="Typical-Crawlbase-Crawler-flow"><a href="#Typical-Crawlbase-Crawler-flow" class="headerlink" title="Typical Crawlbase Crawler flow:"></a>Typical Crawlbase Crawler flow:</h4><p>Configure Crawl → Scheduled Execution → Results Stored → Batch Processing</p><p>Example scenarios:</p><ul><li>Tracking millions of product pages across categories</li><li>Monitoring news or media sources for new articles</li><li>Building searchable content indexes</li><li>Generating training datasets from public web sources</li></ul><p>This approach removes the need to maintain URL inventories or discovery logic, which becomes increasingly complex at scale.</p><div class="secondary-cta-banner">  <div class="gradient-bg">    <h3 class="banner-title">Get Started with 1,000 Free Requests</h3>    <p class="banner-desc">Try our <strong class="text-underline">Crawling API</strong>  to automate your data collection — used by 70k+ dev teams</p>    <div class="banner-features">      <ul class="features-list">        <li>Handles JS heavy websites</li>        <li>Built-in proxy rotation</li>        <li>No credit card needed</li>      </ul>      <a class="banner-btn" href="/signup?signup=blog-smart-cta" title="Get Started Now!" 
onclick="gtag('event', 'smart_cta_click', { 'blog_group': 'crawling_api', 'blog_slug': 'how-to-build-scalable-web', 'cta_type': 'try_crawling_api', 'cta_position': 'top','cta_version': 'crawling_api_v2', 'page_location': 'https://crawlbase.com/blog/how-to-build-scalable-web/', 'page_title': 'How to Build a Scalable Web Data Pipeline With Crawlbase' });">Get Started Now!</a>    </div>  </div>  </div><h2 id="Integrating-Crawlbase-Into-ETL-Workflows"><a href="#Integrating-Crawlbase-Into-ETL-Workflows" class="headerlink" title="Integrating Crawlbase Into ETL Workflows"></a>Integrating Crawlbase Into ETL Workflows</h2><p>ETL stands for Extract, Transform, Load. In simple terms, you pull data from a source, reshape it into something structured, then store it in a central place like a database. Cloud providers such as <a href="https://aws.amazon.com/">Amazon (AWS)</a> and <a href="https://learn.microsoft.com/en-us/azure/architecture/data-guide/relational-data/etl">Microsoft</a> describe this process as the standard way organizations prepare data for reporting, analytics, and machine learning.</p><p>In a web data setup, Crawlbase effectively handles the extraction part by fetching the pages, while your pipeline takes care of transforming the raw content and loading the final results into storage.</p><h3 id="Using-the-Crawling-API-in-ETL-Pipelines"><a href="#Using-the-Crawling-API-in-ETL-Pipelines" class="headerlink" title="Using the Crawling API in ETL Pipelines"></a>Using the Crawling API in ETL Pipelines</h3><p>A typical integration looks like this:</p><ol><li>Request page content through the API</li><li>Parse the HTML or structured response</li><li>Extract the fields you care about</li><li>Clean and standardize the values</li><li>Save the result to your database or warehouse</li></ol><p>Teams often implement this with simple Python scripts, scheduled jobs, or workflow tools, depending on scale.</p><p>Destination systems commonly include:</p><ul><li>Relational databases 
such as <a href="https://www.postgresql.org/">PostgreSQL</a></li><li>Cloud warehouses such as <a href="https://cloud.google.com/bigquery">BigQuery</a></li><li>Search systems such as <a href="https://www.elastic.co/elasticsearch">Elasticsearch</a></li><li>Streaming platforms such as <a href="https://kafka.apache.org/">Kafka</a></li></ul><p>Because the API runs on demand, you can trigger these pipelines as often as needed, whether that is near real-time updates or periodic batch runs.</p><h3 id="Using-Crawlbase-Enterprise-Crawler-Outputs-in-Batch-Pipelines"><a href="#Using-Crawlbase-Enterprise-Crawler-Outputs-in-Batch-Pipelines" class="headerlink" title="Using Crawlbase Enterprise Crawler Outputs in Batch Pipelines"></a>Using Crawlbase Enterprise Crawler Outputs in Batch Pipelines</h3><p>Crawler outputs are normally consumed in batches. A common pattern looks like this:</p><ol><li>Set up a crawl project and schedule</li><li>Crawlbase collects pages automatically</li><li>Your pipeline fetches the completed results</li><li>New or changed records are parsed and cleaned</li><li>The processed data is written to your warehouse</li></ol><p>Batch pipelines may also use the following processing strategies:</p><ul><li>Full dataset refreshes</li><li>Incremental ingestion</li><li>Change detection</li><li>Storage for historical snapshots</li></ul><p>This approach is well suited to reporting, market monitoring, and other workloads where data freshness is measured in hours or days rather than seconds.</p><h2 id="Automation-Patterns-for-Data-Pipeline"><a href="#Automation-Patterns-for-Data-Pipeline" class="headerlink" title="Automation Patterns for Data Pipeline"></a>Automation Patterns for Data Pipeline</h2><p>Production pipelines rely heavily on automation. 
Crawlbase integrates cleanly with common orchestration approaches, which typically include:</p><ol><li><p><strong>Scheduler-based extraction:</strong> Cron jobs or cloud schedulers trigger API requests at defined intervals.</p></li><li><p><strong>Workflow orchestration:</strong> Tools like <a href="https://airflow.apache.org/">Apache Airflow</a> coordinate multi-step pipelines, handle dependencies, retry failed tasks, and provide visibility into job status.</p></li><li><p><strong>Serverless pipelines:</strong> Event triggers invoke functions that fetch, process, and store data without dedicated servers.</p></li><li><p><strong>Batch ingestion windows:</strong> Large datasets are processed during scheduled windows to optimize cost and performance.</p></li></ol><p>In all cases, Crawlbase handles the extraction layer while orchestration tools manage processing and storage.</p><h2 id="Decision-Guide-API-vs-Enterprise-Crawler"><a href="#Decision-Guide-API-vs-Enterprise-Crawler" class="headerlink" title="Decision Guide: API vs Enterprise Crawler"></a>Decision Guide: API vs Enterprise Crawler</h2><p>Choosing the right tool depends primarily on how data will be consumed.</p><p>Use the Crawling API when you need:</p><ul><li>Real-time or near real-time data</li><li>Specific known URLs</li><li>Low-latency responses</li><li>Tight integration with backend services</li><li>Fine-grained control over requests</li></ul><p>Use Crawlbase Enterprise Crawler when you need:</p><ul><li>Ongoing monitoring of large sites</li><li>Automatic discovery of new pages</li><li>Recurring bulk collection</li><li>Batch processing workflows</li><li>Reduced operational involvement</li></ul><p>Many production systems use both simultaneously. 
The API handles targeted retrieval while the Crawler maintains broad coverage.</p><h2 id="How-to-Scale-Without-Building-Scraping-Infrastructure"><a href="#How-to-Scale-Without-Building-Scraping-Infrastructure" class="headerlink" title="How to Scale Without Building Scraping Infrastructure"></a>How to Scale Without Building Scraping Infrastructure</h2><p>As data needs grow, infrastructure complexity typically grows faster. Parallelization, reliability, and storage become major concerns.</p><p>Key scaling considerations include:</p><ul><li>Managing concurrent requests</li><li>Handling failures and retries safely</li><li>Ensuring idempotent processing</li><li>Eliminating duplicate records</li><li>Optimizing storage costs</li><li>Monitoring data freshness and completeness</li></ul><p>Building these capabilities internally requires significant engineering effort. Crawlbase externalizes much of this complexity. Scaling becomes a configuration task rather than a network engineering project.</p><p>This shift allows teams to invest in analytics, modeling, and product features instead of maintaining collection systems.</p><h2 id="Next-Steps-to-Scale-Your-Data-Pipeline"><a href="#Next-Steps-to-Scale-Your-Data-Pipeline" class="headerlink" title="Next Steps to Scale Your Data Pipeline"></a>Next Steps to Scale Your Data Pipeline</h2><p>A scalable web data pipeline depends on a reliable ingestion layer. Without it, downstream systems cannot produce consistent insights regardless of how advanced they are.</p><p>Crawlbase enables teams to treat web data as a stable input rather than a fragile custom project. 
The Crawling API provides precise, on-demand extraction for real-time needs, while Crawlbase Enterprise Crawler delivers automated, large-scale coverage for continuous monitoring.</p><p>By separating data acquisition from data processing, organizations can reduce operational overhead, improve reliability, and focus on generating value from data instead of fighting infrastructure.</p><p>If you are working with web data, <a href="https://crawlbase.com/signup?signup=blog">try integrating Crawlbase</a> into your pipeline and see how much time and maintenance effort it can remove from your workflow.</p><h2 id="Frequently-Asked-Questions-FAQs"><a href="#Frequently-Asked-Questions-FAQs" class="headerlink" title="Frequently Asked Questions (FAQs)"></a>Frequently Asked Questions (FAQs)</h2><h3 id="What-is-the-difference-between-Crawling-API-and-Crawlbase-Enterprise-Crawler"><a href="#What-is-the-difference-between-Crawling-API-and-Crawlbase-Enterprise-Crawler" class="headerlink" title="What is the difference between Crawling API and Crawlbase Enterprise Crawler?"></a>What is the difference between Crawling API and Crawlbase Enterprise Crawler?</h3><p>The Crawling API retrieves specific pages on demand, making it ideal for real-time workflows. The Enterprise Crawler performs automated, large-scale crawling and discovery across entire websites on a schedule.</p><h3 id="Can-Crawlbase-integrate-with-existing-ETL-pipelines"><a href="#Can-Crawlbase-integrate-with-existing-ETL-pipelines" class="headerlink" title="Can Crawlbase integrate with existing ETL pipelines?"></a>Can Crawlbase integrate with existing ETL pipelines?</h3><p>Yes. 
Crawlbase functions as the upstream extraction layer and outputs data that standard ETL tools can process and load into storage systems.</p><h3 id="Do-I-still-need-to-manage-proxies-or-anti-bot-defenses"><a href="#Do-I-still-need-to-manage-proxies-or-anti-bot-defenses" class="headerlink" title="Do I still need to manage proxies or anti-bot defenses?"></a>Do I still need to manage proxies or anti-bot defenses?</h3><p>No. Crawlbase handles IP rotation, request routing, and mitigation techniques required to retrieve pages reliably.</p><h3 id="Is-Crawlbase-suitable-for-real-time-applications"><a href="#Is-Crawlbase-suitable-for-real-time-applications" class="headerlink" title="Is Crawlbase suitable for real-time applications?"></a>Is Crawlbase suitable for real-time applications?</h3><p>Yes. The Crawling API supports low-latency, on-demand retrieval, making it suitable for backend services and monitoring systems that require fresh data.</p>]]></content>
    
    
    <summary type="html">&lt;p&gt;Building a scalable web data pipeline with Crawlbase involves using the &lt;a href=&quot;https://crawlbase.com/crawling-api-avoid-captchas-blocks&quot;&gt;Crawling API&lt;/a&gt; for real-time page retrieval and the &lt;a href=&quot;https://crawlbase.com/anonymous-crawler-asynchronous-scraping&quot;&gt;Enterprise Crawler&lt;/a&gt; for automated large-scale collection, then feeding the results into an ETL system for parsing, transformation, and storage. This removes the need to manage proxies, IP rotation, or JavaScript rendering infrastructure internally, so your pipeline can keep collecting web data reliably even as target sites change or deploy anti-bot defenses.&lt;/p&gt;</summary>
    
    
    
    <category term="crawling and scraping learning" scheme="https://crawlbase.com/blog/categories/crawling-and-scraping-learning/"/>
    
    
    <category term="build scalable web data scraper" scheme="https://crawlbase.com/blog/tags/build-scalable-web-data-scraper/"/>
    
    <category term="etl integration" scheme="https://crawlbase.com/blog/tags/etl-integration/"/>
    
    <category term="build web data pipeline" scheme="https://crawlbase.com/blog/tags/build-web-data-pipeline/"/>
    
  </entry>
  
  <entry>
    <title>How AI Proxies Work (Complete 2026 Guide)</title>
    <link href="https://crawlbase.com/blog/how-ai-proxies-work/"/>
    <id>https://crawlbase.com/blog/how-ai-proxies-work/</id>
    <published>2026-03-03T18:37:02.000Z</published>
    <updated>2026-04-24T11:53:23.191Z</updated>
    
    <content type="html"><![CDATA[<div class="callout-banner">  <div class="banner-header">    <img src="/blog/images/flashlight-icon-blue.png" srcset="/blog/images/flashlight-icon-blue.png 1x, /blog/images/flashlight-icon-blue@2x.png 2x" alt="Flashlight Icon"/>    <h2 class="banner-header-label">Quick Answer</h2>  </div>  <p class="banner-body">An AI proxy uses machine learning and adaptive logic to manage request fingerprinting, session behavior, and IP routing, automatically adapting to target defenses in real time. Unlike traditional proxies that route traffic through a fixed IP, AI proxies respond to failure signals and continuously optimize request configurations to maintain high success rates against modern anti-bot systems.</p></div><span id="more"></span><p>If you’re considering <a href="https://crawlbase.com/smart-proxy">proxy infrastructure for scraping</a>, data collection, or large-scale automation, it’s important to understand what an AI proxy is and how it differs from traditional proxy types. 
This guide discusses the technical mechanisms, key components, and the true value of AI-powered proxy technology.</p><h2 id="Key-Takeaways"><a href="#Key-Takeaways" class="headerlink" title="Key Takeaways"></a>Key Takeaways</h2><ul><li>Traditional proxies mask IPs only; AI proxies adapt fingerprints, sessions, and routing in real time.</li><li>AI proxies use reinforcement learning and classification models to update routing strategies automatically.</li><li>Success rates on hardened targets can exceed 90% with AI proxies versus 40–60% with static residential proxies.</li><li>The AI decision layer adds 10–50ms of overhead per request, a worthwhile trade-off for complex targets.</li><li>AI proxies are most valuable at scale; standard proxies remain sufficient for low-volume, low-risk targets.</li></ul><h2 id="Why-Traditional-Proxies-Fail-Against-Modern-Targets"><a href="#Why-Traditional-Proxies-Fail-Against-Modern-Targets" class="headerlink" title="Why Traditional Proxies Fail Against Modern Targets"></a>Why Traditional Proxies Fail Against Modern Targets</h2><p>A standard proxy, whether <a href="https://crawlbase.com/blog/datacenter-vs-residential-proxies/">datacenter</a>, residential, or ISP, does one thing: mask the origin IP. It routes your traffic through a third-party IP, so the target server sees a different address than yours.</p><p>This works well for simple targets. 
It breaks down quickly in four common scenarios:</p><ul><li>Behavioral analysis: The target scores session behavior, not just IP reputation.</li><li>JavaScript rendering: Dynamic content requires JS execution before data is accessible.</li><li>Multi-signal fingerprinting: Anti-bot systems inspect HTTP headers, TLS cipher suites, HTTP&#x2F;2 frame order, and browser traits.</li><li>Pattern-based rate limiting: Dynamic rate limits trigger on session patterns rather than per-IP thresholds.</li></ul><p>Modern anti-bot platforms like <a href="https://www.cloudflare.com/">Cloudflare</a>, <a href="https://datadome.co/">DataDome</a>, and <a href="https://www.akamai.com/">Akamai</a> Bot Manager have moved well beyond IP blocklists. Relying on a rotating residential proxy pool alone no longer sustains high success rates against hardened targets.</p><h2 id="What-Makes-a-Proxy-“AI-Powered”"><a href="#What-Makes-a-Proxy-“AI-Powered”" class="headerlink" title="What Makes a Proxy “AI-Powered”?"></a>What Makes a Proxy “AI-Powered”?</h2><p>The term AI proxy refers to a system that includes smart, adaptive behavior at one or more stages of the request pipeline. This generally involves three capabilities:</p><h3 id="Adaptive-Request-Fingerprinting"><a href="#Adaptive-Request-Fingerprinting" class="headerlink" title="Adaptive Request Fingerprinting"></a>Adaptive Request Fingerprinting</h3><p>Every HTTP request carries metadata beyond the IP address. 
Anti-bot systems build fingerprint profiles from:</p><ul><li>User-Agent strings and Accept&#x2F;Accept-Language headers</li><li>TLS cipher suites and extension order: Specifically, the sequence of extensions like <code>server_name</code>, <code>status_request</code>, <code>supported_groups</code>, and <code>signature_algorithms</code> in the ClientHello message</li><li>HTTP&#x2F;2 frame settings: Including <code>SETTINGS</code> frame parameters (header table size, max concurrent streams, initial window size) and the order of pseudo-headers (<code>:method</code>, <code>:path</code>, <code>:scheme</code>, <code>:authority</code>)</li><li>JA3&#x2F;JA4 fingerprints: Hashes derived from TLS handshake parameters that uniquely identify a client configuration</li></ul><p>AI-powered proxy technology generates and manages request fingerprints that align with real browser profiles and adapts them dynamically based on target feedback. When a fingerprint configuration triggers blocks, the system learns from this and rotates to a different profile automatically.</p><h3 id="Behavioral-Session-Management"><a href="#Behavioral-Session-Management" class="headerlink" title="Behavioral Session Management"></a>Behavioral Session Management</h3><p>Human browsing behavior follows recognizable patterns: varying inter-request timing, natural navigation paths, realistic referrer chains, and persistent cookie state. 
Bot traffic is typically uniform, with constant request intervals, absent referrer headers, and no session continuity.</p><p>An AI proxy manages session behavior to mimic human patterns by controlling request cadence, maintaining cookie state, simulating realistic navigation sequences, and managing session lifecycle to avoid behavioral fingerprinting triggers.</p><h3 id="Target-Aware-Routing-and-Retry-Logic"><a href="#Target-Aware-Routing-and-Retry-Logic" class="headerlink" title="Target-Aware Routing and Retry Logic"></a>Target-Aware Routing and Retry Logic</h3><p>Not every IP in a proxy pool performs equally against every target. AI proxy systems build and continuously update a model of which IP types, locations, and configurations yield the highest success rates against specific domains.</p><ul><li>Routing logic: When a request fails or returns an unexpected response (e.g., a CAPTCHA page, a soft redirect), the system classifies the failure type, updates its routing model, and selects a different configuration for the retry.</li><li>What this prevents: Blind retries with the same configuration, the leading cause of escalating block rates on rule-based proxy managers.</li></ul><div class="callout-banner">  <div class="banner-header">    <img src="/blog/images/flashlight-icon-blue.png" srcset="/blog/images/flashlight-icon-blue.png 1x, /blog/images/flashlight-icon-blue@2x.png 2x" alt="Flashlight Icon"/>    <h2 class="banner-header-label">Stop Wrestling with Proxy Lists</h2>  </div>  <p class="banner-body">Our Smart AI Proxy uses machine learning to automatically rotate 1M+ residential and datacenter IPs, bypassing CAPTCHA and blocks for you.</p>  <div class="banner-footer">    <a href="https://crawlbase.com/signup?signup=blog-callout-cta" title="Claim 5,000 Free Credits">Claim 5,000 Free Credits</a>    <img src="/blog/images/arrow-right-double-green.png" srcset="/blog/images/arrow-right-double-green.png 1x, /blog/images/arrow-right-double-green@2x.png 2x" 
alt="Arrow right double Icon"/>  </div></div><h2 id="The-ML-Models-Behind-AI-Proxy-Decision-Making"><a href="#The-ML-Models-Behind-AI-Proxy-Decision-Making" class="headerlink" title="The ML Models Behind AI Proxy Decision-Making"></a>The ML Models Behind AI Proxy Decision-Making</h2><p>AI proxy systems typically rely on a combination of machine learning approaches:</p><ul><li>Reinforcement Learning (RL): Used for path and routing optimization. The proxy agent receives a reward signal (success&#x2F;failure&#x2F;soft block) for each request and updates its IP selection and fingerprint policies to maximize long-term success rates per target domain.</li><li>Classification models: Lightweight supervised models classify the type of failure response (hard block, CAPTCHA challenge, rate limit, soft redirect) to trigger the appropriate retry strategy.</li><li>Contextual bandits: A simplified RL approach used for fast A&#x2F;B selection between fingerprint profiles and IP types when full RL training data is insufficient for a new target.</li></ul><p>These models run continuously across all requests in the system. The more traffic a target receives, the more accurate the models become for that domain.</p><h2 id="How-an-AI-Proxy-Processes-a-Request-Step-by-Step"><a href="#How-an-AI-Proxy-Processes-a-Request-Step-by-Step" class="headerlink" title="How an AI Proxy Processes a Request (Step by Step)"></a>How an AI Proxy Processes a Request (Step by Step)</h2><p>Here is how a request flows through an AI proxy system:</p><ol><li><p><strong>Request intake and classification</strong>: The client sends a request to the proxy endpoint. 
The system classifies the target domain against its known profile: which anti-bot stack it uses, which failure patterns have been observed, and which session configuration has historically yielded the best results.</p></li><li><p><strong>Fingerprint and session configuration</strong>: Before sending the request, the proxy assigns a browser fingerprint profile and session context, setting headers, TLS configuration, HTTP&#x2F;2 frame parameters, and timing to align with expected human behavior for that target.</p></li><li><p><strong>IP selection</strong>: The routing layer selects an IP from the pool based on the target classification model, filtering by location, IP type (residential, datacenter, mobile), and performance history against that specific domain.</p></li><li><p><strong>Request execution and response analysis</strong>: The request is sent. The system analyzes the response not only for the data payload but also for signals indicating whether the request succeeded, hit a soft block, or triggered a hard block.</p></li><li><p><strong>Feedback loop</strong>: The outcome is fed back into the routing and fingerprinting models. Successful configurations are reinforced; those that triggered blocks are deprioritized or removed for that target.</p></li></ol><p>This loop runs continuously across all requests. As the system processes more requests, its routing and fingerprinting decisions for each target become more accurate.</p><h2 id="AI-Proxy-vs-Smart-Proxy-Technical-Comparison"><a href="#AI-Proxy-vs-Smart-Proxy-Technical-Comparison" class="headerlink" title="AI Proxy vs. Smart Proxy: Technical Comparison"></a>AI Proxy vs. 
Smart Proxy: Technical Comparison</h2><p>The terms AI proxy and smart proxy are often used interchangeably, but they describe meaningfully different capabilities:</p><table><thead><tr><th>Feature</th><th>Standard Proxy</th><th>Smart Proxy</th><th>AI Proxy</th></tr></thead><tbody><tr><td>IP rotation</td><td>Manual &#x2F; rule-based</td><td>Automatic</td><td>ML-optimized per target</td></tr><tr><td>Retry logic</td><td>Fixed (e.g., on 429)</td><td>Configurable rules</td><td>Failure-type classification</td></tr><tr><td>Fingerprint management</td><td>None</td><td>Static or templated</td><td>Dynamic, per-target adaptation</td></tr><tr><td>Session behavior</td><td>None</td><td>Basic cookie handling</td><td>Human-pattern simulation</td></tr><tr><td>Target learning</td><td>None</td><td>None</td><td>Continuous RL model updates</td></tr><tr><td>JavaScript rendering</td><td>No</td><td>Varies</td><td>Yes (headless browser layer)</td></tr><tr><td>Failure handling</td><td>Blind retry</td><td>Rule-triggered retry</td><td>Model-driven reconfiguration</td></tr></tbody></table><p>The core architectural difference: rule-based systems treat failures as exceptions; AI proxy systems treat failures as training data.</p><h2 id="Latency-Overhead-of-the-AI-Decision-Layer"><a href="#Latency-Overhead-of-the-AI-Decision-Layer" class="headerlink" title="Latency Overhead of the AI Decision Layer"></a>Latency Overhead of the AI Decision Layer</h2><p>A common concern with AI proxy systems is the added latency from model inference. 
In practice:</p><ul><li>The AI decision layer (fingerprint selection, IP scoring, session assignment) typically adds 10–50ms per request, primarily from routing model lookups and session state resolution.</li><li>For targets where a static proxy would retry 2–4 times due to blocks, the net latency of an AI proxy is lower despite the per-request overhead.</li><li>Warm-path caching of per-domain model outputs reduces repeated inference cost significantly at scale.</li></ul><p>For high-throughput pipelines processing thousands of requests per minute, this overhead is negligible relative to the reduction in failed-request retries.</p><h2 id="Where-AI-Proxy-Technology-Is-Most-Effective"><a href="#Where-AI-Proxy-Technology-Is-Most-Effective" class="headerlink" title="Where AI Proxy Technology Is Most Effective"></a>Where AI Proxy Technology Is Most Effective</h2><p>The performance advantage of AI proxies is most pronounced in these scenarios:</p><ul><li><strong>Hardened e-commerce and retail targets</strong>: Sites using aggressive anti-bot measures to protect pricing, inventory, or product data. Behavioral analysis is standard here, and static proxy settings often fail within hours of deployment.</li><li><strong>News and media aggregation</strong>: Frequent content updates require high-throughput scraping with fast session cycling. AI session management handles this more reliably than manual configurations.</li><li><strong>Financial and market data</strong>: Targets with strict per-session rate limits where session fingerprinting is as critical as IP diversity.</li><li><strong>Multi-region data collection</strong>: AI routing optimizes IP selection by geography automatically, important for targets that serve region-specific content or apply geo-based rate limiting.</li></ul><p>Standard proxies remain sufficient for low-volume, low-risk targets with minimal anti-bot protection. 
The ROI on AI-powered proxy infrastructure scales with target complexity and collection volume.</p><div class="secondary-cta-banner">  <div class="gradient-bg">    <h3 class="banner-title">Get a <span class="text-underline">Free Smart AI Proxy Trial</span></h3>    <p class="banner-desc">Leverage 5,000 free credits, 140M rotating proxies, and AI to bypass CAPTCHAs and avoid blocks.</p>    <div class="banner-features">      <ul class="features-list">        <li>Unlimited Bandwidth</li>        <li>Custom Geolocalization</li>        <li>100% Network Uptime</li>      </ul>      <a class="banner-btn" href="/signup?signup=blog-smart-cta" title="Get 5,000 Free Credits" onclick="gtag('event', 'smart_cta_click', { 'blog_group': 'smart_proxy', 'blog_slug': 'how-ai-proxies-work', 'cta_type': 'try_smart_proxy', 'cta_position': 'top','cta_version': 'smart_proxy_v2'});">Get 5,000 Free Credits</a>    </div>  </div>  </div><h2 id="Why-AI-Proxy-Infrastructure-Matters-at-Scale"><a href="#Why-AI-Proxy-Infrastructure-Matters-at-Scale" class="headerlink" title="Why AI Proxy Infrastructure Matters at Scale"></a>Why AI Proxy Infrastructure Matters at Scale</h2><p><a href="https://crawlbase.com/smart-proxy">AI proxies</a> work by layering adaptive intelligence across three parts of the proxy stack: request fingerprinting, session behavior management, and IP routing. 
Unlike static configurations, they react to target feedback in real time, adjusting automatically when detection patterns change, without requiring manual tuning.</p><p>For teams running <a href="https://crawlbase.com/anonymous-crawler-asynchronous-scraping">large-scale data collection</a> against modern anti-bot systems, this adaptability is the difference between stable success rates and an ongoing configuration maintenance burden.</p><p>To see how these principles are applied in a production product, the Crawlbase Smart AI Proxy implements this architecture within a managed infrastructure designed for high-volume scraping and data collection.</p><p><a href="https://crawlbase.com/signup?signup=blog">Sign up now</a> and get 5,000 free credits to test our AI Proxy.</p><h2 id="How-AI-Proxies-Work-Frequently-Asked-Questions"><a href="#How-AI-Proxies-Work-Frequently-Asked-Questions" class="headerlink" title="How AI Proxies Work - Frequently Asked Questions"></a>How AI Proxies Work - Frequently Asked Questions</h2><h3 id="What-is-an-AI-proxy-in-simple-terms"><a href="#What-is-an-AI-proxy-in-simple-terms" class="headerlink" title="What is an AI proxy in simple terms?"></a>What is an AI proxy in simple terms?</h3><p>An AI proxy is a proxy server that uses machine learning to automatically adjust how it routes requests, manages sessions, and selects IP addresses based on the target website’s response. Rather than following fixed rules, it learns what works for each target and adapts in real time.</p><h3 id="How-does-an-AI-proxy-handle-CAPTCHA-and-blocks"><a href="#How-does-an-AI-proxy-handle-CAPTCHA-and-blocks" class="headerlink" title="How does an AI proxy handle CAPTCHA and blocks?"></a>How does an AI proxy handle CAPTCHA and blocks?</h3><p>When an AI proxy encounters a CAPTCHA or block response, it classifies the failure type and feeds that signal back into its routing and fingerprinting models. 
It then retries using a different IP, fingerprint, or session configuration based on what has historically succeeded against that target — without requiring manual input.</p><h3 id="Is-an-AI-proxy-the-same-as-a-smart-proxy"><a href="#Is-an-AI-proxy-the-same-as-a-smart-proxy" class="headerlink" title="Is an AI proxy the same as a smart proxy?"></a>Is an AI proxy the same as a smart proxy?</h3><p>Not always. A smart proxy typically refers to a proxy with routing intelligence, such as automatic geo-selection or retry logic. An AI proxy specifically indicates that machine learning models, including reinforcement learning and classifiers, drive adaptive behavior across fingerprinting, session management, and routing. See the comparison table above for a full breakdown.</p><h3 id="Do-AI-proxies-work-with-JavaScript-heavy-sites"><a href="#Do-AI-proxies-work-with-JavaScript-heavy-sites" class="headerlink" title="Do AI proxies work with JavaScript-heavy sites?"></a>Do AI proxies work with JavaScript-heavy sites?</h3><p>Yes. AI proxies typically integrate with headless browser infrastructure or rendering engines to manage JavaScript execution. The AI layer adjusts request configuration and session behavior, while the rendering layer handles JS execution before data extraction.</p><h3 id="When-should-I-use-an-AI-proxy-instead-of-a-standard-residential-proxy"><a href="#When-should-I-use-an-AI-proxy-instead-of-a-standard-residential-proxy" class="headerlink" title="When should I use an AI proxy instead of a standard residential proxy?"></a>When should I use an AI proxy instead of a standard residential proxy?</h3><p>If your target uses behavioral fingerprinting, dynamic rate limiting, or a dedicated anti-bot platform like Cloudflare, DataDome, or Akamai, a standard residential proxy will likely produce declining success rates over time. 
AI proxies are the better choice when you need to maintain reliable success rates against these targets at scale.</p><h3 id="What-does-AI-proxy-integration-look-like-and-what-does-it-cost"><a href="#What-does-AI-proxy-integration-look-like-and-what-does-it-cost" class="headerlink" title="What does AI proxy integration look like, and what does it cost?"></a>What does AI proxy integration look like, and what does it cost?</h3><p>Most AI proxy providers offer both API and SDK integration. SDK integration (typically available in Python, Node.js, and Go) provides the simplest path, replacing your existing proxy URL configuration with a few lines of initialization code. API integration gives more granular control over session parameters and routing hints. Pricing is generally usage-based (per GB or per 1,000 requests), with managed infrastructure included. The cost differential versus standard residential proxies is often offset by reduced retry overhead and fewer failed requests requiring manual intervention.</p><h3 id="Is-traffic-routed-through-an-AI-proxy-secure-and-private"><a href="#Is-traffic-routed-through-an-AI-proxy-secure-and-private" class="headerlink" title="Is traffic routed through an AI proxy secure and private?"></a>Is traffic routed through an AI proxy secure and private?</h3><p>Reputable AI proxy providers encrypt traffic between the client and proxy endpoint via TLS. However, because the proxy intermediates the request, the provider can log request metadata (target domain, timestamps, IP used) for routing model training. For sensitive workloads, review the provider’s data retention and logging policies before deployment. AI-routed traffic is subject to the same legal and terms-of-service constraints as any proxy traffic. The AI layer does not change the legal profile of the requests.</p>]]></content>
    
    
    <summary type="html">&lt;div class=&quot;callout-banner&quot;&gt;
  &lt;div class=&quot;banner-header&quot;&gt;
    &lt;img src=&quot;/blog/images/flashlight-icon-blue.png&quot; srcset=&quot;/blog/images/flashlight-icon-blue.png 1x, /blog/images/flashlight-icon-blue@2x.png 2x&quot; alt=&quot;Flashlight Icon&quot;/&gt;
    &lt;h2 class=&quot;banner-header-label&quot;&gt;Quick Answer&lt;/h2&gt;
  &lt;/div&gt;
  &lt;p class=&quot;banner-body&quot;&gt;An AI proxy uses machine learning and adaptive logic to manage request fingerprinting, session behavior, and IP routing, automatically adapting to target defenses in real time. Unlike traditional proxies that route traffic through a fixed IP, AI proxies respond to failure signals and continuously optimize request configurations to maintain high success rates against modern anti-bot systems.&lt;/p&gt;
&lt;/div&gt;</summary>
    
    
    
    <category term="crawling and scraping learning" scheme="https://crawlbase.com/blog/categories/crawling-and-scraping-learning/"/>
    
    
    <category term="ai powered proxy technology" scheme="https://crawlbase.com/blog/tags/ai-powered-proxy-technology/"/>
    
    <category term="smart ai proxy" scheme="https://crawlbase.com/blog/tags/smart-ai-proxy/"/>
    
    <category term="how ai proxy works" scheme="https://crawlbase.com/blog/tags/how-ai-proxy-works/"/>
    
    <category term="smart proxy vs ai proxy" scheme="https://crawlbase.com/blog/tags/smart-proxy-vs-ai-proxy/"/>
    
  </entry>
  
  <entry>
    <title>Build a Search Engine Tool with Smart AI Proxy</title>
    <link href="https://crawlbase.com/blog/how-to-build-search-engine-tool/"/>
    <id>https://crawlbase.com/blog/how-to-build-search-engine-tool/</id>
    <published>2026-02-26T15:59:42.000Z</published>
    <updated>2026-04-24T11:53:23.227Z</updated>
    
    <content type="html"><![CDATA[<div class="callout-banner">  <div class="banner-header">    <img src="/blog/images/flashlight-icon-blue.png" srcset="/blog/images/flashlight-icon-blue.png 1x, /blog/images/flashlight-icon-blue@2x.png 2x" alt="Flashlight Icon"/>    <h2 class="banner-header-label">Quick Answer</h2>  </div>  <p class="banner-body">You can build a production-grade search engine or SERP data tool by routing all search requests through Crawlbase Smart AI Proxy, which provides rotating IPs, geo targeting, and anti-bot mitigation behind a standard proxy interface.</p></div><span id="more"></span><p>Instead of operating your own proxy fleet or headless browser infrastructure, your application generates search queries, sends them through a single endpoint, and receives usable results that can be normalized and served to users. This approach scales from prototype to high-volume workloads without collapsing due to IP bans, CAPTCHA, geo mismatches, or rate limits.</p><p><a href="https://crawlbase.com/signup?signup=blog">Smart AI Proxy</a> becomes the data collection layer of your search pipeline. Your code handles query logic and product features, while Crawlbase manages network-level reliability and access to search engines across regions. The sections below walk through the real constraints of SERP scraping and demonstrate how to implement a working tool end-to-end using this approach.</p><h2 id="Why-Is-Scraping-Search-Results-at-Scale-So-Difficult"><a href="#Why-Is-Scraping-Search-Results-at-Scale-So-Difficult" class="headerlink" title="Why Is Scraping Search Results at Scale So Difficult?"></a>Why Is Scraping Search Results at Scale So Difficult?</h2><p>Fetching a few search pages is easy. Fetching thousands every hour without getting blocked is not. 
<a href="https://developer.mozilla.org/en-US/docs/Glossary/Search_engine">Search engines</a> are optimized for human use, and automated traffic stands out quickly.</p><h3 id="1-IP-Blocking-and-Bans"><a href="#1-IP-Blocking-and-Bans" class="headerlink" title="1. IP Blocking and Bans"></a>1. IP Blocking and Bans</h3><p>If many requests originate from the same address, they start to look suspicious. Once thresholds are crossed, responses may switch to errors, empty pages, or verification prompts. A single cloud instance can work during testing and then fail once real traffic arrives.</p><h3 id="2-Geo-Restrictions-and-Localized-Results"><a href="#2-Geo-Restrictions-and-Localized-Results" class="headerlink" title="2. Geo-Restrictions and Localized Results"></a>2. Geo-Restrictions and Localized Results</h3><p>Search results are not universal. A query from London can produce different rankings and local listings than the same query from New York or Berlin. If your application depends on region-specific data, requests must appear to come from those locations.</p><h3 id="3-CAPTCHA-and-Anti-Bot-Measures"><a href="#3-CAPTCHA-and-Anti-Bot-Measures" class="headerlink" title="3. CAPTCHA and Anti-Bot Measures"></a>3. CAPTCHA and Anti-Bot Measures</h3><p>Modern search platforms rely on layered defenses. Even when a request succeeds technically, the returned page may be a challenge rather than the actual results. Handling these systems reliably requires infrastructure that adapts continuously.</p><h3 id="4-Rate-Limits-and-Throttling"><a href="#4-Rate-Limits-and-Throttling" class="headerlink" title="4. Rate Limits and Throttling"></a>4. Rate Limits and Throttling</h3><p>High-frequency traffic from identifiable sources is shaped or blocked. 
Without distribution across many routes, throughput eventually drops to zero regardless of how efficient your code is.</p><p>Building all of this internally means maintaining proxy pools, monitoring failures, rotating addresses, and reacting to changes in detection systems. For most teams, that becomes an operational burden rather than a differentiating feature.</p><h2 id="Why-Is-Smart-AI-Proxy-Rotation-the-Best-Fix-for-SERP-Scraping"><a href="#Why-Is-Smart-AI-Proxy-Rotation-the-Best-Fix-for-SERP-Scraping" class="headerlink" title="Why Is Smart AI Proxy Rotation the Best Fix for SERP Scraping?"></a>Why Is Smart AI Proxy Rotation the Best Fix for SERP Scraping?</h2><p>Crawlbase Smart AI Proxy sits between your application and the target site. You configure it like a normal proxy, send requests as usual, and receive responses as if you had connected directly. The difference is that each request is routed through infrastructure designed specifically for automated data collection.</p><p>Key characteristics:</p><p>• Requests are distributed across many IPs instead of one<br>• Traffic patterns are tuned to avoid common block triggers<br>• Location targeting can be applied when needed (Premium)<br>• No special client libraries are required</p><p>Optional behavior is controlled through the <code>CrawlbaseAPI-Parameters</code> header. 
For example, structured parsing for Google can be enabled without changing your request logic.</p><p>Connection details:</p><ul><li>HTTPS (recommended): <code>https://smartproxy.crawlbase.com:8013</code></li><li>HTTP: <code>http://smartproxy.crawlbase.com:8012</code></li><li>Authentication: Your Crawlbase token or authentication key as the proxy username.</li></ul><p><strong>Important:</strong> When routing through Smart AI Proxy, <a href="https://www.tutorialpedia.org/blog/how-are-ssl-certificates-verified/">SSL verification</a> for the destination is typically disabled because the proxy must inspect traffic to apply routing logic and response handling. In Python, this corresponds to <code>verify=False</code>.</p><div class="callout-banner">  <div class="banner-header">    <img src="/blog/images/flashlight-icon-blue.png" srcset="/blog/images/flashlight-icon-blue.png 1x, /blog/images/flashlight-icon-blue@2x.png 2x" alt="Flashlight Icon"/>    <h2 class="banner-header-label">Try our AI-powered Proxies</h2>  </div>  <p class="banner-body">Why use a standard backconnect proxy when you can use AI? Bypass blocks and scale your crawler with 1M+ rotating IPs.</p>  <div class="banner-footer">    <a href="https://crawlbase.com/signup?signup=blog-callout-cta" title="Claim 5,000 Free Credits">Claim 5,000 Free Credits</a>    <img src="/blog/images/arrow-right-double-green.png" srcset="/blog/images/arrow-right-double-green.png 1x, /blog/images/arrow-right-double-green@2x.png 2x" alt="Arrow right double Icon"/>  </div></div><h2 id="Code-Overview-What-Does-This-SERP-Tool-Actually-Do"><a href="#Code-Overview-What-Does-This-SERP-Tool-Actually-Do" class="headerlink" title="Code Overview: What Does This SERP Tool Actually Do?"></a>Code Overview: What Does This SERP Tool Actually Do?</h2><p>A SERP tool consists of multiple components, but only one communicates with external search engines. 
Smart AI Proxy sits at this boundary as the outbound data collection layer.</p><img src="/blog/how-to-build-search-engine-tool/search-engine-tool-architecture.jpg" class="" title="Illustration of Search Engine Tool Technical Architecture " alt="Illustration of Search Engine Tool Technical Architecture"><h3 id="Simplified-flow-of-the-Search-Engine-Tool-Architecture"><a href="#Simplified-flow-of-the-Search-Engine-Tool-Architecture" class="headerlink" title="Simplified flow of the Search Engine Tool Architecture:"></a>Simplified flow of the Search Engine Tool Architecture:</h3><ol><li>A user submits a query.</li><li>Your application builds the corresponding search URL.</li><li>The request is sent through Smart AI Proxy.</li><li>Results are returned from the search engine.</li><li>The data is normalized and stored or displayed.</li></ol><p>Because every outbound request goes through the proxy, the rest of your system remains insulated from blocking issues.</p><h2 id="How-to-Fetch-SERP-Data-Using-Smart-AI-Proxy"><a href="#How-to-Fetch-SERP-Data-Using-Smart-AI-Proxy" class="headerlink" title="How to Fetch SERP Data Using Smart AI Proxy"></a>How to Fetch SERP Data Using Smart AI Proxy</h2><p>A production-ready SERP tool follows this end-to-end flow:</p><ol><li><strong>Accept query -</strong> Your app receives a user search string.</li><li><strong>Query normalization -</strong> Convert input into a valid search engine URL.</li><li><strong>SERP retrieval -</strong> Send the request through Smart AI Proxy.</li><li><strong>Structured extraction -</strong> Receive machine-readable data (JSON).</li><li><strong>Downstream -</strong> Store, rank, filter, or display the results.</li></ol><p>A functional search engine tool needs a repeatable process that turns text input into structured results. In practice, the fragile part is not parsing or storage but maintaining access to the source sites as volume grows. 
Smart AI Proxy removes that instability, so the pipeline behaves consistently.</p><p>You can implement this workflow in any programming language that can send standard HTTP requests. For this guide, the examples use <a href="https://www.python.org/">Python</a> because it is widely available and easy to run locally, but the same approach works with Node.js, Go, Java, C#, and other languages.</p><p>Once the proxy layer is in place, increasing traffic should mostly affect cost and processing capacity rather than reliability.</p><h3 id="Step-1-Accept-and-Normalize-the-User-Query"><a href="#Step-1-Accept-and-Normalize-the-User-Query" class="headerlink" title="Step 1: Accept and Normalize the User Query"></a>Step 1: Accept and Normalize the User Query</h3><p>Search engines expect properly encoded parameters. Raw input such as:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">best coffee shops Paris</span><br></pre></td></tr></table></figure><p>must be converted into a valid URL:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">https://www.google.com/search?q=best+coffee+shops+Paris</span><br></pre></td></tr></table></figure><p>Encoding ensures special characters, spaces, and non-ASCII text do not break the request. In Python, this is handled with <code>urllib.parse.quote_plus</code>.</p><h3 id="Step-2-Construct-the-Target-SERP-URL"><a href="#Step-2-Construct-the-Target-SERP-URL" class="headerlink" title="Step 2: Construct the Target SERP URL"></a>Step 2: Construct the Target SERP URL</h3><p>The URL should be generated programmatically. 
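</p><p>A minimal sketch of such a builder, assuming Python's standard <code>urllib.parse.urlencode</code>, which applies <code>quote_plus</code> to each value by default. The function name <code>build_google_serp_url</code> and its <code>extra_params</code> argument are hypothetical and not part of any Crawlbase API; extra keys are simply passed through to the query string:</p>
<figure class="highlight python"><table><tr><td class="code"><pre>
```python
from urllib.parse import urlencode

GOOGLE_SEARCH_BASE = "https://www.google.com/search"


def build_google_serp_url(query, extra_params=None):
    """Build a Google search URL from raw user input.

    urlencode() percent-encodes each value and turns spaces into "+",
    so non-ASCII text and special characters stay request-safe.
    """
    params = {"q": query.strip()}
    if extra_params:
        # Hypothetical pass-through for whatever options your product supports.
        params.update(extra_params)
    return f"{GOOGLE_SEARCH_BASE}?{urlencode(params)}"


print(build_google_serp_url("best coffee shops Paris"))
```
</pre></td></tr></table></figure>
<p>Called with the example query from Step 1, this prints the same URL shown there. Keeping the base URL and parameter handling in one function means later options only touch this one place.</p><p>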
For a basic Google query, only the q parameter is required, but production systems often support additional options such as:</p><p>• Pagination<br>• Language parameters<br>• Safe search flags<br>• Device variants<br>• Regional targeting (Premium feature)</p><p>Keeping URL construction in one place makes it easier to extend later.</p><h3 id="Step-3-Route-the-Request-Through-Smart-AI-Proxy"><a href="#Step-3-Route-the-Request-Through-Smart-AI-Proxy" class="headerlink" title="Step 3: Route the Request Through Smart AI Proxy"></a>Step 3: Route the Request Through Smart AI Proxy</h3><p>Direct requests to search engines quickly fail under load. Instead, configure your HTTP client to use Smart AI Proxy as the outbound gateway.</p><p>Key configuration elements:</p><p>• Proxy endpoint (HTTP or HTTPS)<br>• Authentication using your Crawlbase token<br>• Standard proxy configuration in your HTTP library</p><p>From your application’s perspective, this behaves like any corporate proxy. The difference is that requests are transparently routed through infrastructure optimized for scraping workloads.</p><h3 id="Step-4-Request-Structured-Results"><a href="#Step-4-Request-Structured-Results" class="headerlink" title="Step 4: Request Structured Results"></a>Step 4: Request Structured Results</h3><p>Smart AI Proxy supports passing parameters via the <a href="https://crawlbase.com/docs/smart-proxy/parameters/">CrawlbaseAPI-Parameters</a> header. To parse the HTML content automatically, simply add:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">autoparse=true</span><br></pre></td></tr></table></figure><p>The response includes organic results, ads, local packs, related questions, and status information in JSON format. 
This removes the need for manual HTML parsing in many scenarios.</p><h3 id="Step-5-Handle-Response-Validation-and-Errors"><a href="#Step-5-Handle-Response-Validation-and-Errors" class="headerlink" title="Step 5: Handle Response Validation and Errors"></a>Step 5: Handle Response Validation and Errors</h3><p>Production systems should verify that the request succeeded before processing the payload. Typical checks include:</p><p>• HTTP status codes<br>• Proxy status indicators<br>• Presence of expected fields<br>• Retry logic for transient failures</p><p>The example below performs basic validation using <code>raise_for_status()</code>.</p><h3 id="Step-6-Integrate-With-Your-Application-Pipeline"><a href="#Step-6-Integrate-With-Your-Application-Pipeline" class="headerlink" title="Step 6: Integrate With Your Application Pipeline"></a>Step 6: Integrate With Your Application Pipeline</h3><p>Once retrieved, SERP data can support many use cases:</p><p>• Building a custom search interface<br>• Competitive analysis tools<br>• SEO monitoring dashboards<br>• Market research datasets<br>• AI Training datasets</p><p>Most systems normalize results into a consistent schema before storage to support analytics and ranking operations.</p><div class="secondary-cta-banner">  <div class="gradient-bg">    <h3 class="banner-title">Get a <span class="text-underline">Free Smart AI Proxy Trial</span></h3>    <p class="banner-desc">Leverage 5,000 free credits, 140M rotating proxies, and AI to bypass CAPTCHAs and avoid blocks.</p>    <div class="banner-features">      <ul class="features-list">        <li>Unlimited Bandwidth</li>        <li>Custom Geolocalization</li>        <li>100% Network Uptime</li>      </ul>      <a class="banner-btn" href="/signup?signup=blog-smart-cta" title="Get 5,000 Free Credits" onclick="gtag('event', 'smart_cta_click', { 'blog_group': 'smart_proxy', 'blog_slug': 'how-to-build-search-engine-tool', 'cta_type': 'try_smart_proxy', 'cta_position': 'top','cta_version': 
'smart_proxy_v2'});">Get 5,000 Free Credits</a>    </div>  </div>  </div><h2 id="Simple-End-to-End-Example-of-a-Search-Engine-Tool"><a href="#Simple-End-to-End-Example-of-a-Search-Engine-Tool" class="headerlink" title="Simple End-to-End Example of a Search Engine Tool"></a>Simple End-to-End Example of a Search Engine Tool</h2><p>Below is a minimal Google SERP fetcher that uses Crawlbase Smart AI Proxy as the only outbound path to Google. It shows how to:</p><ol><li>Configure the proxy with your token or the <a href="https://crawlbase.com/dashboard/smartproxy">Proxy Authentication key</a> (passed via <code>CRAWLBASE_TOKEN</code>).</li><li>Send a GET request to a <a href="https://www.google.com/search?q=best%20coffee%20shops%20Paris">Google Search URL</a>.</li><li>Pass <code>CrawlbaseAPI-Parameters: autoparse=true</code> so the response is structured JSON (no HTML parsing). You get <code>original_status</code>, <code>pc_status</code>, <code>url</code>, and body with <code>searchResults</code>, <code>ads</code>, <code>snackPack</code>, and <code>peopleAlsoAsk</code>.</li></ol><p>We omit the country parameter so the snippet works without the <a href="https://crawlbase.com/smart-proxy#pricing">Premium plan</a>.</p><h3 id="Code-Snippet-Google-SERP-fetcher-in-Python"><a href="#Code-Snippet-Google-SERP-fetcher-in-Python" class="headerlink" title="Code Snippet: Google SERP fetcher in Python"></a>Code Snippet: Google SERP fetcher in Python</h3><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span 
class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Fetches a Google SERP via Crawlbase Smart AI Proxy.</span></span><br><span class="line"><span class="comment"># Requires: pip install requests</span></span><br><span class="line"><span class="keyword">import</span> json</span><br><span class="line"><span class="keyword">import</span> os</span><br><span class="line"><span class="keyword">import</span> requests</span><br><span class="line"><span class="keyword">from</span> urllib.parse <span class="keyword">import</span> quote_plus</span><br><span class="line"><span class="keyword">from</span> urllib3.exceptions <span class="keyword">import</span> InsecureRequestWarning</span><br><span class="line">requests.packages.urllib3.disable_warnings(category=InsecureRequestWarning)</span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">fetch_google_serp</span>(<span class="params">crawlbase_token: <span class="built_in">str</span>, query: <span class="built_in">str</span></span>) -&gt; <span class="built_in">dict</span>:</span><br><span class="line">    <span class="string">&quot;&quot;&quot;</span></span><br><span class="line"><span class="string">    Fetch Google SERP for the given query using Smart AI Proxy.</span></span><br><span class="line"><span class="string">    Uses autoparse=true; response is JSON. 
Returns parsed dict (original_status, pc_status, url, body).</span></span><br><span class="line"><span class="string">    &quot;&quot;&quot;</span></span><br><span class="line">    proxy_https = <span class="string">f&quot;https://<span class="subst">&#123;crawlbase_token&#125;</span>:@smartproxy.crawlbase.com:8013&quot;</span></span><br><span class="line">    proxies = &#123;<span class="string">&quot;http&quot;</span>: proxy_https, <span class="string">&quot;https&quot;</span>: proxy_https&#125;</span><br><span class="line">    encoded_query = quote_plus(query)</span><br><span class="line">    url = <span class="string">f&quot;https://www.google.com/search?q=<span class="subst">&#123;encoded_query&#125;</span>&quot;</span></span><br><span class="line">    headers = &#123;<span class="string">&quot;CrawlbaseAPI-Parameters&quot;</span>: <span class="string">&quot;autoparse=true&quot;</span>&#125;</span><br><span class="line">    response = requests.get(url, headers=headers, proxies=proxies, verify=<span class="literal">False</span>, timeout=<span class="number">30</span>)</span><br><span class="line">    response.raise_for_status()</span><br><span class="line">    <span class="keyword">return</span> json.loads(response.text)</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> __name__ == <span class="string">&quot;__main__&quot;</span>:</span><br><span class="line">    crawlbase_token = os.environ.get(<span class="string">&quot;CRAWLBASE_TOKEN&quot;</span>, <span class="string">&quot;YOUR_CRAWLBASE_TOKEN&quot;</span>)</span><br><span class="line">    data = fetch_google_serp(crawlbase_token, <span class="string">&quot;best coffee shops Paris&quot;</span>)</span><br><span class="line">    <span class="built_in">print</span>(<span class="string">f&quot;Fetched keys: <span class="subst">&#123;<span class="built_in">list</span>(data.keys())&#125;</span>&quot;</span>)</span><br></pre></td></tr></table></figure><p>This snippet can 
run behind an API, worker queue, or scheduled job and serve as the data collection backbone of a larger system.</p><h3 id="Additional-Production-of-Search-Engine-Tool-Examples"><a href="#Additional-Production-of-Search-Engine-Tool-Examples" class="headerlink" title="Additional Production Examples of a Search Engine Tool"></a>Additional Production Examples of a Search Engine Tool</h3><h4 id="Bing-SERP-Fetcher-Normalized-to-Google-Format"><a href="#Bing-SERP-Fetcher-Normalized-to-Google-Format" class="headerlink" title="Bing SERP Fetcher (Normalized to Google Format)"></a>Bing SERP Fetcher (Normalized to Google Format)</h4><p>Crawlbase offers a <a href="https://crawlbase.com/docs/crawling-api/scrapers/#bing-serp">Bing SERP parameter</a> that can return structured results directly, but the example here takes a different route on purpose. Instead of relying on structured output, it pulls the raw HTML through Smart AI Proxy and parses it locally with <a href="https://pypi.org/project/beautifulsoup4/">BeautifulSoup</a>. This makes the logic transparent and easier to customize if you need fields that standard parsers do not expose.</p><p>Highlights of this implementation:</p><p>• Uses the same proxy setup as the Google fetcher<br>• Retrieves the standard Bing results page<br>• Parses content locally instead of relying on autoparse<br>• Produces output compatible with the Google schema<br>• Easy to modify if Bing’s layout changes</p><p><a href="https://github.com/ScraperHub/build-a-search-engine-tool-with-smart-ai-proxy/blob/main/bing_serp_fetcher.py">→ View the full Bing SERP fetcher on GitHub</a></p><h4 id="Unified-Google-Bing-SERP-Fetcher-Single-Interface"><a href="#Unified-Google-Bing-SERP-Fetcher-Single-Interface" class="headerlink" title="Unified Google + Bing SERP Fetcher (Single Interface)"></a>Unified Google + Bing SERP Fetcher (Single Interface)</h4><p>Most real systems do not depend on a single search engine. 
Traffic patterns change, availability varies, and different engines surface different information. The unified fetcher wraps both implementations behind one function so the rest of your application can treat them as interchangeable data sources.</p><p>The wrapper calls the appropriate fetcher, validates the response, and returns a normalized structure. Because the output shape is consistent, switching engines does not require changes to storage, ranking logic, or APIs.</p><p>This is the piece that turns separate scripts into something closer to production infrastructure.</p><p>What it handles:</p><p>• Choosing the search engine at runtime<br>• Verifying the response before processing<br>• Normalizing ads and organic results into the same format<br>• Returning a predictable structure every time<br>• Plugging cleanly into workers, APIs, or batch jobs</p><p><a href="https://github.com/ScraperHub/build-a-search-engine-tool-with-smart-ai-proxy/blob/main/serp_fetcher.py">→ View the unified SERP fetcher on GitHub</a></p><p><a href="https://github.com/ScraperHub/build-a-search-engine-tool-with-smart-ai-proxy">→ Full example: Google + Bing SERP fetchers, unified API, normalized JSON</a></p><h2 id="How-Do-You-Scale-a-SERP-Scraper-Without-Breaking-It"><a href="#How-Do-You-Scale-a-SERP-Scraper-Without-Breaking-It" class="headerlink" title="How Do You Scale a SERP Scraper Without Breaking It?"></a>How Do You Scale a SERP Scraper Without Breaking It?</h2><p>Scaling requires coordination across concurrency, geography, cost management, and reliability.</p><h3 id="Concurrency"><a href="#Concurrency" class="headerlink" title="Concurrency"></a>Concurrency</h3><p>Use a job queue with multiple workers issuing requests through the same proxy endpoint. 
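</p><p>One way to sketch this in Python is a small thread pool working through a list of queries. The names <code>fetch_many</code> and <code>max_workers</code> are illustrative assumptions, and <code>fetch</code> stands for any callable that takes a single query string, such as a thin wrapper around the <code>fetch_google_serp</code> function shown earlier:</p>
<figure class="highlight python"><table><tr><td class="code"><pre>
```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_many(fetch, queries, max_workers=4):
    """Run fetch(query) for every query on a bounded worker pool.

    Returns (results, errors): two dicts keyed by query, so a single
    failed request does not abort the rest of the batch.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each submitted future back to the query that produced it.
        futures = {pool.submit(fetch, query): query for query in queries}
        for future in as_completed(futures):
            query = futures[future]
            try:
                results[query] = future.result()
            except Exception as exc:
                # Keep the failure so the caller can retry or log it.
                errors[query] = exc
    return results, errors
```
</pre></td></tr></table></figure>
<p>With the earlier snippet, <code>fetch</code> could be <code>functools.partial(fetch_google_serp, crawlbase_token)</code>. Capping <code>max_workers</code> also acts as a simple client-side limit on how many requests are in flight at once.</p><p>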
Rotation distributes traffic across independent routes.</p><h3 id="Geo-and-device-variation"><a href="#Geo-and-device-variation" class="headerlink" title="Geo and device variation"></a>Geo and device variation</h3><p>If you need regional data, vary location parameters across requests. The same query can produce very different results depending on where it appears to originate.</p><h3 id="Rate-and-cost-control"><a href="#Rate-and-cost-control" class="headerlink" title="Rate and cost control"></a>Rate and cost control</h3><p>Even with a proxy layer, unbounded traffic can create unnecessary failures or expense. Simple throttling on the client side usually solves this.</p><h3 id="Resilience"><a href="#Resilience" class="headerlink" title="Resilience"></a>Resilience</h3><p>Expect occasional transient errors. Retry with backoff and monitor status codes so temporary issues do not cascade into larger failures.</p><h2 id="Why-Use-Crawlbase-for-Large-Scale-SERP-Data-Collection"><a href="#Why-Use-Crawlbase-for-Large-Scale-SERP-Data-Collection" class="headerlink" title="Why Use Crawlbase for Large-Scale SERP Data Collection"></a>Why Use Crawlbase for Large-Scale SERP Data Collection</h2><p>At scale, consistency matters more than peak performance. Occasional success is easy; sustained reliability is not. 
Smart AI Proxy provides a stable access layer without requiring you to operate your own proxy infrastructure.</p><p>Practical advantages include:</p><p>• Designed for sustained automated traffic<br>• No proxy pool maintenance<br>• Compatible with standard HTTP clients<br>• Centralized routing and mitigation<br>• Reusable across different crawling tasks</p><p>Treating this layer as infrastructure allows teams to focus on product features rather than connectivity problems.</p><h2 id="Next-Steps"><a href="#Next-Steps" class="headerlink" title="Next Steps"></a>Next Steps</h2><p>If you want to turn this from a demo into something you can actually rely on, the process is straightforward:</p><ol><li>Create a <a href="https://crawlbase.com/signup?signup=blog">Crawlbase account</a> and get your authentication key</li><li>Store the token in your environment variables or application config</li><li>Run the fetcher with a few real queries to confirm everything works from your setup</li><li>Adjust the normalization step so you keep only the data your product needs</li><li>Deploy the fetch component behind a queue worker, API endpoint, or scheduled task</li></ol><p>After this, the problem shifts from “How do we keep this scraper alive?” to “What do we want to do with the data?” Requests continue to flow, results stay consistent, and your team can focus on ranking, analysis, or product features instead of fighting blocks and CAPTCHA.</p><p>If you are unsure whether it will hold up for your use case, the quickest way to decide is to test it with your own queries. Crawlbase includes <a href="https://crawlbase.com/smart-proxy">5,000 free Smart AI Proxy requests</a>, which is enough to observe real behavior under load without changing your existing architecture.</p><p><a href="https://crawlbase.com/signup?signup=smartproxy">Sign up now</a>, get your token, and run a few searches through the proxy to evaluate it with real data.</p>]]></content>
    
    
    <summary type="html">&lt;div class=&quot;callout-banner&quot;&gt;
  &lt;div class=&quot;banner-header&quot;&gt;
    &lt;img src=&quot;/blog/images/flashlight-icon-blue.png&quot; srcset=&quot;/blog/images/flashlight-icon-blue.png 1x, /blog/images/flashlight-icon-blue@2x.png 2x&quot; alt=&quot;Flashlight Icon&quot;/&gt;
    &lt;h2 class=&quot;banner-header-label&quot;&gt;Quick Answer&lt;/h2&gt;
  &lt;/div&gt;
  &lt;p class=&quot;banner-body&quot;&gt;You can build a production-grade search engine or SERP data tool by routing all search requests through Crawlbase Smart AI Proxy, which provides rotating IPs, geo targeting, and anti-bot mitigation behind a standard proxy interface.&lt;/p&gt;
&lt;/div&gt;</summary>
    
    
    
    <category term="crawling and scraping learning" scheme="https://crawlbase.com/blog/categories/crawling-and-scraping-learning/"/>
    
    
    <category term="search engine optimization tool" scheme="https://crawlbase.com/blog/tags/search-engine-optimization-tool/"/>
    
    <category term="search engine tool" scheme="https://crawlbase.com/blog/tags/search-engine-tool/"/>
    
    <category term="how to build search engine tool" scheme="https://crawlbase.com/blog/tags/how-to-build-search-engine-tool/"/>
    
  </entry>
  
  <entry>
    <title>VPN vs. AI Proxy for Web Scraping - Which Works Better in 2026</title>
    <link href="https://crawlbase.com/blog/difference-between-vpn-and-ai-proxy/"/>
    <id>https://crawlbase.com/blog/difference-between-vpn-and-ai-proxy/</id>
    <published>2026-02-23T22:13:07.000Z</published>
    <updated>2026-04-24T11:53:23.103Z</updated>
    
    <content type="html"><![CDATA[<p>AI proxies perform better than VPNs for web scraping in 2026. If you’re sending a few hundred requests to basic targets, a VPN can suffice. However, for large-scale scraping, AI proxies are clearly the better choice. Here is why.</p><span id="more"></span><p>VPNs route all traffic through one static IP meant for private browsing. Anti-bot systems keep updated lists of known VPN IP ranges, so they quickly flag and block automated traffic, often within just a few requests. VPNs also provide no IP rotation, fingerprint management, or adaptation to site defenses.</p><p>AI-powered rotating proxies, like <a href="https://crawlbase.com/smart-proxy?signup=blog">Crawlbase Smart AI Proxy</a>, are designed to avoid IP blocks and bypass anti-bot detection. Unlike VPNs, they change identities for each request, spoof browser fingerprints, and adapt to new defenses in real time. The outcome is scraping jobs that run continuously without interruptions, even against highly protected targets.</p><table><thead><tr><th>Capability</th><th>VPN</th><th>AI Proxy</th></tr></thead><tbody><tr><td>IP Rotation</td><td>❌ Single static IP</td><td>✅ Per-request rotation</td></tr><tr><td>IP Pool Size</td><td>❌ Small, shared</td><td>✅ Large, refreshed constantly</td></tr><tr><td>Fingerprint Management</td><td>❌ None</td><td>✅ Managed automatically</td></tr><tr><td>CAPTCHA Handling</td><td>❌ Not supported</td><td>✅ Built-in mitigation</td></tr><tr><td>Anti-Bot Bypass</td><td>❌ Easily detected</td><td>✅ Adaptive &amp; real-time</td></tr><tr><td>Scalability</td><td>❌ Low</td><td>✅ High concurrency</td></tr><tr><td>Best For</td><td>Low-volume, simple targets</td><td>Production scraping at scale</td></tr></tbody></table><p>If your crawler works during testing but fails in production, the issue is usually with the network layer, not your code. 
Choosing infrastructure designed for automation is what separates stable data pipelines from constant block-fighting.</p><div class="callout-banner">  <div class="banner-header">    <img src="/blog/images/flashlight-icon-blue.png" srcset="/blog/images/flashlight-icon-blue.png 1x, /blog/images/flashlight-icon-blue@2x.png 2x" alt="Flashlight Icon"/>    <h2 class="banner-header-label">Try our AI-powered Proxies</h2>  </div>  <p class="banner-body">Why use a standard backconnect proxy when you can use AI? Bypass blocks and scale your crawler with 1M+ rotating IPs.</p>  <div class="banner-footer">    <a href="https://crawlbase.com/signup?signup=blog-callout-cta" title="Claim 5,000 Free Credits">Claim 5,000 Free Credits</a>    <img src="/blog/images/arrow-right-double-green.png" srcset="/blog/images/arrow-right-double-green.png 1x, /blog/images/arrow-right-double-green@2x.png 2x" alt="Arrow right double Icon"/>  </div></div><h2 id="Why-Teams-Initially-Choose-VPNs-for-Web-Scraping"><a href="#Why-Teams-Initially-Choose-VPNs-for-Web-Scraping" class="headerlink" title="Why Teams Initially Choose VPNs for Web Scraping"></a>Why Teams Initially Choose VPNs for Web Scraping</h2><p>Using a <a href="https://azure.microsoft.com/en-us/resources/cloud-computing-dictionary/what-is-vpn#vpndefenition">VPN</a> feels like the simplest way to avoid IP blocks. You connect to a server in another country, and your requests now appear to originate from there. No code changes are required, and most developers already understand how VPN clients work.</p><p>Typical reasons teams start here:</p><p>• Quick setup with no infrastructure planning<br>• Low upfront cost compared to proxy services<br>• Ability to test geo-restricted content immediately<br>• Works for manual checks and small scripts<br>• Familiar tool already used for remote access</p><p>For early prototypes, this can appear to solve the problem. 
A script that sends a few dozen requests may work perfectly, which creates the impression that scaling is just a matter of running it more often.</p><p>The trouble begins when the traffic stops looking like a person browsing a website.</p><h2 id="The-Breaking-Point-Why-VPNs-Fail-for-Automated-Scraping"><a href="#The-Breaking-Point-Why-VPNs-Fail-for-Automated-Scraping" class="headerlink" title="The Breaking Point: Why VPNs Fail for Automated Scraping"></a>The Breaking Point: Why VPNs Fail for Automated Scraping</h2><p>VPN networks are optimized for interactive sessions like opening pages, watching videos, and sending emails. Automated scraping produces a completely different traffic profile: rapid, repetitive, and often parallel.</p><p>Most commercial VPN providers operate relatively small pools of IP addresses that are shared among thousands of users. Those addresses accumulate a reputation over time. Once scraping activity starts, the reputation deteriorates quickly.</p><h3 id="Common-failure-patterns-include"><a href="#Common-failure-patterns-include" class="headerlink" title="Common failure patterns include:"></a>Common failure patterns include:</h3><p>• 403 Forbidden or “access denied” responses<br>• CAPTCHA challenges that block automation<br>• Rate limiting after short bursts of traffic<br>• Empty or incomplete HTML responses<br>• Sudden connection resets</p><p>Switching to another VPN server sometimes restores access temporarily, but blocks usually return because the underlying traffic still looks automated.</p><p>In practice, many teams discover that a scraper that worked in the morning stops working by the afternoon.</p><h2 id="Why-Changing-IP-Alone-Is-Not-Enough"><a href="#Why-Changing-IP-Alone-Is-Not-Enough" class="headerlink" title="Why Changing IP Alone Is Not Enough"></a>Why Changing IP Alone Is Not Enough</h2><p>Modern <a href="https://www.radware.com/cyberpedia/bot-management/bot-detection/">anti-bot systems</a> rarely rely on IP address alone. 
They build a broader profile that combines network reputation, device characteristics, and behavioral signals. Changing servers without changing the rest of that profile does not make you look like a new visitor.</p><p>Signals commonly evaluated include:</p><p>• Reputation of the IP address and the surrounding range<br>• Autonomous System Number (ASN), revealing whether traffic comes from a VPN or datacenter network<br>• Historical abuse reports associated with that provider<br>• TLS fingerprint produced during the HTTPS handshake<br>• HTTP headers and browser signature consistency<br>• Cookie usage patterns across requests<br>• Timing and concurrency patterns inconsistent with human behavior</p><p>VPN endpoints typically perform poorly on these metrics. Their IP ranges are well-known, heavily reused, and frequently flagged by threat-intelligence systems. Even if you connect to a different server, you are still coming from the same provider’s network with the same client fingerprint.</p><p>To a detection system, this looks less like a new user and more like the same automated process trying to evade controls.</p><h2 id="How-AI-Powered-Proxies-Solve-These-Problems"><a href="#How-AI-Powered-Proxies-Solve-These-Problems" class="headerlink" title="How AI-Powered Proxies Solve These Problems"></a>How AI-Powered Proxies Solve These Problems</h2><p><a href="https://crawlbase.com/blog/what-is-an-ai-proxy/">AI proxies</a> treat each request as a managed session rather than a simple network hop. Instead of exposing raw infrastructure, they orchestrate identity, routing, and mitigation dynamically.</p><p>Core capabilities typically include:</p><p>• Large pools of residential and datacenter IPs<br>• Automatic rotation per request or session<br>• Adaptive routing based on block signals<br>• Fingerprint normalization<br>• Integrated CAPTCHA handling<br>• Concurrency management</p><p>The key difference is automation. 
Engineers no longer need to monitor IP rotations and intervene manually.</p><h2 id="VPN-vs-AI-Proxy-Full-Side-by-Side-Comparison"><a href="#VPN-vs-AI-Proxy-Full-Side-by-Side-Comparison" class="headerlink" title="VPN vs. AI Proxy: Full Side-by-Side Comparison"></a>VPN vs. AI Proxy: Full Side-by-Side Comparison</h2><table><thead><tr><th>Capability</th><th>VPN</th><th>AI Proxy</th></tr></thead><tbody><tr><td>IP Rotation</td><td>❌ Manual server switching</td><td>✅ Automatic per request</td></tr><tr><td>IP Pool Size</td><td>❌ Small, shared</td><td>✅ Large, continuously refreshed</td></tr><tr><td>Fingerprint Management</td><td>❌ None</td><td>✅ Managed automatically</td></tr><tr><td>CAPTCHA Handling</td><td>❌ Not supported</td><td>✅ Built-in mitigation</td></tr><tr><td>Cloudflare Bypass</td><td>❌ Frequently blocked</td><td>✅ Adaptive mitigation</td></tr><tr><td>Scalability</td><td>❌ Low</td><td>✅ High concurrency</td></tr><tr><td>Reliability</td><td>❌ Unpredictable</td><td>✅ Consistent success rates</td></tr><tr><td>Automation Readiness</td><td>❌ Poor</td><td>✅ Designed for bots</td></tr><tr><td>JavaScript Rendering</td><td>❌ Not supported</td><td>✅ Optional headless browser</td></tr><tr><td>Best For</td><td>Manual checks, small scripts</td><td>Production pipelines at scale</td></tr></tbody></table><p>For production scraping, these differences directly affect uptime, engineering effort, and operational cost.</p><h2 id="Code-Comparison-VPN-vs-AI-Proxy-Implementation"><a href="#Code-Comparison-VPN-vs-AI-Proxy-Implementation" class="headerlink" title="Code Comparison: VPN vs. AI Proxy Implementation"></a>Code Comparison: VPN vs. AI Proxy Implementation</h2><p>The application code for both approaches can look similar. 
The difference lies in what happens outside your script.</p><h3 id="Scraping-with-a-VPN"><a href="#Scraping-with-a-VPN" class="headerlink" title="Scraping with a VPN"></a>Scraping with a VPN</h3><p>Your program sends requests normally while the operating system routes traffic through the VPN.</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">import</span> requests</span><br><span class="line"><span class="comment"># Assuming VPN is connected at OS level</span></span><br><span class="line"><span class="comment"># All traffic automatically routes through VPN</span></span><br><span class="line">target_url = <span class="string">&quot;https://www.amazon.com/dp/B08N5WRWNW&quot;</span></span><br><span class="line"><span class="keyword">try</span>:</span><br><span class="line">    response = requests.get(target_url, timeout=<span class="number">30</span>)</span><br><span class="line">    <span class="built_in">print</span>(<span class="string">f&quot;Status: <span class="subst">&#123;response.status_code&#125;</span>&quot;</span>)</span><br><span class="line">    <span class="built_in">print</span>(<span class="string">f&quot;Content length: <span class="subst">&#123;<span class="built_in">len</span>(response.text)&#125;</span>&quot;</span>)</span><br><span class="line"><span class="keyword">except</span> Exception <span class="keyword">as</span> e:</span><br><span class="line">    <span class="built_in">print</span>(<span class="string">f&quot;Error: <span class="subst">&#123;e&#125;</span>&quot;</span>)</span><br></pre></td></tr></table></figure><p>Typical outcomes after 
repeated requests:</p><p>• 403 Forbidden responses<br>• CAPTCHA pages instead of real content<br>• Connection throttling<br>• Need to manually switch servers</p><p>Operational burden grows quickly because the system cannot recover automatically.</p><h3 id="Scraping-with-Crawlbase-Smart-AI-Proxy"><a href="#Scraping-with-Crawlbase-Smart-AI-Proxy" class="headerlink" title="Scraping with Crawlbase Smart AI Proxy"></a>Scraping with Crawlbase Smart AI Proxy</h3><p>Crawlbase Smart AI Proxy routes each request through managed infrastructure optimized for scraping workloads.</p><p>Getting started requires only your access token, which is available in your <a href="https://crawlbase.com/dashboard/smartproxy">Smart AI Proxy account dashboard</a> after signing up. Once you have the token, you use it as the proxy authentication credential in your requests.</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">import</span> requests</span><br><span class="line"><span 
class="keyword">from</span> urllib3.exceptions <span class="keyword">import</span> InsecureRequestWarning</span><br><span class="line"><span class="comment"># Suppress SSL warnings</span></span><br><span class="line">requests.packages.urllib3.disable_warnings(category=InsecureRequestWarning)</span><br><span class="line"><span class="comment"># Your Crawlbase access token</span></span><br><span class="line">ACCESS_TOKEN = <span class="string">&quot;YOUR_ACCESS_TOKEN&quot;</span></span><br><span class="line"><span class="comment"># Target URL</span></span><br><span class="line">target_url = <span class="string">&quot;https://www.amazon.com/dp/B08N5WRWNW&quot;</span></span><br><span class="line"><span class="comment"># Configure Smart AI Proxy</span></span><br><span class="line">proxy_url = <span class="string">f&quot;http://<span class="subst">&#123;ACCESS_TOKEN&#125;</span>:@smartproxy.crawlbase.com:8012&quot;</span></span><br><span class="line">proxies = &#123;</span><br><span class="line">    <span class="string">&quot;http&quot;</span>: proxy_url,</span><br><span class="line">    <span class="string">&quot;https&quot;</span>: proxy_url</span><br><span class="line">&#125;</span><br><span class="line"><span class="keyword">try</span>:</span><br><span class="line">    response = requests.get(</span><br><span class="line">        url=target_url,</span><br><span class="line">        proxies=proxies,</span><br><span class="line">        verify=<span class="literal">False</span>,</span><br><span class="line">        timeout=<span class="number">30</span></span><br><span class="line">    )</span><br><span class="line">    response.raise_for_status()</span><br><span class="line"></span><br><span class="line">    <span class="built_in">print</span>(<span class="string">f&quot;✓ Status: <span class="subst">&#123;response.status_code&#125;</span>&quot;</span>)</span><br><span class="line">    <span class="built_in">print</span>(<span class="string">f&quot;✓ Content length: 
<span class="subst">&#123;<span class="built_in">len</span>(response.text)&#125;</span>&quot;</span>)</span><br><span class="line">    <span class="comment"># Parse your data here</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">except</span> requests.exceptions.RequestException <span class="keyword">as</span> e:</span><br><span class="line">    <span class="built_in">print</span>(<span class="string">f&quot;Error: <span class="subst">&#123;e&#125;</span>&quot;</span>)</span><br></pre></td></tr></table></figure><p>Expected behavior:</p><p>• Consistent 200 OK responses<br>• Automatic IP rotation<br>• Managed fingerprints<br>• Reduced CAPTCHA interruptions<br>• No manual intervention</p><h3 id="Handling-JavaScript-heavy-pages"><a href="#Handling-JavaScript-heavy-pages" class="headerlink" title="Handling JavaScript-heavy pages"></a>Handling JavaScript-heavy pages</h3><p>Many modern sites render content dynamically. You can enable <a href="https://crawlbase.com/docs/smart-proxy/headless-browsers/">browser rendering</a> through request parameters.</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Custom headers for JavaScript rendering</span></span><br><span class="line">headers = &#123;</span><br><span class="line">    <span class="string">&quot;CrawlbaseAPI-Parameters&quot;</span>: <span class="string">&quot;javascript=true&quot;</span></span><br><span class="line">&#125;</span><br><span class="line">response = requests.get(</span><br><span class="line">    url=target_url,</span><br><span 
class="line">    proxies=proxies,</span><br><span class="line">    headers=headers,</span><br><span class="line">    verify=<span class="literal">False</span>,</span><br><span class="line">    timeout=<span class="number">30</span></span><br><span class="line">)</span><br></pre></td></tr></table></figure><h3 id="Advanced-parameter-examples"><a href="#Advanced-parameter-examples" class="headerlink" title="Advanced parameter examples"></a>Advanced parameter examples</h3><p>Crawlbase allows fine-grained control without infrastructure changes via <a href="https://crawlbase.com/docs/smart-proxy/parameters/">request parameters</a>.</p><h4 id="Geo-targeting"><a href="#Geo-targeting" class="headerlink" title="Geo-targeting:"></a>Geo-targeting:</h4><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">headers = &#123;<span class="string">&quot;CrawlbaseAPI-Parameters&quot;</span>: <span class="string">&quot;country=US&quot;</span>&#125;</span><br></pre></td></tr></table></figure><h4 id="Mobile-emulation"><a href="#Mobile-emulation" class="headerlink" title="Mobile emulation:"></a>Mobile emulation:</h4><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">headers = &#123;<span class="string">&quot;CrawlbaseAPI-Parameters&quot;</span>: <span class="string">&quot;device=mobile&quot;</span>&#125;</span><br></pre></td></tr></table></figure><h4 id="Retrieve-headers-and-cookies"><a href="#Retrieve-headers-and-cookies" class="headerlink" title="Retrieve headers and cookies:"></a>Retrieve headers and cookies:</h4><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">headers = &#123;<span class="string">&quot;CrawlbaseAPI-Parameters&quot;</span>: <span 
class="string">&quot;get_headers=true&amp;get_cookies=true&quot;</span>&#125;</span><br></pre></td></tr></table></figure><h4 id="Store-results-in-Crawlbase-Cloud-Storage"><a href="#Store-results-in-Crawlbase-Cloud-Storage" class="headerlink" title="Store results in Crawlbase Cloud Storage:"></a>Store results in Crawlbase Cloud Storage:</h4><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">headers = &#123;<span class="string">&quot;CrawlbaseAPI-Parameters&quot;</span>: <span class="string">&quot;store=true&quot;</span>&#125;</span><br></pre></td></tr></table></figure><h4 id="Combine-parameters"><a href="#Combine-parameters" class="headerlink" title="Combine parameters:"></a>Combine parameters:</h4><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">headers = &#123;</span><br><span class="line">    <span class="string">&quot;CrawlbaseAPI-Parameters&quot;</span>: <span class="string">&quot;javascript=true&amp;country=US&amp;device=mobile&amp;store=true&quot;</span></span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>These controls operate at request level, enabling precise data collection strategies without rewriting core logic.</p><p>You can find the complete working examples in our <a href="https://github.com/ScraperHub/vpn-vs-ai-proxy-for-web-scraping">GitHub repository</a>.</p><h2 id="Why-Teams-Choose-Crawlbase-Smart-AI-Proxy"><a href="#Why-Teams-Choose-Crawlbase-Smart-AI-Proxy" class="headerlink" title="Why Teams Choose Crawlbase Smart AI Proxy"></a>Why Teams Choose Crawlbase Smart AI Proxy</h2><p>Crawlbase Smart AI Proxy acts as a managed access layer rather than a static proxy pool. 
You send requests to a single endpoint, and the platform determines how to deliver them successfully.</p><p>Key characteristics:</p><p>• Unified endpoint for residential and datacenter routes<br>• Automatic selection of IPs based on performance<br>• Built-in mitigation when targets begin blocking<br>• Geographic targeting across many countries<br>• Optional browser rendering</p><h3 id="Built-for-concurrent-workloads"><a href="#Built-for-concurrent-workloads" class="headerlink" title="Built for concurrent workloads"></a>Built for concurrent workloads</h3><p>Large scraping jobs require parallel execution. Collecting thousands of pages sequentially is rarely practical.</p><p>Crawlbase supports concurrency through a thread model:</p><p>• Starter plans support 20 concurrent threads<br>• Premium plans support up to 80 concurrent threads<br>• Higher limits are available through custom packages</p><p>This allows multiple requests to run simultaneously, enabling tasks such as catalog monitoring or multi-region data collection to complete in a reasonable time frame.</p><p>If additional capacity is needed, thread limits can be increased without redesigning the application. You can review the available tiers on the <a href="https://crawlbase.com/smart-proxy#pricing">Smart AI Proxy pricing page</a> to determine which level matches your workload.</p><h3 id="Reduced-operational-overhead"><a href="#Reduced-operational-overhead" class="headerlink" title="Reduced operational overhead"></a>Reduced operational overhead</h3><p>Managing your own proxy network involves constant monitoring, routing adjustments, and ban recovery. 
Crawlbase handles these tasks internally, so teams can concentrate on processing the data instead of maintaining access.</p><p>For organizations without dedicated scraping engineers, this often determines whether a project is sustainable.</p><div class="secondary-cta-banner">  <div class="gradient-bg">    <h3 class="banner-title">Get a <span class="text-underline">Free Smart AI Proxy Trial</span></h3>    <p class="banner-desc">Leverage 5,000 free credits, 140M rotating proxies, and AI to bypass CAPTCHAs and avoid blocks.</p>    <div class="banner-features">      <ul class="features-list">        <li>Unlimited Bandwidth</li>        <li>Custom Geolocalization</li>        <li>100% Network Uptime</li>      </ul>      <a class="banner-btn" href="/signup?signup=blog-smart-cta" title="Get 5,000 Free Credits" onclick="gtag('event', 'smart_cta_click', { 'blog_group': 'smart_proxy', 'blog_slug': 'difference-between-vpn-and-ai-proxy', 'cta_type': 'try_smart_proxy', 'cta_position': 'top','cta_version': 'smart_proxy_v2'});">Get 5,000 Free Credits</a>    </div>  </div>  </div><h2 id="Making-the-Right-Choice-for-Your-Project"><a href="#Making-the-Right-Choice-for-Your-Project" class="headerlink" title="Making the Right Choice for Your Project"></a>Making the Right Choice for Your Project</h2><p>Use a VPN only for:</p><p>• Manual browsing tests<br>• Verifying geo-restricted content<br>• Low-volume experiments</p><p>Use an AI proxy for:</p><p>• Production data pipelines<br>• Large-scale crawling<br>• Competitive intelligence gathering<br>• SEO monitoring across regions<br>• E-commerce price tracking<br>• Any workload requiring reliability</p><p>While AI proxies typically cost more than consumer VPNs, the difference is often outweighed by reduced engineering time, fewer failed runs, and the ability to scale without constant maintenance.</p><p>If your current setup regularly encounters blocks, CAPTCHAs, or unstable results, moving to infrastructure designed for automated data 
collection can save significant time and effort.</p><p><a href="https://crawlbase.com/signup?signup=smartproxy">Sign up for Crawlbase now</a> to start testing with real workloads and see how a purpose-built AI proxy performs at scale. You can begin with smaller jobs and expand as your data needs grow, without redesigning your scraping architecture.</p><h2 id="Frequently-asked-questions"><a href="#Frequently-asked-questions" class="headerlink" title="Frequently asked questions"></a>Frequently asked questions</h2><h3 id="Can-you-legally-use-a-VPN-for-web-scraping"><a href="#Can-you-legally-use-a-VPN-for-web-scraping" class="headerlink" title="Can you legally use a VPN for web scraping?"></a>Can you legally use a VPN for web scraping?</h3><p>Legality depends on your jurisdiction and the target site’s terms of service — not the networking tool itself. Both VPNs and proxies are simply methods of routing traffic. What matters legally is what data you collect, how you use it, and whether you are violating a site’s ToS or applicable data protection laws such as GDPR or CCPA. Always consult legal guidance before scraping sensitive or personal data.</p><h3 id="What-is-the-difference-between-a-proxy-and-a-VPN-for-scraping"><a href="#What-is-the-difference-between-a-proxy-and-a-VPN-for-scraping" class="headerlink" title="What is the difference between a proxy and a VPN for scraping?"></a>What is the difference between a proxy and a VPN for scraping?</h3><p>A VPN routes all device traffic through a single remote server, giving you one IP address for all requests with no rotation capability. A proxy, by contrast, routes individual requests and can be configured to use many different endpoints. AI-powered rotating proxies go further still — they automate IP rotation per request, normalize browser fingerprints, handle CAPTCHAs, and adapt routing based on live block signals. 
For scraping, this makes AI proxies significantly more effective than either standard proxies or VPNs.</p><h3 id="Do-you-need-a-proxy-for-web-scraping"><a href="#Do-you-need-a-proxy-for-web-scraping" class="headerlink" title="Do you need a proxy for web scraping?"></a>Do you need a proxy for web scraping?</h3><p>For small projects targeting simple, unprotected sites, direct connections may work. But for any meaningful scale, or any site using rate limiting, bot detection, or Cloudflare protection, proxy infrastructure is essential. Without it, your scraper’s IP will be flagged and blocked quickly, often within 50 to 200 requests on well-protected targets. Residential rotating proxies or AI proxies are the standard solution for production scraping in 2026.</p><h3 id="How-much-does-an-AI-proxy-cost-compared-to-a-VPN"><a href="#How-much-does-an-AI-proxy-cost-compared-to-a-VPN" class="headerlink" title="How much does an AI proxy cost compared to a VPN?"></a>How much does an AI proxy cost compared to a VPN?</h3><p>Consumer VPNs typically cost between $3 and $12 per month. AI proxy services like Crawlbase are priced based on request volume and features, which makes them more expensive upfront. However, the true cost comparison must account for hidden VPN costs: engineering time spent manually rotating servers, downtime from blocks, failed scraping runs that need to be restarted, and the ongoing operational overhead of maintaining access. For teams running production pipelines, AI proxies are almost always more cost-effective in total.</p><h3 id="What-is-the-best-proxy-for-web-scraping"><a href="#What-is-the-best-proxy-for-web-scraping" class="headerlink" title="What is the best proxy for web scraping?"></a>What is the best proxy for web scraping?</h3><p>In 2026, AI-powered rotating proxies like Crawlbase Smart AI Proxy consistently outperform general-purpose proxies for production scraping. 
They combine automatic IP rotation, fingerprint management, and CAPTCHA bypass, making them the most reliable option for large-scale, uninterrupted data collection.</p><h3 id="What-is-the-best-way-to-avoid-IP-blocks-when-scraping"><a href="#What-is-the-best-way-to-avoid-IP-blocks-when-scraping" class="headerlink" title="What is the best way to avoid IP blocks when scraping?"></a>What is the best way to avoid IP blocks when scraping?</h3><p>In 2026, avoiding IP blocks requires more than just rotating IPs. Effective block avoidance combines residential IP rotation per request, browser fingerprint normalization (TLS, HTTP headers, cookies), human-like request timing, CAPTCHA handling, and adaptive routing that responds to block signals in real time. AI-powered proxy services handle all of these automatically. Using a VPN alone addresses none of them, which is why VPN-based scrapers fail consistently on protected targets.</p>]]></content>
    
    
    <summary type="html">&lt;p&gt;AI proxies perform better than VPNs for web scraping in 2026. If you’re sending a few hundred requests to basic targets, a VPN can suffice. However, for large-scale scraping, AI proxies are clearly the better choice, and here’s why that matters.&lt;/p&gt;</summary>
    
    
    
    <category term="crawling and scraping learning" scheme="https://crawlbase.com/blog/categories/crawling-and-scraping-learning/"/>
    
    
    <category term="ai proxy for web scraping" scheme="https://crawlbase.com/blog/tags/ai-proxy-for-web-scraping/"/>
    
    <category term="ai-powered proxies" scheme="https://crawlbase.com/blog/tags/ai-powered-proxies/"/>
    
    <category term="vpn for web scraping" scheme="https://crawlbase.com/blog/tags/vpn-for-web-scraping/"/>
    
  </entry>
  
  <entry>
    <title>What Is An AI Proxy?</title>
    <link href="https://crawlbase.com/blog/what-is-an-ai-proxy/"/>
    <id>https://crawlbase.com/blog/what-is-an-ai-proxy/</id>
    <published>2026-02-13T18:37:02.000Z</published>
    <updated>2026-04-24T11:53:24.031Z</updated>
    
    <content type="html"><![CDATA[<div class="callout-banner">  <div class="banner-header">    <img src="/blog/images/flashlight-icon-blue.png" srcset="/blog/images/flashlight-icon-blue.png 1x, /blog/images/flashlight-icon-blue@2x.png 2x" alt="Flashlight Icon"/>    <h2 class="banner-header-label">Quick Answer</h2>  </div>  <p class="banner-body">An <strong>AI proxy</strong> is a smart proxy service that uses machine learning to adjust IP rotation patterns, fingerprinting, and request strategies in real time based on how target websites respond. This helps maintain high success rates against modern <strong>anti-bot systems</strong>. AI proxies are commonly used for large-scale web scraping, LLM data collection, and automated browsing workflows.</p></div><span id="more"></span><p>An AI proxy is a type of proxy that uses machine learning to change its behavior based on how a target site reacts. Unlike traditional proxies that follow fixed rules, <a href="https://crawlbase.com/blog/unblock-data-with-smart-ai-proxy/">AI proxies</a> learn continuously and adjust IP rotation patterns, fingerprinting methods, request routing, and block-avoidance strategies in real time.</p><p>AI proxies matter because modern anti-bot systems have become more advanced. Websites use behavioral analysis, fingerprinting, and rate-limiting methods to detect and block traditional proxy patterns. 
Static, rule-based systems struggle to keep pace with these changes, but machine learning can adapt.</p><div class="secondary-cta-banner">  <div class="gradient-bg">    <h3 class="banner-title">Get a <span class="text-underline">Free Smart AI Proxy Trial</span></h3>    <p class="banner-desc">Leverage 5,000 free credits, 140M rotating proxies, and AI to bypass CAPTCHAs and avoid blocks.</p>    <div class="banner-features">      <ul class="features-list">        <li>Unlimited Bandwidth</li>        <li>Custom Geolocalization</li>        <li>100% Network Uptime</li>      </ul>      <a class="banner-btn" href="/signup?signup=blog-smart-cta" title="Get 5,000 Free Credits" onclick="gtag('event', 'smart_cta_click', { 'blog_group': 'smart_proxy', 'blog_slug': 'what-is-an-ai-proxy', 'cta_type': 'try_smart_proxy', 'cta_position': 'top','cta_version': 'smart_proxy_v2'});">Get 5,000 Free Credits</a>    </div>  </div>  </div><h2 id="​Why-Traditional-Smart-Proxies-Fall-Short"><a href="#​Why-Traditional-Smart-Proxies-Fall-Short" class="headerlink" title="​Why Traditional Smart Proxies Fall Short"></a>​Why Traditional Smart Proxies Fall Short</h2><p>Traditional smart proxies work using set rules, like rotating IPs after a certain number of requests or using specific user agents. Engineers create these rules based on past blocking patterns.</p><p>The problem is that anti-bot systems evolve more quickly than manual rule updates can occur. A rotation pattern that is effective today may trigger blocks tomorrow. 
<a href="https://www.cloudflare.com/en-gb/learning/cdn/glossary/reverse-proxy/">Traditional proxies</a> react to blocks only after they happen, which leaves you in a cycle of always being one step behind.</p><p><strong>Key limitations include</strong>:</p><p>• Predictable static rotation patterns<br>• No adaptation to site-specific blocking logic<br>• Manual rule updates that lag behind anti-bot changes<br>• Limited ability to spot early signs of detection</p><h3 id="​How-AI-Powered-Proxies-Work"><a href="#​How-AI-Powered-Proxies-Work" class="headerlink" title="​How AI-Powered Proxies Work"></a>How AI-Powered Proxies Work</h3><p>AI proxies use machine learning models trained on millions of request-response pairs. The system examines:</p><p>• Response patterns, such as status codes, headers, and timing<br>• Connection successes and failures across IP pools<br>• Specific blocking patterns for each site<br>• Historical performance data for each domain</p><p>The AI layer operates between your requests and the proxy network, making real-time decisions about:</p><p>• Which IP to use for a specific request<br>• When to rotate based on the site’s current behavior<br>• How to modify fingerprints for the target site<br>• Whether to apply delays or change routing</p><p>As requests are processed, the system continuously updates its models, learning which strategies are most effective for each domain and adapting as anti-bot measures change.</p><table><thead><tr><th>Feature</th><th>Traditional Smart Proxy</th><th>AI Proxy</th></tr></thead><tbody><tr><td>Rotation Logic</td><td>Fixed rules (every N requests)</td><td>Dynamic, based on site behavior</td></tr><tr><td>Adaptation Speed</td><td>Manual updates (days to weeks)</td><td>Real-time (milliseconds)</td></tr><tr><td>Site-Specific Optimization</td><td>Generic approach for all sites</td><td>Learns each target’s patterns</td></tr><tr><td>Fingerprinting</td><td>Preset user agents and headers</td><td>Context-aware 
fingerprint generation</td></tr><tr><td>Success Rate</td><td>Degrades over time as patterns are detected</td><td>Maintains high rates through continuous learning</td></tr><tr><td>Blocking Prevention</td><td>Reactive (after blocks occur)</td><td>Proactive (detects early warning signs)</td></tr></tbody></table><div class="callout-banner">  <div class="banner-header">    <img src="/blog/images/flashlight-icon-blue.png" srcset="/blog/images/flashlight-icon-blue.png 1x, /blog/images/flashlight-icon-blue@2x.png 2x" alt="Flashlight Icon"/>    <h2 class="banner-header-label">Try our AI-powered Proxies</h2>  </div>  <p class="banner-body">Why use a standard backconnect proxy when you can use AI? Bypass blocks and scale your crawler with 1M+ rotating IPs.</p>  <div class="banner-footer">    <a href="https://crawlbase.com/signup?signup=blog-callout-cta" title="Claim 5,000 Free Credits">Claim 5,000 Free Credits</a>    <img src="/blog/images/arrow-right-double-green.png" srcset="/blog/images/arrow-right-double-green.png 1x, /blog/images/arrow-right-double-green@2x.png 2x" alt="Arrow right double Icon"/>  </div></div><h2 id="Common-AI-Proxy-Use-Cases"><a href="#Common-AI-Proxy-Use-Cases" class="headerlink" title="Common AI Proxy Use Cases"></a>Common AI Proxy Use Cases</h2><p>AI proxies are particularly effective in situations where blocking patterns frequently change or vary by target:</p><ul><li><strong>E-commerce price monitoring</strong>: Track competitor pricing across different sites with various anti-bot systems. AI adapts to each retailer’s unique defenses without needing manual setup.</li><li><strong>Market research</strong>: Scrape reviews, ratings, and product data at scale. The AI layer optimizes request patterns to avoid detection while maintaining speed.</li><li><strong>Real estate data collection</strong>: Monitor property listings across multiple platforms. 
AI manages different rate limits and blocking logic across various MLS systems.</li><li><strong>SEO and SERP tracking</strong>: Collect search rankings without triggering protections from search engines. Machine learning models learn from and adapt to search engines’ anti-scraping measures.</li><li><strong>Social media monitoring</strong>: Track mentions, trends, and sentiment across platforms that use advanced bot detection. AI adjusts request behavior to each platform’s detection patterns.</li><li><strong>LLM data collection and AI agents</strong>: Gather fresh web data for training, retrieval-augmented generation (RAG), and autonomous AI workflows without triggering modern bot defenses.</li></ul><h2 id="How-to-Choose-an-AI-Proxy-Solution"><a href="#How-to-Choose-an-AI-Proxy-Solution" class="headerlink" title="How to Choose an AI Proxy Solution"></a>How to Choose an AI Proxy Solution</h2><p>When considering <a href="https://crawlbase.com/blog/best-proxy-providers/">AI proxy providers</a>, take these factors into account:</p><ul><li><strong>Training data volume</strong>: More request-response pairs generally lead to better model performance. Ask about the size of their training dataset.</li><li><strong>Domain coverage</strong>: Does the AI have experience with your target sites? Some providers specialize in specific areas, like e-commerce or social media.</li><li><strong>IP pool quality</strong>: AI cannot make up for a poor IP reputation. Ensure they use residential or mobile IPs from trustworthy sources.</li><li><strong>Transparency of success rates</strong>: Look for providers that share actual success rates rather than just marketing claims. Ask for metrics that apply to your specific targets.</li><li><strong>API simplicity</strong>: The proxy should manage complexity behind the scenes. A straightforward API that returns clean HTML or JSON suggests the AI is working well.</li><li><strong>Cost structure</strong>: AI infrastructure can be costly. 
Unusually low prices often indicate limited AI capabilities or low-quality IPs.</li></ul><h2 id="AI-Powered-Scraping-with-Crawlbase"><a href="#AI-Powered-Scraping-with-Crawlbase" class="headerlink" title="AI-Powered Scraping with Crawlbase"></a>AI-Powered Scraping with Crawlbase</h2><p>Crawlbase <a href="https://crawlbase.com/smart-proxy">Smart AI Proxy</a> is purpose-built for developers and data teams that need reliable, large-scale access to web data. It uses adaptive AI-driven request optimization, intelligent fingerprint management, and automated retry logic to maintain high success rates against modern anti-bot systems.</p><p>Instead of requiring you to set rotation rules or manage IP pools, Crawlbase’s Smart AI Proxy handles the complexity. It selects the best IPs from millions of data center and residential networks, generates appropriate fingerprints, and adjusts timing based on each site’s behavior. You send standard requests, and the proxy returns clean data.</p><p>Crawlbase maintains high success rates across e-commerce sites, social media platforms, search engines, and other heavily protected targets, adapting in real-time as anti-bot systems evolve.</p><h2 id="​AI-Proxy-FAQs"><a href="#​AI-Proxy-FAQs" class="headerlink" title="​AI Proxy FAQs"></a>​AI Proxy FAQs</h2><h3 id="Is-an-AI-proxy-better-for-LLM-data-collection"><a href="#Is-an-AI-proxy-better-for-LLM-data-collection" class="headerlink" title="Is an AI proxy better for LLM data collection?"></a>Is an AI proxy better for LLM data collection?</h3><p>Yes. AI proxies are designed for large-scale, automated data collection workflows needed by modern LLM pipelines. Their adaptive request patterns, fingerprint management, and intelligent IP rotation maintain higher success rates than traditional proxies. 
Crawlbase Smart AI Proxy is built to handle these AI-specific workflows reliably.</p><h3 id="When-should-developers-use-an-AI-proxy-instead-of-a-rotating-proxy"><a href="#When-should-developers-use-an-AI-proxy-instead-of-a-rotating-proxy" class="headerlink" title="When should developers use an AI proxy instead of a rotating proxy?"></a>When should developers use an AI proxy instead of a rotating proxy?</h3><p>Developers should use AI proxies for heavily protected websites, real-time data pipelines, or AI-driven scraping systems. Unlike traditional rotating proxies, AI proxies automatically adjust request behavior and fingerprints, reducing manual tuning and improving reliability for large-scale web data collection.</p><h3 id="How-do-developers-integrate-an-AI-proxy-into-their-workflow"><a href="#How-do-developers-integrate-an-AI-proxy-into-their-workflow" class="headerlink" title="How do developers integrate an AI proxy into their workflow?"></a>How do developers integrate an AI proxy into their workflow?</h3><p>Integration is simple with Crawlbase Smart AI Proxy. Developers can use standard HTTP&#x2F;S requests or API calls, while the proxy automatically manages IP rotation, fingerprinting, and request timing. This lets engineering teams collect web data at scale without managing complex infrastructure.</p><h3 id="AI-proxy-vs-traditional-proxy-—-what’s-the-difference"><a href="#AI-proxy-vs-traditional-proxy-—-what’s-the-difference" class="headerlink" title="AI proxy vs traditional proxy — what’s the difference?"></a>AI proxy vs traditional proxy — what’s the difference?</h3><p>Traditional proxies use static rules and preset IP rotation, making them vulnerable to advanced bot detection. AI proxies, like Crawlbase Smart AI Proxy, adapt in real-time using machine learning, intelligent fingerprinting, and site-specific optimization, resulting in higher success rates for scraping, AI data pipelines, and LLM training.</p>]]></content>
    
    
    <summary type="html">&lt;div class=&quot;callout-banner&quot;&gt;
  &lt;div class=&quot;banner-header&quot;&gt;
    &lt;img src=&quot;/blog/images/flashlight-icon-blue.png&quot; srcset=&quot;/blog/images/flashlight-icon-blue.png 1x, /blog/images/flashlight-icon-blue@2x.png 2x&quot; alt=&quot;Flashlight Icon&quot;/&gt;
    &lt;h2 class=&quot;banner-header-label&quot;&gt;Quick Answer&lt;/h2&gt;
  &lt;/div&gt;
  &lt;p class=&quot;banner-body&quot;&gt;An &lt;strong&gt;AI proxy&lt;/strong&gt; is a smart proxy service that uses machine learning to adjust IP rotation patterns, fingerprinting, and request strategies in real-time based on how target websites respond. This helps maintain high success rates against modern &lt;strong&gt;anti-bot systems.&lt;/strong&gt; AI proxies are commonly used for large-scale web scraping, LLM data collection, and automated browsing workflows.&lt;/p&gt;
&lt;/div&gt;</summary>
    
    
    
    <category term="crawling and scraping learning" scheme="https://crawlbase.com/blog/categories/crawling-and-scraping-learning/"/>
    
    
    <category term="ai proxy" scheme="https://crawlbase.com/blog/tags/ai-proxy/"/>
    
    <category term="ai proxy for scraping" scheme="https://crawlbase.com/blog/tags/ai-proxy-for-scraping/"/>
    
    <category term="ai powered proxy" scheme="https://crawlbase.com/blog/tags/ai-powered-proxy/"/>
    
    <category term="ai scraping" scheme="https://crawlbase.com/blog/tags/ai-scraping/"/>
    
  </entry>
  
  <entry>
    <title>Smart AI Proxy vs Oxylabs Proxies - 3 Key Differences That Matter</title>
    <link href="https://crawlbase.com/blog/difference-between-smart-ai-proxy-and-oxylabs-proxies/"/>
    <id>https://crawlbase.com/blog/difference-between-smart-ai-proxy-and-oxylabs-proxies/</id>
    <published>2026-02-13T16:32:51.000Z</published>
    <updated>2026-04-24T11:53:23.103Z</updated>
    
    <content type="html"><![CDATA[<p>The three key differences between Smart AI Proxy and Oxylabs proxies are <strong>automation level, block-handling intelligence, and operational overhead</strong>. Smart AI Proxy works as an AI-managed proxy layer that automatically selects IPs, rotates traffic, applies geolocation, and adapts retry strategies when blocks occur—all through a single endpoint. Oxylabs, in contrast, offers multiple proxy services where you manually choose proxy types, configure sessions, manage targeting for each use case, and implement your own retry logic when requests fail.</p><span id="more"></span><p>In practice, this means <a href="https://crawlbase.com/smart-proxy?signup=blog">Smart AI Proxy</a> prioritizes speed to reliable results with minimal infrastructure work, while Oxylabs prioritizes granular control at the cost of higher setup and ongoing maintenance. Both approaches can succeed at scale, but they require very different levels of engineering involvement.</p><p>This article explains those differences so you can decide which option fits your team, your workload, and the amount of proxy management you want to own.</p><div class="callout-banner">  <div class="banner-header">    <img src="/blog/images/flashlight-icon-blue.png" srcset="/blog/images/flashlight-icon-blue.png 1x, /blog/images/flashlight-icon-blue@2x.png 2x" alt="Flashlight Icon"/>    <h2 class="banner-header-label">Try our AI-powered Proxies</h2>  </div>  <p class="banner-body">Why use a standard backconnect proxy when you can use AI? 
Bypass blocks and scale your crawler with 1M+ rotating IPs.</p>  <div class="banner-footer">    <a href="https://crawlbase.com/signup?signup=blog-callout-cta" title="Claim 5,000 Free Credits">Claim 5,000 Free Credits</a>    <img src="/blog/images/arrow-right-double-green.png" srcset="/blog/images/arrow-right-double-green.png 1x, /blog/images/arrow-right-double-green@2x.png 2x" alt="Arrow right double Icon"/>  </div></div><h2 id="TLDR-Comparative-Analysis-Between-Crawlbase-Smart-AI-Proxy-and-Oxylabs"><a href="#TLDR-Comparative-Analysis-Between-Crawlbase-Smart-AI-Proxy-and-Oxylabs" class="headerlink" title="TLDR: Comparative Analysis Between Crawlbase Smart AI Proxy and Oxylabs"></a>TLDR: Comparative Analysis Between Crawlbase Smart AI Proxy and Oxylabs</h2><table><thead><tr><th>Feature</th><th>Crawlbase Smart AI Proxy</th><th>Oxylabs</th></tr></thead><tbody><tr><td>Pricing model</td><td>Per credit (success-based)</td><td>Per GB bandwidth (regardless of success)</td></tr><tr><td>Setup effort</td><td>Low</td><td>Medium to high</td></tr><tr><td>Proxy products</td><td>Single AI-managed endpoint</td><td>Multiple separate services</td></tr><tr><td>IP rotation</td><td>Automatic</td><td>User configured</td></tr><tr><td>Block handling</td><td>Built-in with adaptive retries</td><td>Proxy quality plus add-ons</td></tr><tr><td>Geotargeting</td><td>Per request via headers</td><td>Via proxy product or credentials</td></tr><tr><td>Ongoing maintenance</td><td>Minimal</td><td>Continuous</td></tr><tr><td>Best fit</td><td>Lean teams, fast iteration</td><td>Teams needing deep control</td></tr></tbody></table><h2 id="What-Developers-and-Data-Teams-Are-Really-Evaluating"><a href="#What-Developers-and-Data-Teams-Are-Really-Evaluating" class="headerlink" title="What Developers and Data Teams Are Really Evaluating"></a>What Developers and Data Teams Are Really Evaluating</h2><p>When teams <a href="https://crawlbase.com/blog/best-proxy-providers/">compare proxy providers</a>, they rarely 
care about raw IP counts or marketing claims. The evaluation usually comes down to a small set of practical questions:</p><ul><li>How reliably does this work on blocked or rate-limited targets?</li><li>How quickly can you get your first successful scrape running?</li><li>How much engineering time is required to keep jobs stable?</li><li>How much control do you need versus how much automation do you want?</li></ul><p>These factors determine not only initial success but also long-term cost. A setup that looks flexible on paper can become expensive if it demands constant tuning and monitoring.</p><h2 id="What-Is-Crawlbase-Smart-AI-Proxy"><a href="#What-Is-Crawlbase-Smart-AI-Proxy" class="headerlink" title="What Is Crawlbase Smart AI Proxy?"></a>What Is Crawlbase Smart AI Proxy?</h2><p>Crawlbase Smart AI Proxy is designed as an AI Proxy, not a traditional <a href="https://www.intelligenthq.com/what-is-a-proxy-pool-how-it-works/">proxy pool</a>. Instead of exposing multiple proxy products and configuration surfaces, it provides one proxy endpoint that routes traffic through datacenter or residential IPs automatically.</p><p>Key characteristics include:</p><ul><li>A single proxy endpoint for datacenter and residential traffic</li><li>Automatic IP selection and rotation handled internally</li><li>Built-in block mitigation enhanced by AI</li><li>Geotargeting passed per request through headers</li><li>Minimal proxy configuration required</li></ul><p>On the <a href="https://crawlbase.com/smart-proxy#pricing">free trial</a>, you get <strong>5,000</strong> free requests, <strong>5</strong> concurrent threads, and access to <strong>100,000</strong> unique datacenter proxies, with AI-managed auto-geolocation optimized for success rate. 
Paid plans scale up to <strong>80</strong> concurrent threads, custom geolocation across <strong>45+</strong> countries, and up to <strong>1 million unique proxies</strong>, with enterprise packages available through <a href="https://crawlbase.com/contact">customer support</a>.</p><h3 id="How-Requests-Work-in-Practice"><a href="#How-Requests-Work-in-Practice" class="headerlink" title="How Requests Work in Practice"></a>How Requests Work in Practice</h3><p>You <a href="https://crawlbase.com/docs/smart-proxy/#the-smart-ai-proxy-in-minutes">send requests</a> using standard proxy syntax. Geolocation can be passed per request, not baked into proxy credentials.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">curl -H <span class="string">&quot;CrawlbaseAPI-Parameters: country=US&quot;</span> \</span><br><span class="line">  -x <span class="string">&quot;http://AUTH_KEY@smartproxy.crawlbase.com:8012&quot;</span> \</span><br><span class="line">  -k <span class="string">&quot;https://ipgeolocation.io/what-is-my-ip&quot;</span></span><br></pre></td></tr></table></figure><p>There are no proxy lists to rotate, no session identifiers to manage, and no need to choose between proxy types.</p><h2 id="What-Are-Oxylabs-Proxies"><a href="#What-Are-Oxylabs-Proxies" class="headerlink" title="What Are Oxylabs Proxies?"></a>What Are Oxylabs Proxies?</h2><p><a href="https://oxylabs.io/">Oxylabs</a> offers a broad set of traditional proxy services, each packaged and billed separately. 
These include residential proxies, datacenter proxies, dedicated datacenter proxies, ISP proxies, and mobile proxies.</p><p>Core characteristics of the Oxylabs approach:</p><ul><li>Multiple proxy products, each optimized for a specific use case</li><li>Manual configuration of rotation, sessions, and targeting</li><li>Very large IP pools distributed across separate services</li><li>Pricing based on bandwidth usage or subscriptions, not per successful request</li></ul><p>Oxylabs is suited for teams that need precise control over proxy behavior. That control comes with responsibility. You choose which proxy product to use, how sessions behave, and how rotation should work.</p><h3 id="Example-Using-Oxylabs-Datacenter-Proxies"><a href="#Example-Using-Oxylabs-Datacenter-Proxies" class="headerlink" title="Example: Using Oxylabs Datacenter Proxies"></a>Example: Using Oxylabs Datacenter Proxies</h3><p>To use Oxylabs datacenter proxies, you first create a proxy user with a username and password. Requests are then sent through a specific endpoint.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">curl -x dc.oxylabs.io:8000 \</span><br><span class="line">  -U <span class="string">&quot;user-USERNAME:PASSWORD&quot;</span> \</span><br><span class="line">  https://ip.oxylabs.io/location</span><br></pre></td></tr></table></figure><p>Their datacenter free plan includes <strong>5 GB per month</strong> with <strong>5 rotating IPs</strong>. 
Other proxy types follow different pricing and access rules.</p><h2 id="3-Key-Differences-Between-Smart-AI-Proxy-and-Oxylabs-Proxies"><a href="#3-Key-Differences-Between-Smart-AI-Proxy-and-Oxylabs-Proxies" class="headerlink" title="3 Key Differences Between Smart AI Proxy and Oxylabs Proxies"></a>3 Key Differences Between Smart AI Proxy and Oxylabs Proxies</h2><h3 id="Difference-1-How-Does-Smart-AI-Proxy-Automation-Compare-to-Oxylabs"><a href="#Difference-1-How-Does-Smart-AI-Proxy-Automation-Compare-to-Oxylabs" class="headerlink" title="Difference 1: How Does Smart AI Proxy Automation Compare to Oxylabs?"></a>Difference 1: How Does Smart AI Proxy Automation Compare to Oxylabs?</h3><p>The most visible difference between Crawlbase and Oxylabs is how much engineering each solution requires.</p><p>With <strong>Crawlbase Smart AI Proxy</strong>, IP selection, rotation, retries, and geolocation decisions are handled automatically. You integrate once and let the system adapt to the target site.</p><p>With <strong>Oxylabs</strong>, you explicitly configure how proxies behave. You decide which proxy product to use, whether sessions should persist, and how rotation is applied.</p><p>Result: This difference shows up immediately in time to first success. Automated setups reach stable scraping faster. Manual setups offer flexibility but take longer to tune.</p><h3 id="Difference-2-What-Makes-Smart-AI-Proxy’s-Block-Handling-Different"><a href="#Difference-2-What-Makes-Smart-AI-Proxy’s-Block-Handling-Different" class="headerlink" title="Difference 2: What Makes Smart AI Proxy’s Block Handling Different?"></a>Difference 2: What Makes Smart AI Proxy’s Block Handling Different?</h3><p>Block handling determines how reliably a scraping system maintains success rates over time.</p><p>Crawlbase integrates proxy management with crawling intelligence. When a request is blocked, the system automatically adjusts routing, IP behavior, and retry logic without requiring developer intervention. 
This is particularly effective for JavaScript-heavy or heavily protected pages.</p><p>Oxylabs relies on proxy performance combined with optional unblocker tools such as Web Unblocker. While effective, these typically require correct configuration, product selection, and ongoing tuning. As target sites change their defenses, adjustments are often handled manually.</p><p>Result: Smart AI Proxy’s automated block handling means fewer firefighting cycles and less ongoing maintenance.</p><h3 id="Difference-3-How-Much-Ongoing-Maintenance-Does-Each-Option-Require"><a href="#Difference-3-How-Much-Ongoing-Maintenance-Does-Each-Option-Require" class="headerlink" title="Difference 3: How Much Ongoing Maintenance Does Each Option Require?"></a>Difference 3: How Much Ongoing Maintenance Does Each Option Require?</h3><p>Operational overhead is often underestimated.</p><p>With Crawlbase Smart AI Proxy, there are fewer moving parts. You do not manage proxy pools, session lifetimes, or rotation strategies. Maintenance effort stays low even as targets evolve.</p><p>With Oxylabs, overhead increases with scale. Each proxy category has its own limits, pricing model, and configuration surface. 
As workloads grow, teams spend more time monitoring usage, adjusting sessions, and re-tuning behavior.</p><p>Result: For small teams, this difference can be the deciding factor.</p><div class="secondary-cta-banner">  <div class="gradient-bg">    <h3 class="banner-title">Get a <span class="text-underline">Free Smart AI Proxy Trial</span></h3>    <p class="banner-desc">Leverage 5,000 free credits, 140M rotating proxies, and AI to bypass CAPTCHAs and avoid blocks.</p>    <div class="banner-features">      <ul class="features-list">        <li>Unlimited Bandwidth</li>        <li>Custom Geolocalization</li>        <li>100% Network Uptime</li>      </ul>      <a class="banner-btn" href="/signup?signup=blog-smart-cta" title="Get 5,000 Free Credits" onclick="gtag('event', 'smart_cta_click', { 'blog_group': 'smart_proxy', 'blog_slug': 'difference-between-smart-ai-proxy-and-oxylabs-proxies', 'cta_type': 'try_smart_proxy', 'cta_position': 'top','cta_version': 'smart_proxy_v2'});">Get 5,000 Free Credits</a>    </div>  </div>  </div><h2 id="Which-Proxy-Should-You-Choose-for-Your-Use-Case"><a href="#Which-Proxy-Should-You-Choose-for-Your-Use-Case" class="headerlink" title="Which Proxy Should You Choose for Your Use Case?"></a>Which Proxy Should You Choose for Your Use Case?</h2><h3 id="Startups-and-small-teams"><a href="#Startups-and-small-teams" class="headerlink" title="Startups and small teams"></a>Startups and small teams</h3><p>If you are building a product and scraping is just one part of it, proxy management can quietly consume more time than expected. Rotation logic, retry handling, and geolocation tuning are rarely what your team wants to spend cycles on.</p><p>Smart AI Proxy makes sense here because it removes those decisions. You point your scraper to one endpoint and focus on parsing and storing data. The value is not just fewer settings. 
It is fewer failure points.</p><h3 id="Data-teams-scraping-dynamic-or-protected-pages"><a href="#Data-teams-scraping-dynamic-or-protected-pages" class="headerlink" title="Data teams scraping dynamic or protected pages"></a>Data teams scraping dynamic or protected pages</h3><p>Some targets are stable. Others are not. E-commerce platforms, travel aggregators, and search results pages often change defenses without warning. What worked last week can start returning partial responses or silent blocks today.</p><p>In those environments, reliability depends on how quickly your system adapts. Smart AI Proxy handles IP rotation and retry strategies internally, so when a route starts failing, traffic can shift without you rewriting code. That reduces the number of manual restarts and emergency patches.</p><p>For example, if you are collecting data from a marketplace protected by services like <a href="https://www.cloudflare.com/">Cloudflare</a> or <a href="https://datadome.co/">DataDome</a>, you typically need rotation logic, retry handling, and sometimes rendering support. With an AI-managed proxy layer, much of that coordination happens upstream of your scraper.</p><h3 id="Larger-teams-with-strict-control-requirements"><a href="#Larger-teams-with-strict-control-requirements" class="headerlink" title="Larger teams with strict control requirements"></a>Larger teams with strict control requirements</h3><p>Some organizations operate at a scale or level of specificity where automation alone is not enough. These teams may require sticky sessions, long-lived connections, or precise control over IP characteristics for compliance, testing, or account-bound workflows.</p><p>Oxylabs fits these scenarios better because it exposes residential, datacenter, ISP, and mobile proxies as separate services, and you can tailor the setup to very specific requirements. 
The tradeoff is that your team is responsible for configuring and maintaining that setup over time.</p><p>For organizations with dedicated scraping infrastructure, that level of control can be worth the added complexity.</p><h2 id="Is-Smart-AI-Proxy-Cheaper-Than-Oxylabs"><a href="#Is-Smart-AI-Proxy-Cheaper-Than-Oxylabs" class="headerlink" title="Is Smart AI Proxy Cheaper Than Oxylabs?"></a>Is Smart AI Proxy Cheaper Than Oxylabs?</h2><p><strong>It depends on your request volume, response sizes, and success rates</strong>. Smart AI Proxy charges per credit for successfully extracted data only. Oxylabs Web Unblocker charges per GB of bandwidth consumed, regardless of whether requests succeed or fail.</p><h3 id="Smart-AI-Proxy-pricing-success-based"><a href="#Smart-AI-Proxy-pricing-success-based" class="headerlink" title="Smart AI Proxy pricing (success-based):"></a>Smart AI Proxy pricing (success-based):</h3><ul><li>Starter: $149&#x2F;month for 200,000 credits</li><li>Advanced: $229&#x2F;month for 1M credits</li><li>Premium: $449&#x2F;month for 3M credits</li><li>Free trial: 5,000 credits</li></ul><h3 id="Oxylabs-Web-Unblocker-pricing-bandwidth-based"><a href="#Oxylabs-Web-Unblocker-pricing-bandwidth-based" class="headerlink" title="Oxylabs Web Unblocker pricing (bandwidth-based):"></a>Oxylabs Web Unblocker pricing (bandwidth-based):</h3><ul><li>Pay-as-you-go: $5.64&#x2F;GB (Micro), $5.16&#x2F;GB (Starter), $4.50&#x2F;GB (Advanced)</li><li>Enterprise: $4.20&#x2F;GB (Venture), $3.60&#x2F;GB (Business), $3.00&#x2F;GB (Corporate)</li><li>Free trial: 1GB</li></ul><p>The cheaper option depends on your specific use case. 
For an accurate cost comparison, you need to know your average response size and typical success rate for your target sites.</p><h2 id="Choose-Smart-AI-Proxy-for-Dynamic-or-Protected-Pages"><a href="#Choose-Smart-AI-Proxy-for-Dynamic-or-Protected-Pages" class="headerlink" title="Choose Smart AI Proxy for Dynamic or Protected Pages"></a>Choose Smart AI Proxy for Dynamic or Protected Pages</h2><p>Teams scraping JavaScript-heavy sites or heavily defended platforms face a different challenge. The issue is not just IP reputation, but how often scraping jobs break when defenses change.</p><p><strong>Smart AI Proxy integrates block mitigation into the request flow</strong>, which reduces the need for manual retries and emergency fixes. In practice, this means fewer failed runs and less intervention from engineers during off-hours.</p><p>A common example is tracking prices or listings on large marketplaces where content is dynamically generated and protected by multiple layers of anti-bot logic (Cloudflare, PerimeterX, DataDome). Instead of combining proxies with separate rendering services, CAPTCHA solvers, and custom retry logic, teams can rely on the proxy layer to absorb much of that complexity automatically.</p><h2 id="Getting-Started-With-Smart-AI-Proxy"><a href="#Getting-Started-With-Smart-AI-Proxy" class="headerlink" title="Getting Started With Smart AI Proxy"></a>Getting Started With Smart AI Proxy</h2><p>Smart AI Proxy integrates with existing scraping code using standard proxy syntax. There is no proxy pool setup and no long onboarding process.</p><p>You can <a href="https://crawlbase.com/signup?signup=blog">start with 5,000 free requests</a>. Test against your real targets, then decide based on the measured results. 
The integration takes 15-30 minutes and requires only adding proxy credentials to your HTTP client.</p><h2 id="Frequently-Asked-Questions"><a href="#Frequently-Asked-Questions" class="headerlink" title="Frequently Asked Questions"></a>Frequently Asked Questions</h2><h3 id="Q-Is-Smart-AI-Proxy-a-traditional-proxy-pool"><a href="#Q-Is-Smart-AI-Proxy-a-traditional-proxy-pool" class="headerlink" title="Q. Is Smart AI Proxy a traditional proxy pool?"></a>Q. Is Smart AI Proxy a traditional proxy pool?</h3><p>No. Smart AI Proxy is an AI-managed proxy layer rather than a traditional proxy pool. Instead of providing a list of IPs or requiring you to manage rotation and sessions, it exposes a single endpoint that handles IP selection, rotation, geolocation, and block mitigation internally. This abstraction reduces the amount of proxy logic you need to maintain while still supporting standard scraping workflows.</p><h3 id="Q-Does-Smart-AI-Proxy-Support-JavaScript-Rendering"><a href="#Q-Does-Smart-AI-Proxy-Support-JavaScript-Rendering" class="headerlink" title="Q. Does Smart AI Proxy Support JavaScript Rendering?"></a>Q. Does Smart AI Proxy Support JavaScript Rendering?</h3><p>Yes, Smart AI Proxy includes built-in <a href="https://www.geeksforgeeks.org/javascript/what-is-javascript-rendering/">JavaScript rendering</a> for dynamic sites. You enable it by adding a single header to your requests. No separate headless browser infrastructure is required.</p><p>The proxy layer handles rendering, waits for AJAX calls to complete, and returns the fully rendered HTML. This is particularly useful for scraping React, Angular, or Vue.js applications where content loads after the initial page load.</p><h3 id="Q-When-should-I-choose-Crawlbase-Smart-AI-Proxy-over-Oxylabs"><a href="#Q-When-should-I-choose-Crawlbase-Smart-AI-Proxy-over-Oxylabs" class="headerlink" title="Q. When should I choose Crawlbase Smart AI Proxy over Oxylabs?"></a>Q. 
When should I choose Crawlbase Smart AI Proxy over Oxylabs?</h3><p>Choose Crawlbase Smart AI Proxy when you want reliable scraping with minimal proxy management. It is a better fit if your team prefers automation over manual configuration, if your targets are frequently blocked or change behavior often, or if you want costs tied to successful requests rather than bandwidth usage.</p><p>For teams that value faster setup and lower operational overhead, Crawlbase reduces the need to manage proxy pools, sessions, and retry logic.</p>]]></content>
    
    
    <summary type="html">&lt;p&gt;The three key differences between Smart AI Proxy and Oxylabs proxies are &lt;strong&gt;automation level, block-handling intelligence, and operational overhead&lt;/strong&gt;. Smart AI Proxy works as an AI-managed proxy layer that automatically selects IPs, rotates traffic, applies geolocation, and adapts retry strategies when blocks occur—all through a single endpoint. Oxylabs, in contrast, offers multiple proxy services where you manually choose proxy types, configure sessions, manage targeting for each use case, and implement your own retry logic when requests fail.&lt;/p&gt;</summary>
    
    
    
    <category term="crawling and scraping learning" scheme="https://crawlbase.com/blog/categories/crawling-and-scraping-learning/"/>
    
    
    <category term="ai proxy" scheme="https://crawlbase.com/blog/tags/ai-proxy/"/>
    
    <category term="smart ai proxy" scheme="https://crawlbase.com/blog/tags/smart-ai-proxy/"/>
    
    <category term="web proxies" scheme="https://crawlbase.com/blog/tags/web-proxies/"/>
    
    <category term="web unblocker" scheme="https://crawlbase.com/blog/tags/web-unblocker/"/>
    
  </entry>
  
  <entry>
    <title>How to Access Geo-locked Data with Smart AI Proxy</title>
    <link href="https://crawlbase.com/blog/unblock-data-with-smart-ai-proxy/"/>
    <id>https://crawlbase.com/blog/unblock-data-with-smart-ai-proxy/</id>
    <published>2026-02-11T03:25:25.000Z</published>
    <updated>2026-04-24T11:53:23.951Z</updated>
    
    <content type="html"><![CDATA[<p>Accessing geo-locked data at scale requires more than just IP rotation. You need precise control over country and ZIP code targeting, along with automatic handling of blocks, sessions, and location-specific cookies. Traditional VPNs and proxy pools struggle when you need ZIP-accurate pricing from Amazon or country-specific SERPs from Google.</p><span id="more"></span><p><a href="https://crawlbase.com/smart-proxy">Smart AI Proxy</a> addresses this by allowing you to specify geolocation for each request using headers. Meanwhile, AI systems manage IP selection, rotation, and block mitigation based on real-time response signals.</p><div class="callout-banner">  <div class="banner-header">    <img src="/blog/images/flashlight-icon-blue.png" srcset="/blog/images/flashlight-icon-blue.png 1x, /blog/images/flashlight-icon-blue@2x.png 2x" alt="Flashlight Icon"/>    <h2 class="banner-header-label">Try our AI-powered Proxies</h2>  </div>  <p class="banner-body">Why use a standard backconnect proxy when you can use AI? 
Bypass blocks and scale your crawler with 1M+ rotating IPs.</p>  <div class="banner-footer">    <a href="https://crawlbase.com/signup?signup=blog-callout-cta" title="Claim 5,000 Free Credits">Claim 5,000 Free Credits</a>    <img src="/blog/images/arrow-right-double-green.png" srcset="/blog/images/arrow-right-double-green.png 1x, /blog/images/arrow-right-double-green@2x.png 2x" alt="Arrow right double Icon"/>  </div></div><h2 id="Why-Is-Geo-Locked-Data-Hard-to-Access-at-Scale"><a href="#Why-Is-Geo-Locked-Data-Hard-to-Access-at-Scale" class="headerlink" title="Why Is Geo-Locked Data Hard to Access at Scale?"></a>Why Is Geo-Locked Data Hard to Access at Scale?</h2><p>Geo-locked data changes based on multiple signals, not just IP address.</p><p>Key factors include:</p><ul><li>Country-specific pricing, SERPs, and availability</li><li>IP geolocation and ASN reputation</li><li>Request headers such as Accept-Language</li><li>Cookies and delivery location context</li></ul><p>That is why the same URL can return different HTML depending on where the request appears to come from.</p><p>In practice, you see this everywhere:</p><ul><li>Amazon shows different prices, taxes, and delivery options by country and ZIP code</li><li>Google SERPs change by country and city</li><li>Local marketplaces expose different sellers and inventory per region</li></ul><p>When you scale up from a handful of requests to thousands or millions, keeping all of these signals aligned and consistent becomes the real challenge.</p><h2 id="Why-VPNs-and-Manual-Proxy-Setups-Fail-for-Geo-Targeted-Scraping"><a href="#Why-VPNs-and-Manual-Proxy-Setups-Fail-for-Geo-Targeted-Scraping" class="headerlink" title="Why VPNs and Manual Proxy Setups Fail for Geo-Targeted Scraping"></a>Why VPNs and Manual Proxy Setups Fail for Geo-Targeted Scraping</h2><p>Most teams start with VPNs or simple proxy pools, and they often work during early testing. 
The problems appear as soon as volume and precision matter.</p><p>Core reasons:</p><ul><li>Most <a href="https://www.kaspersky.com/resource-center/definitions/what-is-a-vpn">VPNs are meant for human browsing</a>, not automated HTTP requests</li><li>Proxy pools suffer from IP reuse and geo-drift</li><li>Location context is not preserved across sessions</li><li>ZIP-level targeting is impossible without browser automation</li></ul><p>Common failure modes you see in production:</p><ul><li>Inconsistent geolocation results</li><li>High CAPTCHA and block rates</li><li>Session leakage across regions</li><li>Manual IP rotation and retry logic</li><li>Fragile browser workflows when sites change UI</li></ul><p>These issues compound quickly once you move beyond testing with a handful of requests and try to scale across multiple markets or regions.</p><h2 id="What-Is-Smart-AI-Proxy"><a href="#What-Is-Smart-AI-Proxy" class="headerlink" title="What Is Smart AI Proxy?"></a>What Is Smart AI Proxy?</h2><p><a href="https://crawlbase.com/smart-proxy">Smart AI Proxy</a> is a single proxy endpoint where geolocation, rotation, and blocking are handled automatically by Crawlbase using AI-driven decisioning. 
You control behavior per request using headers instead of managing IP lists, cookies, or browsers.</p><p>All traffic is routed through a single endpoint:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">smartproxy.crawlbase.com:8012 or 8013</span><br></pre></td></tr></table></figure><p>When you need to apply geolocation or other behavior, you include the CrawlbaseAPI-Parameters header in your request, for example:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">CrawlbaseAPI-Parameters: country=US&amp;javascript=true</span><br></pre></td></tr></table></figure><p>From there, Crawlbase takes over. AI models continuously evaluate request context, target behavior, and historical outcomes to select an appropriate IP, align headers with the target region, manage cookies and session state, and validate that the response matches the requested location.</p><h2 id="How-Does-Smart-AI-Proxy-Handle-Geolocation-Automatically"><a href="#How-Does-Smart-AI-Proxy-Handle-Geolocation-Automatically" class="headerlink" title="How Does Smart AI Proxy Handle Geolocation Automatically?"></a>How Does Smart AI Proxy Handle Geolocation Automatically?</h2><h3 id="Automatic-IP-selection-and-rotation"><a href="#Automatic-IP-selection-and-rotation" class="headerlink" title="Automatic IP selection and rotation"></a>Automatic IP selection and rotation</h3><p>When you specify a <a href="https://crawlbase.com/docs/crawling-api/parameters/#country">country parameter</a> such as <code>country=GB</code>, Crawlbase:</p><ul><li>Selects a clean UK IP using AI-assisted routing logic</li><li>Applies matching headers such as Accept-Language</li><li>Routes the request through that IP</li><li>Rotates IPs automatically to reduce fingerprinting</li></ul><p>You do not manage IP pools, rotation rules, 
or session lifetimes yourself.</p><h3 id="Built-in-block-mitigation"><a href="#Built-in-block-mitigation" class="headerlink" title="Built-in block mitigation"></a>Built-in block mitigation</h3><p>Smart AI Proxy handles common blocking mechanisms automatically:</p><ul><li>Header normalization to browser-like patterns</li><li>JavaScript challenge handling via <a href="https://crawlbase.com/docs/smart-proxy/headless-browsers/">Headless Browsers</a> (add <code>javascript=true</code>)</li><li>CAPTCHA detection with automatic retries</li><li>Fallback strategies with AI-assisted solutions when blocks are detected</li></ul><p>From your side, requests remain standard HTTP calls.</p><h3 id="ZIP-level-cookie-management-for-Amazon"><a href="#ZIP-level-cookie-management-for-Amazon" class="headerlink" title="ZIP-level cookie management for Amazon"></a>ZIP-level cookie management for Amazon</h3><p>For Amazon pages, Smart AI Proxy supports a dedicated <code>zipcode</code> parameter that:</p><ul><li>Generates ZIP-specific location cookies</li><li>Injects them into the request</li><li>Ensures the delivery location matches the target ZIP</li><li>Keeps sessions isolated between requests</li></ul><p>This approach removes the need for browser automation tools such as Puppeteer, Playwright, or Selenium, while still producing HTML that matches what real users see in a specific location.</p><h2 id="How-Do-You-Target-a-Specific-Country-Using-Smart-AI-Proxy"><a href="#How-Do-You-Target-a-Specific-Country-Using-Smart-AI-Proxy" class="headerlink" title="How Do You Target a Specific Country Using Smart AI Proxy?"></a>How Do You Target a Specific Country Using Smart AI Proxy?</h2><p>Country-level targeting requires three steps.</p><ol><li><strong>Use the Crawlbase Smart AI Proxy endpoint:</strong> <code>smartproxy.crawlbase.com:8012</code> (HTTP) or port 8013 (HTTPS)</li><li><strong>Pass the country parameter via header:</strong> Add 
<code>CrawlbaseAPI-Parameters: country=XX</code> where XX is the ISO country code</li><li><strong>Send your request:</strong> The response will reflect the geo-targeted content for that country</li></ol><h3 id="Practical-Example-Amazon-Product-Pricing-Across-Countries"><a href="#Practical-Example-Amazon-Product-Pricing-Across-Countries" class="headerlink" title="Practical Example: Amazon Product Pricing Across Countries"></a>Practical Example: Amazon Product Pricing Across Countries</h3><p>This example compares <a href="https://www.amazon.com/dp/B09XS7JWHH">Sony WH-1000XM5</a> pricing between the US and the UK using the same code and URL.</p><p>You can also get the complete script in our <a href="https://github.com/ScraperHub/how-to-access-region-locked-data-with-smart-ai-proxy/blob/main/example1.py">GitHub page</a>.</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span 
class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">import</span> requests</span><br><span class="line"><span class="keyword">from</span> urllib.parse <span class="keyword">import</span> urlencode</span><br><span class="line"><span class="keyword">from</span> urllib3.exceptions <span class="keyword">import</span> InsecureRequestWarning</span><br><span class="line"></span><br><span class="line">requests.packages.urllib3.disable_warnings(category=InsecureRequestWarning)</span><br><span class="line"></span><br><span class="line">input_url = <span class="string">&quot;https://www.amazon.com/Sony-WH-1000XM5-Canceling-Headphones-Hands-Free/dp/B09XS7JWHH/ref=sr_1_1&quot;</span></span><br><span class="line"></span><br><span class="line">private_access_token = <span class="string">&quot;YOUR_CRAWLBASE_TOKEN&quot;</span></span><br><span class="line">proxy_url = <span class="string">f&quot;http://<span class="subst">&#123;private_access_token&#125;</span>:@smartproxy.crawlbase.com:8012&quot;</span>  <span class="comment"># Use https:// and port 8013 for HTTPS</span></span><br><span class="line">proxies = &#123;</span><br><span class="line">    <span class="string">&quot;http&quot;</span>: proxy_url,</span><br><span class="line">    <span class="string">&quot;https&quot;</span>: proxy_url</span><br><span class="line">&#125;</span><br><span class="line">crawlbase_api_parameters = &#123;</span><br><span class="line">    <span class="string">&quot;country&quot;</span>: <span class="string">&quot;US&quot;</span>,</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">try</span>:</span><br><span class="line">    response = requests.get(</span><br><span class="line">        url=input_url,</span><br><span class="line">        headers=&#123;<span class="string">&quot;CrawlbaseAPI-Parameters&quot;</span>: 
urlencode(crawlbase_api_parameters)&#125;,</span><br><span class="line">        proxies=proxies,</span><br><span class="line">        verify=<span class="literal">False</span>,</span><br><span class="line">        timeout=<span class="number">30</span></span><br><span class="line">    )</span><br><span class="line">    response.raise_for_status()  <span class="comment"># Raise an exception for bad status codes</span></span><br><span class="line"></span><br><span class="line">    <span class="built_in">print</span>(<span class="string">&#x27;Response Code:&#x27;</span>, response.status_code)</span><br><span class="line"></span><br><span class="line">    output_file_name = <span class="string">f&quot;example1-<span class="subst">&#123;crawlbase_api_parameters[<span class="string">&#x27;country&#x27;</span>]&#125;</span>.html&quot;</span></span><br><span class="line">    <span class="keyword">with</span> <span class="built_in">open</span>(output_file_name, <span class="string">&#x27;w&#x27;</span>, encoding=<span class="string">&#x27;utf-8&#x27;</span>) <span class="keyword">as</span> f:</span><br><span class="line">        f.write(response.text)</span><br><span class="line"></span><br><span class="line">    <span class="built_in">print</span>(<span class="string">f&#x27;Response saved to <span class="subst">&#123;output_file_name&#125;</span>&#x27;</span>)</span><br><span class="line"><span class="keyword">except</span> requests.exceptions.RequestException <span class="keyword">as</span> e:</span><br><span class="line">    <span class="built_in">print</span>(<span class="string">f&quot;An error occurred: <span class="subst">&#123;e&#125;</span>&quot;</span>)</span><br></pre></td></tr></table></figure><img src="/blog/unblock-data-with-smart-ai-proxy/example-us-one.jpg" class="" title="Amazon Headphones Product Page; AI Proxy Scraping"><p>The response shows:</p><ul><li>Prices in US Dollars (USD)</li><li>US sales tax information</li><li>US-specific product 
availability</li><li>amazon.com seller rankings and Prime eligibility</li></ul><p>Now change only one parameter, from <code>country=US</code> to <code>country=GB</code>.</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">crawlbase_api_parameters = &#123;</span><br><span class="line">    <span class="string">&quot;country&quot;</span>: <span class="string">&quot;GB&quot;</span></span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><img src="/blog/unblock-data-with-smart-ai-proxy/example-gb-one.jpg" class="" title="Amazon Headphones Product Page; AI Proxy Scraping"><p>The UK response shows:</p><ul><li>Prices in British Pounds (GBP)</li><li>VAT-inclusive pricing (20%)</li><li>Different availability based on local inventory</li><li><a href="https://www.amazon.co.uk/">Amazon.co.uk</a>-specific deals and Prime benefits</li></ul><p>This is request-level geo-targeting in practice.</p><h2 id="How-Do-You-Scrape-ZIP-Level-Pricing-With-Smart-AI-Proxy"><a href="#How-Do-You-Scrape-ZIP-Level-Pricing-With-Smart-AI-Proxy" class="headerlink" title="How Do You Scrape ZIP-Level Pricing With Smart AI Proxy?"></a>How Do You Scrape ZIP-Level Pricing With Smart AI Proxy?</h2><p>Country-level targeting works for broad comparisons, but it falls short when you need accurate pricing. Amazon, for example, does not show a single price for the entire US. What customers see depends on their delivery ZIP code, and that difference affects total cost, availability, and delivery promises.</p><p>Crawlbase Smart AI Proxy solves this specific problem for Amazon by letting you pass ZIP-level context directly with the request. 
Instead of running a browser to set a delivery location, you include both <code>country</code> and <code>zipcode</code> parameters, such as <code>country=US&amp;zipcode=10001</code>.</p><p>The result is Amazon HTML that matches what a real customer in that ZIP code would see, without browser automation, cookie management, or fragile UI workflows.</p><h3 id="Supported-Countries-for-ZIP-Postal-Code-Targeting"><a href="#Supported-Countries-for-ZIP-Postal-Code-Targeting" class="headerlink" title="Supported Countries for ZIP&#x2F;Postal Code Targeting"></a>Supported Countries for ZIP&#x2F;Postal Code Targeting</h3><ul><li><strong>Americas:</strong> United States, Canada, Brazil, Mexico</li><li><strong>Europe:</strong> United Kingdom, Germany, France, Spain, Italy, Netherlands, Sweden, Poland</li><li><strong>Asia-Pacific:</strong> Japan, India, Singapore, Australia</li><li><strong>Middle East:</strong> United Arab Emirates, Saudi Arabia</li></ul><p>All ZIP codes are pre-validated to ensure they’re recognized by the target e-commerce site.</p><h3 id="Practical-example-Amazon-product-pricing-across-countries"><a href="#Practical-example-Amazon-product-pricing-across-countries" class="headerlink" title="Practical example: Amazon product pricing across countries"></a>Practical example: Amazon product pricing across countries</h3><p>Let’s compare Amazon pricing for the same product between a specific US ZIP code and the UK. 
(You can see this full code example in our <a href="https://github.com/ScraperHub/how-to-access-region-locked-data-with-smart-ai-proxy/blob/main/example2.py">GitHub page</a>)</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">import</span> requests</span><br><span class="line"><span class="keyword">from</span> urllib.parse <span class="keyword">import</span> urlencode</span><br><span class="line"><span class="keyword">from</span> urllib3.exceptions <span class="keyword">import</span> InsecureRequestWarning</span><br><span class="line"></span><br><span class="line">requests.packages.urllib3.disable_warnings(category=InsecureRequestWarning)</span><br><span class="line"></span><br><span 
class="line">input_url = <span class="string">&quot;https://www.amazon.com/Mount-Comfort-Coffee-Organic-Whole/dp/B07171HMF5/ref=sr_1_2&quot;</span></span><br><span class="line"></span><br><span class="line">private_access_token = <span class="string">&quot;YOUR_CRAWLBASE_TOKEN&quot;</span></span><br><span class="line">proxy_url = <span class="string">f&quot;http://<span class="subst">&#123;private_access_token&#125;</span>:@smartproxy.crawlbase.com:8012&quot;</span>  <span class="comment"># Use https:// and port 8013 for HTTPS</span></span><br><span class="line">proxies = &#123;</span><br><span class="line">    <span class="string">&quot;http&quot;</span>: proxy_url,</span><br><span class="line">    <span class="string">&quot;https&quot;</span>: proxy_url</span><br><span class="line">&#125;</span><br><span class="line">crawlbase_api_parameters = &#123;</span><br><span class="line">    <span class="string">&quot;country&quot;</span>: <span class="string">&quot;US&quot;</span>,</span><br><span class="line">    <span class="string">&quot;zipcode&quot;</span>: <span class="string">&quot;90210&quot;</span>, <span class="comment">#10004</span></span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">try</span>:</span><br><span class="line">    response = requests.get(</span><br><span class="line">        url=input_url,</span><br><span class="line">        headers=&#123;<span class="string">&quot;CrawlbaseAPI-Parameters&quot;</span>: urlencode(crawlbase_api_parameters)&#125;,</span><br><span class="line">        proxies=proxies,</span><br><span class="line">        verify=<span class="literal">False</span>,</span><br><span class="line">        timeout=<span class="number">30</span></span><br><span class="line">    )</span><br><span class="line">    response.raise_for_status()  <span class="comment"># Raise an exception for bad status codes</span></span><br><span class="line"></span><br><span class="line">    
<span class="built_in">print</span>(<span class="string">&#x27;Response Code:&#x27;</span>, response.status_code)</span><br><span class="line"></span><br><span class="line">    output_file_name = <span class="string">f&quot;example2-<span class="subst">&#123;crawlbase_api_parameters[<span class="string">&#x27;country&#x27;</span>]&#125;</span>-<span class="subst">&#123;crawlbase_api_parameters[<span class="string">&#x27;zipcode&#x27;</span>]&#125;</span>.html&quot;</span></span><br><span class="line">    <span class="keyword">with</span> <span class="built_in">open</span>(output_file_name, <span class="string">&#x27;w&#x27;</span>, encoding=<span class="string">&#x27;utf-8&#x27;</span>) <span class="keyword">as</span> f:</span><br><span class="line">        f.write(response.text)</span><br><span class="line"></span><br><span class="line">    <span class="built_in">print</span>(<span class="string">f&#x27;Response saved to <span class="subst">&#123;output_file_name&#125;</span>&#x27;</span>)</span><br><span class="line"><span class="keyword">except</span> requests.exceptions.RequestException <span class="keyword">as</span> e:</span><br><span class="line">    <span class="built_in">print</span>(<span class="string">f&quot;An error occurred: <span class="subst">&#123;e&#125;</span>&quot;</span>)</span><br></pre></td></tr></table></figure><p><strong>Result:</strong></p><img src="/blog/unblock-data-with-smart-ai-proxy/example-us-two.jpg" class="" title="Amazon Roasted Coffee Beans Product Page; AI Proxy Scraping"><ul><li><strong>Price:</strong> $28.27</li><li><strong>Delivery location:</strong> “Deliver to Beverly Hills 90210.”</li><li><strong>Sales tax:</strong> 9.5% California sales tax</li><li><strong>Prime delivery:</strong> Location-specific delivery estimates</li></ul><p>Now change one line.</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span 
class="line">3</span><br></pre></td><td class="code"><pre><span class="line">crawlbase_api_parameters = &#123;</span><br><span class="line">    <span class="string">&quot;country&quot;</span>: <span class="string">&quot;GB&quot;</span></span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p><strong>Result:</strong></p><img src="/blog/unblock-data-with-smart-ai-proxy/example-gb-two.jpg" class="" title="Amazon Roasted Coffee Beans Product Page; AI Proxy Scraping"><p>In this case, the same product is not available on Amazon UK at the time of scraping. This is not a formatting difference or a currency issue. It reflects real availability constraints in that market.</p><p>Without geo-accurate targeting, you might incorrectly assume a product is globally available, misjudge competitive pressure, or make pricing decisions based on data that customers in a given region never actually see. ZIP- and country-level accuracy turns Amazon scraping from a rough signal into something you can rely on for pricing analysis and market decisions.</p><h2 id="Real-World-Use-Cases-for-Geo-Targeted-Scraping"><a href="#Real-World-Use-Cases-for-Geo-Targeted-Scraping" class="headerlink" title="Real-World Use Cases for Geo-Targeted Scraping"></a>Real-World Use Cases for Geo-Targeted Scraping</h2><h3 id="E-commerce-price-monitoring-by-country-or-city"><a href="#E-commerce-price-monitoring-by-country-or-city" class="headerlink" title="E-commerce price monitoring by country or city"></a>E-commerce price monitoring by country or city</h3><p>To stay competitive, teams need to know what customers in each market actually see, not a converted or averaged price.</p><p>With geo-targeted scraping, this usually means running automated daily crawls on Amazon or other marketplaces using country- or city-specific targeting.</p><p>A typical workflow looks something like this:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span 
class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><span class="line">markets = [</span><br><span class="line">    &#123;<span class="string">&quot;country&quot;</span>: <span class="string">&quot;US&quot;</span>, <span class="string">&quot;zipcode&quot;</span>: <span class="string">&quot;10001&quot;</span>&#125;,</span><br><span class="line">    &#123;<span class="string">&quot;country&quot;</span>: <span class="string">&quot;GB&quot;</span>, <span class="string">&quot;zipcode&quot;</span>: <span class="string">&quot;SW1A 1AA&quot;</span>&#125;,</span><br><span class="line">    &#123;<span class="string">&quot;country&quot;</span>: <span class="string">&quot;DE&quot;</span>, <span class="string">&quot;zipcode&quot;</span>: <span class="string">&quot;10115&quot;</span>&#125;,</span><br><span class="line">    &#123;<span class="string">&quot;country&quot;</span>: <span class="string">&quot;JP&quot;</span>, <span class="string">&quot;zipcode&quot;</span>: <span class="string">&quot;100-0001&quot;</span>&#125;</span><br><span class="line">]</span><br><span class="line"></span><br><span class="line"><span class="keyword">for</span> market <span class="keyword">in</span> markets:</span><br><span class="line">    response = scrape_with_smart_proxy(</span><br><span class="line">        url=product_url,</span><br><span class="line">        country=market[<span class="string">&quot;country&quot;</span>],</span><br><span class="line">        zipcode=market[<span class="string">&quot;zipcode&quot;</span>]</span><br><span class="line">    )</span><br><span class="line">    prices[market[<span 
class="string">&quot;country&quot;</span>]] = extract_price(response)</span><br></pre></td></tr></table></figure><p>Each run produces pricing data that reflects real local conditions. Over time, this gives you a reliable view of how competitors adjust prices by market and where meaningful gaps exist.</p><h3 id="Local-SEO-and-SERP-Tracking"><a href="#Local-SEO-and-SERP-Tracking" class="headerlink" title="Local SEO and SERP Tracking"></a>Local SEO and SERP Tracking</h3><p>Search engines personalize results in several ways, and <em>location is one of the biggest factors</em>. <a href="https://support.google.com/websearch/answer/12412910">Google’s documentation</a> confirms that your search results can differ from someone else’s results based on where you are when you make the query.</p><p>For SEO professionals, this means you cannot rely on rank data pulled from a single location to represent how audiences in different regions experience search visibility. Running geo-targeted rank tracking lets you understand how your site performs across markets, whether you’re measuring organic positions, featured snippets, or local pack results.</p><h3 id="Market-Research-and-Competitive-Intelligence"><a href="#Market-Research-and-Competitive-Intelligence" class="headerlink" title="Market Research and Competitive Intelligence"></a>Market Research and Competitive Intelligence</h3><p>Market expansion usually fails before execution begins. Pricing, availability, and competitive pressure change once you look at a market from the inside instead of relying on global or home-country views.</p><p>Manual checks do not scale beyond a few regions. Geo-targeted scraping does. 
Pulling data from local versions of e-commerce sites shows what customers actually see, not converted prices or inferred availability.</p><p><strong>Example scenario:</strong> A US brand evaluating Europe scraped localized data from Germany, France, and Spain and found:</p><ul><li>Prices are about 20% higher in France than in Germany</li><li>Over-saturated categories in Spain</li><li>Strong demand for a product line they planned to drop</li></ul><p>That changed their launch plan before money was spent. Without local data, they would have optimized for conditions that were not real.</p><h2 id="How-to-Implement-Smart-AI-Proxy-in-Production"><a href="#How-to-Implement-Smart-AI-Proxy-in-Production" class="headerlink" title="How to Implement Smart AI Proxy in Production"></a>How to Implement Smart AI Proxy in Production</h2><p>If you already operate crawlers or data pipelines, Smart AI Proxy does not require you to rethink your setup. There is no browser layer to maintain and no new orchestration model to introduce. It slots into existing HTTP-based workflows.</p><p><strong>Step 1: Get your Authentication Key:</strong> Get your Crawlbase authentication key from the <a href="https://crawlbase.com/dashboard/smartproxy">dashboard</a>. New accounts receive 5,000 free requests for testing.</p><p><strong>Step 2: Install Dependencies</strong></p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">pip install requests urllib3</span><br></pre></td></tr></table></figure><p><strong>Step 3: Send your first geo-targeted request:</strong> Use the examples in this guide or a ready-made script from <a href="https://github.com/ScraperHub/how-to-access-region-locked-data-with-smart-ai-proxy">ScraperHub</a>. 
You only need to route traffic through the Smart AI Proxy endpoint and set request-level parameters.</p><p><strong>Step 4: Prepare for production</strong></p><p>At this stage, you treat it like any other data pipeline:</p><ul><li>Add retries and basic error handling</li><li>Apply rate limits aligned with your plan</li><li>Monitor response anomalies, such as a currency or delivery location that does not match the requested market, instead of raw failure counts</li><li>Store raw HTML alongside parsed output for verification</li></ul><p><strong>Step 5: Optimize Costs</strong></p><ul><li>Use normal requests when <a href="https://crawlbase.com/docs/smart-proxy/headless-browsers/">Headless Browsers</a> are not needed (cuts cost in half)</li><li>Cache pages that change infrequently</li><li>Batch requests to reduce overhead</li></ul><p><strong>Ready to Scale Your Geo-Targeted Data Collection?</strong></p><p>Geo-targeted scraping doesn’t require VPNs, manually managed proxy pools, or browser automation when you control location at the request level. Smart AI Proxy automatically handles IP selection, rotation, block mitigation, and ZIP-level cookie management. You simply specify the country and ZIP code in the <code>CrawlbaseAPI-Parameters</code> header.</p><p>Whether you are monitoring Amazon prices in different markets, tracking local SERPs, or collecting competitive intelligence by region, this method scales from testing to production without changes to your existing HTTP workflow.</p><p><a href="https://crawlbase.com/signup?signup=blog">Sign up for Crawlbase</a> to receive 5,000 free requests and test geo-targeted scraping for your specific use case. 
Compare the results against your current setup; most teams notice the improvement in data accuracy right away.</p><h2 id="Frequently-Asked-Questions-FAQs"><a href="#Frequently-Asked-Questions-FAQs" class="headerlink" title="Frequently Asked Questions (FAQs)"></a>Frequently Asked Questions (FAQs)</h2><h3 id="Q-How-many-countries-does-Smart-AI-Proxy-support"><a href="#Q-How-many-countries-does-Smart-AI-Proxy-support" class="headerlink" title="Q: How many countries does Smart AI Proxy support?"></a>Q: How many countries does Smart AI Proxy support?</h3><p><strong>A:</strong> Smart AI Proxy supports over 195 countries for country-level targeting. For ZIP&#x2F;postal code targeting on Amazon, it supports more than 20 countries, including the US, Canada, UK, Germany, France, Japan, India, Australia, and key markets across Europe, Asia-Pacific, and the Middle East. All ZIP codes are pre-validated to ensure compatibility.</p><h3 id="Q-Can-I-target-specific-cities-within-a-country"><a href="#Q-Can-I-target-specific-cities-within-a-country" class="headerlink" title="Q: Can I target specific cities within a country?"></a>Q: Can I target specific cities within a country?</h3><p><strong>A:</strong> Yes. For Amazon scraping, you can achieve city-level accuracy using the <code>zipcode</code> parameter (e.g., <code>country=US&amp;zipcode=10001</code> for New York City). For other sites, city-level targeting depends on how the target website uses geolocation. Most sites respond to country-level IP targeting, while some consider additional headers and cookies, which Smart AI Proxy manages automatically.
</p><h3 id="Q-What’s-the-difference-between-country-and-zipcode-parameters"><a href="#Q-What’s-the-difference-between-country-and-zipcode-parameters" class="headerlink" title="Q: What’s the difference between country and zipcode parameters?"></a>Q: What’s the difference between country and zipcode parameters?</h3><p><strong>A:</strong> The <code>country</code> parameter targets broad geo-locked content like currency, language, and regional availability. The <code>zipcode</code> parameter, currently available for Amazon, adds context for the delivery location, affecting pricing, taxes, shipping costs, and local inventory. For example, <code>country=US</code> shows USD pricing, while <code>country=US&amp;zipcode=90210</code> shows exact pricing with California sales tax and delivery estimates for Beverly Hills.</p><h3 id="Q-Can-I-use-Smart-AI-Proxy-for-websites-other-than-Amazon"><a href="#Q-Can-I-use-Smart-AI-Proxy-for-websites-other-than-Amazon" class="headerlink" title="Q: Can I use Smart AI Proxy for websites other than Amazon?"></a>Q: Can I use Smart AI Proxy for websites other than Amazon?</h3><p><strong>A:</strong> Yes. Smart AI Proxy works with most websites, including Google, e-commerce platforms, and local marketplaces, and it supports use cases such as SERP tracking. The <code>country</code> parameter works universally. ZIP-level targeting is currently optimized specifically for Amazon across more than 20 countries.</p>]]></content>
    
    
    <summary type="html">&lt;p&gt;Accessing geo-locked data at scale requires more than just IP rotation. You need precise control over country and ZIP code targeting, along with automatic handling of blocks, sessions, and location-specific cookies. Traditional VPNs and proxy pools struggle when you need ZIP-accurate pricing from Amazon or country-specific SERPs from Google.&lt;/p&gt;</summary>
    
    
    
    <category term="crawling and scraping learning" scheme="https://crawlbase.com/blog/categories/crawling-and-scraping-learning/"/>
    
    
    <category term="smart proxy" scheme="https://crawlbase.com/blog/tags/smart-proxy/"/>
    
    <category term="ai Proxy" scheme="https://crawlbase.com/blog/tags/ai-Proxy/"/>
    
    <category term="unblock data" scheme="https://crawlbase.com/blog/tags/unblock-data/"/>
    
    <category term="access geo-locked data" scheme="https://crawlbase.com/blog/tags/access-geo-locked-data/"/>
    
  </entry>
  
  <entry>
    <title>Introducing the New Crawlbase Dashboard</title>
    <link href="https://crawlbase.com/blog/new-crawlbase-dashboard/"/>
    <id>https://crawlbase.com/blog/new-crawlbase-dashboard/</id>
    <published>2026-02-09T18:37:02.000Z</published>
    <updated>2026-04-24T11:53:23.551Z</updated>
    
    <content type="html"><![CDATA[<p>We are excited to introduce the completely redesigned <a href="https://crawlbase.com/signup?signup=blog">Crawlbase Dashboard</a>. This upgrade offers more than just a new look. It provides a fresh experience that changes how you manage your crawling projects, track usage, and scale your data operations confidently.</p><span id="more"></span><p>Whether you are an experienced user or just starting your first crawl, the new dashboard helps you understand your data more quickly and easily.</p><img src="/blog/new-crawlbase-dashboard/crawlbase-dashboard-home-screen.jpg" class="" title="An image shows the new Crawlbase dashboard " alt="An image shows the new Crawlbase dashboard"><h2 id="What’s-New-Top-Functional-Wins"><a href="#What’s-New-Top-Functional-Wins" class="headerlink" title="What’s New: Top Functional Wins"></a>What’s New: Top Functional Wins</h2><p>The new Crawlbase dashboard focuses on three core pillars to improve your daily workflow:</p><p><strong>Faster Setup</strong>: Get your projects up and running with significantly fewer steps.</p><p><strong>Clearer Usage Data</strong>: Gain better insights into how your requests are performing with transparent data visualization.</p><p><strong>Billing at a Glance</strong>: Monitor your limits and billing status (in USD) directly from the main view to avoid any service interruptions.</p><h2 id="Built-to-Fix-Real-Pain-Points"><a href="#Built-to-Fix-Real-Pain-Points" class="headerlink" title="Built to Fix Real Pain Points"></a>Built to Fix Real Pain Points</h2><p>To make your experience more intuitive and efficient, we added the following features to the new dashboard:</p><ul><li><strong>A unified left-hand navigation</strong> that connects our entire array of products in one place.</li><li><strong>Clearer limits and trial status</strong> are in full view, so you always know where you stand.</li><li><strong>Easier onboarding</strong> ensures new users can make their first API call 
faster than ever.</li></ul><img src="/blog/new-crawlbase-dashboard/crawling-api-dashboard-screen.jpg" class="" title="An image shows the new Crawlbase dashboard usage " alt="An image shows the new Crawlbase dashboard usage"><p>When you log in for the first time, you will immediately notice a modern and cleaner UI designed for maximum readability. Beyond the aesthetics, the dashboard now features real-time charts and data with advanced filters, allowing you to drill down into your usage logs and monitoring views instantly.</p><ul><li><strong>Visual Proof</strong>: The new dashboard provides a comprehensive “Usage Overview” including total requests, success rates, and even a breakdown of JavaScript vs. regular requests.</li><li><strong>Unified Product Access</strong>: The new dashboard brings all Crawlbase products and custom scrapers into one unified view, including the <a href="https://crawlbase.com/crawling-api-avoid-captchas-blocks">Crawling API</a>, <a href="https://crawlbase.com/anonymous-crawler-asynchronous-scraping">Enterprise Crawler</a>, <a href="https://crawlbase.com/smart-proxy">Smart AI Proxy</a>, <a href="https://crawlbase.com/cloud-storage-for-crawling-and-scraping">Cloud Storage</a>, the <a href="https://crawlbase.com/mcp">Crawlbase MCP Server</a>, and custom scrapers such as the <a href="https://crawlbase.com/facebook-scraper">Facebook scraper</a> and the <a href="https://crawlbase.com/linkedin-scraper">LinkedIn scraper</a>.</li><li><strong>Affiliate Earnings</strong>: Get full visibility and control over the earnings you generate by referring new users to Crawlbase.</li><li><strong>Support</strong>: Access the help you need based on your product queries from the new interface.</li></ul><h3 id="Available-to-Everyone"><a href="#Available-to-Everyone" class="headerlink" title="Available to Everyone"></a>Available to Everyone</h3><p>This update is fully available to both new and existing Crawlbase users. 
All users have access to these new features and the improved interface at the same time. In addition, you can visit our tutorials from the dashboard to learn practical data extraction methods.</p><h3 id="Get-Started-Today"><a href="#Get-Started-Today" class="headerlink" title="Get Started Today"></a>Get Started Today</h3><p>We want you to feel right at home in your new workspace. The best way to get started is to take a tour of the new interface to see where all your favorite tools have moved and explore the new monitoring capabilities. Head to your <a href="https://crawlbase.com/signup?signup=blog">Crawlbase Dashboard</a>.</p>]]></content>
    
    
    <summary type="html">&lt;p&gt;We are excited to introduce the completely redesigned &lt;a href=&quot;https://crawlbase.com/signup?signup=blog&quot;&gt;Crawlbase Dashboard&lt;/a&gt;. This upgrade offers more than just a new look. It provides a fresh experience that changes how you manage your crawling projects, track usage, and scale your data operations confidently.&lt;/p&gt;</summary>
    
    
    
    <category term="advanced web scraping tutorials" scheme="https://crawlbase.com/blog/categories/advanced-web-scraping-tutorials/"/>
    
    
    <category term="new crawlbase dashboard" scheme="https://crawlbase.com/blog/tags/new-crawlbase-dashboard/"/>
    
    <category term="web scraping dashboard" scheme="https://crawlbase.com/blog/tags/web-scraping-dashboard/"/>
    
    <category term="crawling api management" scheme="https://crawlbase.com/blog/tags/crawling-api-management/"/>
    
    <category term="data extraction platform" scheme="https://crawlbase.com/blog/tags/data-extraction-platform/"/>
    
    <category term="web crawler interface" scheme="https://crawlbase.com/blog/tags/web-crawler-interface/"/>
    
    <category term="api usage monitoring" scheme="https://crawlbase.com/blog/tags/api-usage-monitoring/"/>
    
  </entry>
  
  <entry>
    <title>Best Proxy + Scraping API Stack for Startups in 2026</title>
    <link href="https://crawlbase.com/blog/best-proxy-scraping-api-for-startups/"/>
    <id>https://crawlbase.com/blog/best-proxy-scraping-api-for-startups/</id>
    <published>2026-02-04T13:22:32.000Z</published>
    <updated>2026-04-24T11:53:22.907Z</updated>
    
    <content type="html"><![CDATA[<p>If your business depends on web data, your web scraping stack matters more than most teams expect. The wrong setup looks fine at first, then collapses under real traffic and scrutiny. The right setup stays stable as volume grows, costs stay predictable, and your engineers stay focused on product work.</p><span id="more"></span><p>For most businesses and especially startups, the best proxy + scraping API stack is:</p><p>Python (or your preferred language) + <a href="https://crawlbase.com/signup?signup=blog">Crawlbase</a>.</p><p>Crawlbase beats alternatives because it starts at $3&#x2F;1K requests (vs. $49&#x2F;month minimums elsewhere), integrates in 5 minutes, and scales without rebuilding your stack. You get proxy rotation, JavaScript rendering, anti-bot handling, and retries, without DIY infrastructure or enterprise pricing.</p><div class="secondary-cta-banner">  <div class="gradient-bg">    <h3 class="banner-title">Get Started with 1,000 Free Requests</h3>    <p class="banner-desc">Try our <strong class="text-underline">Crawling API</strong>  to automate your data collection — used by 70k+ dev teams</p>    <div class="banner-features">      <ul class="features-list">        <li>Handles JS heavy websites</li>        <li>Built-in proxy rotation</li>        <li>No credit card needed</li>      </ul>      <a class="banner-btn" href="/signup?signup=blog-smart-cta" title="Get Started Now!" 
onclick="gtag('event', 'smart_cta_click', { 'blog_group': 'crawling_api', 'blog_slug': 'best-proxy-scraping-api-for-startups', 'cta_type': 'try_crawling_api', 'cta_position': 'top','cta_version': 'crawling_api_v2', 'page_location': 'https://crawlbase.com/blog/best-proxy-scraping-api-for-startups/', 'page_title': 'Best Proxy + Scraping API Stack for Startups in 2026' });">Get Started Now!</a>    </div>  </div>  </div><h2 id="Why-Most-Scraping-Setups-Fail-at-Scale"><a href="#Why-Most-Scraping-Setups-Fail-at-Scale" class="headerlink" title="Why Most Scraping Setups Fail at Scale"></a>Why Most Scraping Setups Fail at Scale</h2><p>Most teams start with the simplest approach:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">import</span> requests</span><br><span class="line"><span class="keyword">from</span> bs4 <span class="keyword">import</span> BeautifulSoup</span><br><span class="line"></span><br><span class="line">response = requests.get(<span class="string">&quot;https://example.com&quot;</span>)</span><br><span class="line">soup = BeautifulSoup(response.text, <span class="string">&quot;html.parser&quot;</span>)</span><br><span class="line"><span class="built_in">print</span>(soup.find(<span class="string">&quot;h1&quot;</span>).text)</span><br></pre></td></tr></table></figure><p>It looks fine until you increase the volume. 
Once you scale beyond ~10,000 requests&#x2F;day, the same <a href="https://www.akamai.com/blog/security/the-web-scraping-problem-part-1">scraping problems</a> show up almost every time:</p><ul><li>IP bans after repeated requests</li><li>CAPTCHAs and challenge pages</li><li>JavaScript-heavy websites where HTML is incomplete without rendering</li><li>rate limiting and throttling</li><li>unstable success rates that break data pipelines</li><li>infrastructure overhead (proxies, browsers, retries, monitoring)</li></ul><p>At that point, scraping stops being a “small feature” and becomes an ongoing engineering cost.</p><h2 id="What’s-Included-in-Crawlbase’s-Web-Scraping-Stack"><a href="#What’s-Included-in-Crawlbase’s-Web-Scraping-Stack" class="headerlink" title="What’s Included in Crawlbase’s Web Scraping Stack"></a>What’s Included in Crawlbase’s Web Scraping Stack</h2><p>Crawlbase replaces the complicated parts of scraping with one API call. Instead of stitching together multiple tools, you get a single startup-friendly setup that’s fast to integrate and easy to scale.</p><table><thead><tr><th>Layer</th><th>Purpose</th><th>DIY Approach</th><th>Crawlbase Approach</th></tr></thead><tbody><tr><td>Rotating Proxies</td><td>Avoid IP bans by distributing requests across millions of IPs</td><td>Rent proxy pools, manage rotation logic</td><td>140M residential + 98M datacenter proxies included</td></tr><tr><td>Browser Rendering</td><td>Execute JavaScript to scrape dynamic content</td><td>Run Puppeteer&#x2F;Selenium clusters</td><td>Use JavaScript token or create a JavaScript Crawler</td></tr><tr><td>Anti-Bot Bypass</td><td>Solve CAPTCHAs and bypass detection</td><td>Integrate CAPTCHA solving APIs</td><td>Automatic bypass included</td></tr><tr><td>Retry Logic</td><td>Handle failures gracefully</td><td>Write custom retry code</td><td>Automatic with exponential backoff (Enterprise Crawler)</td></tr><tr><td>API Abstraction</td><td>Simple integration</td><td>Build and maintain your own 
API wrapper</td><td>Clean REST API, 5-minute setup</td></tr></tbody></table><p>In practice, scraping is not a single problem but a stack of challenges that must be handled together. Modern websites apply multiple layers of defenses and rendering logic. Crawlbase works well because it addresses these layers as a unified system rather than leaving teams to work around each issue independently.</p><div class="callout-banner">  <div class="banner-header">    <img src="/blog/images/flashlight-icon-blue.png" srcset="/blog/images/flashlight-icon-blue.png 1x, /blog/images/flashlight-icon-blue@2x.png 2x" alt="Flashlight Icon"/>    <h2 class="banner-header-label">Scale Without the Speed Wobbles</h2>  </div>  <p class="banner-body">In recent benchmarks, Crawlbase maintained consistent response times even as request volume quintupled. Whether you're running 2 or 10 req/s, we provide the steady performance your data pipeline needs.</p>  <div class="banner-footer">   <a href="https://crawlbase.com/signup?signup=blog-callout-cta" title="Build a Scalable Scraper">Build a Scalable Scraper</a>    <img src="/blog/images/arrow-right-double-green.png" srcset="/blog/images/arrow-right-double-green.png 1x, /blog/images/arrow-right-double-green@2x.png 2x" alt="Arrow right double Icon"/>  </div></div><h2 id="Crawlbase-Pricing-What-You-Actually-Pay"><a href="#Crawlbase-Pricing-What-You-Actually-Pay" class="headerlink" title="Crawlbase Pricing: What You Actually Pay"></a>Crawlbase Pricing: What You Actually Pay</h2><p>A common mistake is thinking that the web scraping cost is only “proxy cost.” In reality, businesses pay for:</p><ul><li>proxy pool subscriptions</li><li>headless browser compute</li><li>CAPTCHA solving services</li><li>developer time spent debugging blocks and failures</li><li>lost data from failed scrapes and reruns</li></ul><p>Crawlbase is cost-effective because it reduces these hidden costs and keeps usage predictable.</p><p>Key reasons it works for startups and 
businesses:</p><ul><li>Request-based pricing that’s easy to budget</li><li>No separate proxy vendor to manage</li><li>No browser cluster required for most use cases</li><li>Less engineering time wasted on scraping maintenance</li></ul><p>Exact pricing and ROI depend on your workload, but these figures give a rough guide:</p><ul><li><a href="https://crawlbase.com/pricing">Crawlbase’s pricing</a> starts at $3.00 per 1,000 requests and drops to as low as $0.02 per 1,000 at high volumes</li><li>Estimated savings vs DIY: $2,000-$6,000 per month</li><li>Reduced maintenance: 30-60 engineering hours per month</li></ul><p>For most startups, the real benefit is not just lower infrastructure spend, but fewer engineering hours lost to maintaining scraping systems that are not core to the product.</p><p>Shifting proxy management, browser rendering, retries, and anti-bot handling to Crawlbase can keep costs predictable while redirecting time and budget toward building features that actually drive revenue.</p><h2 id="How-to-Integrate-Crawlbase-5-Minute-Setup"><a href="#How-to-Integrate-Crawlbase-5-Minute-Setup" class="headerlink" title="How to Integrate Crawlbase (5-Minute Setup)"></a>How to Integrate Crawlbase (5-Minute Setup)</h2><p>Integration is intentionally simple. 
A basic request looks like this:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">import</span> requests</span><br><span class="line">response = requests.get(</span><br><span class="line">    <span class="string">&quot;https://api.crawlbase.com/&quot;</span>,</span><br><span class="line">    params=&#123;<span class="string">&quot;token&quot;</span>: <span class="string">&quot;YOUR_TOKEN&quot;</span>, <span class="string">&quot;url&quot;</span>: <span class="string">&quot;https://target-site.com&quot;</span>&#125;</span><br><span class="line">)</span><br><span class="line"><span class="built_in">print</span>(response.text)</span><br></pre></td></tr></table></figure><p>That is enough to start pulling HTML reliably without managing proxies or retries yourself.</p><p>Crawlbase also provides free-to-use <a href="https://crawlbase.com/dashboard/api/libraries">libraries and SDKs</a> (no extra cost) for common languages and tools, including:</p><ul><li>Node.js</li><li>PHP</li><li><a href="https://www.python.org/">Python</a></li><li>Ruby</li><li>.NET</li><li>Java</li><li>Scrapy middleware</li><li>Zapier Create Hook</li></ul><p>This makes Crawlbase practical for startups because your team can integrate it into the stack you already use, with minimal extra code and setup.</p><h2 id="Scaling-from-1K-to-1M-Requests-with-Crawlbase"><a href="#Scaling-from-1K-to-1M-Requests-with-Crawlbase" class="headerlink" title="Scaling from 1K to 1M+ Requests with Crawlbase"></a>Scaling from 1K to 1M+ Requests with Crawlbase</h2><p>Crawlbase is built to scale with your business, from early-stage use cases to large-volume production workloads.</p><h3 id="Crawlbase-Crawling-API-small-to-large-scale"><a 
href="#Crawlbase-Crawling-API-small-to-large-scale" class="headerlink" title="Crawlbase Crawling API (small to large scale)"></a>Crawlbase Crawling API (small to large scale)</h3><p>The <a href="https://crawlbase.com/crawling-api-avoid-captchas-blocks">Crawling API</a> is ideal when you need:</p><ul><li>simple per-request scraping</li><li>fast integration</li><li>predictable usage-based cost</li><li>support for both static and JavaScript-heavy pages</li></ul><p>This is the best starting point for startups and most business scraping workflows.</p><h3 id="Crawlbase-Enterprise-Crawler-large-scale"><a href="#Crawlbase-Enterprise-Crawler-large-scale" class="headerlink" title="Crawlbase Enterprise Crawler (large scale)"></a>Crawlbase Enterprise Crawler (large scale)</h3><p>When you need to scrape at a very high volume, Crawlbase also offers the <a href="https://crawlbase.com/anonymous-crawler-asynchronous-scraping">Enterprise Crawler</a>, designed for:</p><ul><li>high concurrency crawling</li><li>asynchronous processing (ideal for large jobs)</li><li>handling large URL batches efficiently</li><li>long-running crawls without babysitting infrastructure</li></ul><p>This is a common upgrade path for startups once they move from “scrape a few pages” to “scrape millions of pages reliably.”</p><h2 id="Crawlbase-vs-ScraperAPI-Oxylabs-ScrapingBee-and-Apify"><a href="#Crawlbase-vs-ScraperAPI-Oxylabs-ScrapingBee-and-Apify" class="headerlink" title="Crawlbase vs ScraperAPI, Oxylabs, ScrapingBee, and Apify"></a>Crawlbase vs ScraperAPI, Oxylabs, ScrapingBee, and Apify</h2><p>If your goal is a startup-friendly scraping stack, the decision should be driven by three practical factors:</p><ul><li><strong>Setup time -</strong> how quickly your team can go from zero to production</li><li><strong>Cost predictability -</strong> how easy it is to forecast monthly spend</li><li><strong>Scalability -</strong> whether the solution grows with your product without a rebuild</li></ul><p>Many 
scraping tools work well in isolation, but not all of them are optimized for startups with limited budgets and engineering bandwidth. The table below compares Crawlbase with common alternatives through that lens.</p><table><thead><tr><th>Solution</th><th>Starting Price</th><th>Cost Tradeoffs</th><th>Strengths</th><th>Best For</th><th>Startup-Friendly?</th></tr></thead><tbody><tr><td><a href="https://crawlbase.com/pricing">Crawlbase</a></td><td>$3.00&#x2F;1K requests, down to $0.02&#x2F;1K at high volumes</td><td>May increase depending on target website complexity</td><td>Cost-effective, easy integration, scalable, low setup overhead</td><td>Startups and businesses needing reliable scraping</td><td>Yes</td></tr><tr><td><a href="https://crawlbase.com/blog/best-scraperapi-alternative/">ScraperAPI</a></td><td>$49&#x2F;month</td><td>Subscription-based, high entry cost</td><td>Easy integration, managed proxies, JS rendering</td><td>Simple scraping API with minimal setup</td><td>Maybe</td></tr><tr><td><a href="https://crawlbase.com/blog/oxylabs-alternative-for-web-scraping/">Oxylabs</a></td><td>$49&#x2F;month</td><td>Subscription-based, high entry cost</td><td>Extensive proxy infrastructure with a large global IP pool</td><td>Businesses and enterprises needing advanced proxy solutions</td><td>No</td></tr><tr><td><a href="https://crawlbase.com/blog/scrapingbee-alternative-for-web-scraping/">ScrapingBee</a></td><td>$49&#x2F;month</td><td>Subscription-based, high entry cost</td><td>Easy setup, documentation</td><td>Simple to moderate scraping projects with dynamic pages</td><td>Maybe</td></tr><tr><td><a href="https://crawlbase.com/blog/apify-alternative-for-web-scraping/">Apify</a></td><td>$0.40&#x2F;CU</td><td>Difficult to estimate “per Compute Unit”</td><td>Flexible actors and workflows</td><td>Teams needing customizable scraping workflows</td><td>Maybe</td></tr></tbody></table><ul><li><strong>Crawlbase</strong> is optimized for startups and enterprise teams because pricing 
scales with usage, setup takes minutes, and there is no need to manage proxies, browsers, or retries. This keeps both engineering effort and costs low.</li><li><strong>ScraperAPI</strong> and <strong>ScrapingBee</strong> are easy to integrate, but their subscription-based pricing can be inefficient for early-stage startups or variable workloads.</li><li><strong>Oxylabs</strong> excels at proxy infrastructure, but its pricing and complexity are better suited to enterprise teams.</li><li><strong>Apify</strong> is powerful for automation-heavy workflows, but cost predictability can be challenging when scraping volume grows.</li></ul><h2 id="Final-Verdict-Why-Crawlbase-Is-Startup-Friendly"><a href="#Final-Verdict-Why-Crawlbase-Is-Startup-Friendly" class="headerlink" title="Final Verdict: Why Crawlbase Is Startup-Friendly"></a>Final Verdict: Why Crawlbase Is Startup-Friendly</h2><p>For businesses that need web data, Crawlbase is one of the most practical stacks you can adopt. For startups, it’s even more valuable because it removes the two biggest constraints:</p><ul><li><strong>Low budget -</strong> You avoid proxy infrastructure overhead, reduce wasted spend, and keep costs predictable.</li><li><strong>Low setup overhead -</strong> You integrate quickly, ship faster, and avoid spending weeks building scraping infrastructure.</li></ul><p>Crawlbase is startup-friendly because you can:</p><ul><li>Start small with the Crawling API</li><li>Scale reliably as volume grows</li><li>Move to the Enterprise Crawler for high concurrency and large-volume asynchronous crawling</li></ul><p><a href="https://crawlbase.com/signup?signup=blog">Create a Crawlbase account</a> now if you want a scraping stack that works today and still works when your business scales.</p><h2 id="Frequently-Asked-Questions"><a href="#Frequently-Asked-Questions" class="headerlink" title="Frequently Asked Questions"></a>Frequently Asked Questions</h2><h3 
id="Q-When-does-DIY-scraping-stop-being-practical-for-startups"><a href="#Q-When-does-DIY-scraping-stop-being-practical-for-startups" class="headerlink" title="Q. When does DIY scraping stop being practical for startups?"></a>Q. When does DIY scraping stop being practical for startups?</h3><p>DIY scraping usually becomes unreliable once usage reaches ~10,000 requests per day. At that point, IP bans, CAPTCHAs, incomplete JavaScript-rendered pages, and rate limiting start appearing consistently. Modern websites actively deploy bot mitigation, which makes simple request-based scrapers hard to maintain at scale.</p><h3 id="Q-Do-I-need-to-manage-proxies-browsers-or-CAPTCHA-solvers-with-Crawlbase"><a href="#Q-Do-I-need-to-manage-proxies-browsers-or-CAPTCHA-solvers-with-Crawlbase" class="headerlink" title="Q. Do I need to manage proxies, browsers, or CAPTCHA solvers with Crawlbase?"></a>Q. Do I need to manage proxies, browsers, or CAPTCHA solvers with Crawlbase?</h3><p>No. Crawlbase handles proxy rotation, JavaScript execution, and anti-bot challenges automatically, and the Enterprise Crawler adds automatic retries. This is important because many websites rely on client-side JavaScript execution to generate the final <a href="https://developer.mozilla.org/en-US/docs/Glossary/DOM">DOM</a>, not just static HTML.</p><h3 id="Q-How-does-Crawlbase-scale-from-small-projects-to-large-volumes"><a href="#Q-How-does-Crawlbase-scale-from-small-projects-to-large-volumes" class="headerlink" title="Q. How does Crawlbase scale from small projects to large volumes?"></a>Q. How does Crawlbase scale from small projects to large volumes?</h3><p>Most startups begin with the <strong>Crawling API</strong> for per-request scraping. As volume grows, the <strong>Enterprise Crawler</strong> supports high concurrency and asynchronous jobs without requiring a rebuild. This lets teams scale from thousands to millions or even billions of requests using the same stack.</p>]]></content>
    
    
    <summary type="html">&lt;p&gt;If your business depends on web data, your web scraping stack matters more than most teams expect. The wrong setup looks fine at first, then collapses under real traffic and scrutiny. The right setup stays stable as volume grows, costs stay predictable, and your engineers stay focused on product work.&lt;/p&gt;</summary>
    
    
    
    <category term="crawling and scraping learning" scheme="https://crawlbase.com/blog/categories/crawling-and-scraping-learning/"/>
    
    
    <category term="best proxy 2026" scheme="https://crawlbase.com/blog/tags/best-proxy-2026/"/>
    
    <category term="best scraping api" scheme="https://crawlbase.com/blog/tags/best-scraping-api/"/>
    
  </entry>
  
</feed>
