Web Data for AI and ML Training

Every artificial intelligence and machine learning system is, at bottom, an argument from data. A model does not reason from first principles; it generalizes from examples it has already seen. Change the examples and you change the behavior. That is why the most consequential decision in any AI project is rarely the architecture or the optimizer; it is what goes into the training set and where that data comes from.

For a long stretch the answer was "a few clean academic datasets." That era is over. The large language models and modern ML systems people use every day were trained on enormous corpora assembled from the open web: articles, product catalogs, forums, documentation, reviews, code. If you want to build something competitive, you eventually have to collect web data yourself, at scale, and turn it into training rows. This article explains where AI and ML data actually comes from, why the web is the dominant source, and how to collect it reliably with the Crawling API and Smart AI Proxy instead of fighting blocks by hand.

AI, ML, and the dependency they share

It helps to be precise about the two terms, because they get used interchangeably and they are not the same thing. Artificial intelligence is the broad goal: systems that perform tasks we associate with human intelligence, from answering questions to driving a car. Machine learning is the dominant method for getting there. Rather than hand-coding rules for every situation, you give an algorithm a large set of examples and let it learn the patterns that map inputs to outputs.

Machine learning splits along how those examples are labeled. Supervised learning trains on labeled pairs (an email and the tag "spam," a product page and its category) and learns to reproduce the labels on new inputs. Unsupervised learning gets no labels and instead finds structure on its own, clustering similar items or compressing data into useful representations. The large language models behind today's AI boom lean heavily on a self-supervised variant: the model learns by predicting masked or next tokens in raw text, which means the text itself supplies the signal and you need a lot of it.
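
To make the self-supervised idea concrete, here is a minimal sketch of how raw text can supply its own training signal. The function name next_token_pairs and the whitespace tokenizer are illustrative assumptions; real language models use subword tokenizers and far longer contexts.

```python
def next_token_pairs(text, context_size=3):
    """Turn raw text into (context, target) pairs without any human labels."""
    tokens = text.split()  # assumption: whitespace tokens stand in for subword tokens
    pairs = []
    for i in range(context_size, len(tokens)):
        context = tokens[i - context_size:i]  # the words the model sees
        target = tokens[i]                    # the word it must predict
        pairs.append((context, target))
    return pairs


pairs = next_token_pairs("the cat sat on the mat")
for context, target in pairs:
    print(context, "->", target)
```

Every sentence yields several training pairs for free, which is why the amount of raw text you can collect, rather than the amount you can label, becomes the limiting factor.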

The one constant

Supervised, unsupervised, or self-supervised, every approach has the same dependency: a large, representative, current dataset. The algorithm is mostly a fixed recipe. The data is the variable that decides whether the result is sharp or useless, which is why "where do I get the data" is the real engineering question behind most AI and ML work.

Why models live or die on their data

It is tempting to treat the model as the clever part and the data as plumbing. In practice the ratio is reversed. Three properties of a dataset shape almost everything downstream.

Volume. Modern models have millions to billions of parameters, and each parameter is a degree of freedom that needs evidence to set correctly. Too little data and the model memorizes its examples instead of generalizing, the failure mode known as overfitting. Web-scale corpora matter because the web is one of the few places where you can find enough varied text and structured records to feed that appetite.

Freshness. A model trained on a snapshot from three years ago believes the world looks the way it did three years ago. Prices, product lines, slang, news, and code idioms all drift. If your application reasons about the current world, your training and evaluation data has to be collected from the current web, not pulled from a stale archive.

Representativeness. A model only learns the distribution it sees. Scrape one retailer and your price model knows that retailer; scrape twenty and it learns the market. Bias in equals bias out, so the breadth of your collection directly bounds how well the model generalizes beyond the slice you happened to gather.

Notice that all three properties are collection problems, not modeling problems. You cannot optimize your way out of a dataset that is too small, too old, or too narrow. That is why teams building serious AI and ML systems spend so much of their time on data acquisition, and why the web crawling layer underneath deserves real attention.

Where AI and ML data comes from

Training data for machine learning arrives through a handful of channels, and most real projects mix several.

Curated public datasets (ImageNet, Common Crawl, open government data) are a fine starting point and cost nothing, but everyone else trains on them too, so they rarely give you an edge and they are often out of date. Manual labeling and in-house collection produce high-quality, task-specific data, but they are slow and expensive and do not scale to web-sized corpora. Licensed data from providers fills specific gaps when budget allows.

For most teams the decisive source is the open web itself, gathered through web scraping. The web is the largest, freshest, and most diverse body of text and structured records in existence, and it covers nearly every domain you might want to model: e-commerce listings for price intelligence, reviews for sentiment analysis, job postings for labor-market models, documentation and forum threads for code and Q&A systems. The catch is that collecting it reliably is harder than it looks, which is the gap the next section addresses.

The hard part: collecting web data at scale

Writing a script that fetches one page is trivial. Writing one that fetches a million pages across dozens of sites, week after week, without grinding to a halt is a different problem. Three obstacles show up almost immediately.

First, anti-bot defenses. Commercial sites watch for automated traffic and respond with CAPTCHAs, rate limits, and IP bans. A naive scraper from a single datacenter IP gets flagged within minutes, and a dataset that stops filling halfway through is no dataset at all.

Second, client-side rendering. A large share of the modern web builds its content in the browser with JavaScript. A plain requests.get returns the HTML shell with none of the data you came for, so you need something that actually runs the page like a real browser before you read it.

Third, scale and reliability. Proxies need rotating, failures need retrying, and the whole pipeline has to keep running unattended. Building and babysitting that infrastructure is most of the work, and it has nothing to do with the model you actually want to train.
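
To give a sense of the plumbing involved, retrying alone can be sketched as below. This is a generic exponential-backoff helper, not part of any Crawlbase client: the names fetch_with_retry and flaky_fetch are hypothetical, and the sketch assumes the fetch function signals failure by raising RuntimeError.

```python
import random
import time


def fetch_with_retry(fetch, url, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url); after each failure wait longer (1s, 2s, 4s, ...) and try again."""
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except RuntimeError as err:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # Exponential backoff plus a little jitter so workers do not retry in lockstep.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            print(f"Attempt {attempt + 1} failed ({err}); retrying in {delay:.1f}s")
            sleep(delay)


# Hypothetical fetcher that fails twice before succeeding, to exercise the helper.
calls = {"count": 0}

def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("HTTP 429")
    return f"<html>content of {url}</html>"


html = fetch_with_retry(flaky_fetch, "https://example.com/page", sleep=lambda s: None)
print(html, "after", calls["count"], "attempts")
```

Even this small helper leaves out proxy rotation, per-site rate limits, and persistent queues, which is where most of the real maintenance effort goes.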

This is exactly where Crawlbase fits. The Crawling API takes a URL, fetches it through rotating residential IPs, renders the JavaScript in a real browser, and hands back finished HTML or LLM-ready markdown in a single call. The Smart AI Proxy (also called the AI Proxy) gives you the same residential rotation as a drop-in proxy endpoint when you would rather route your own client's traffic. Either way, the blocking, rendering, and IP management stop being your problem, and you can spend your effort on the dataset.

Building a training dataset: a runnable example

Concepts are easier to trust when you can run them. The script below collects a small structured dataset from a public bookstore test site, the kind of catalog data you might feed a price or category model. It fetches each page through the Crawling API, pulls a few fields, and writes one row per item to both CSV and JSON so the result drops straight into a training pipeline.

You need Python 3.8 or later and the official Crawlbase client. Install it, and grab a token from your Crawlbase dashboard after signing up.

bash

python -m venv ml_data_env
source ml_data_env/bin/activate

pip install crawlbase beautifulsoup4

On Windows, activate the environment with ml_data_env\Scripts\activate instead of the source line. Now the collector. It crawls a few catalog pages, extracts the fields with BeautifulSoup, and saves clean rows.

python

import csv
import json
from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup

api = CrawlingAPI({"token": "YOUR_CRAWLBASE_TOKEN"})

pages = [
    "https://books.toscrape.com/catalogue/page-1.html",
    "https://books.toscrape.com/catalogue/page-2.html",
    "https://books.toscrape.com/catalogue/page-3.html",
]

def fetch_html(url):
    # ajax_wait and page_wait give client-rendered pages time to finish loading.
    options = {"ajax_wait": "true", "page_wait": 2000}
    response = api.get(url, options)
    # Fail loudly so a blocked or missing page never becomes a silently empty row.
    if response["status_code"] != 200:
        raise RuntimeError(f"Fetch failed for {url}: {response['status_code']}")
    return response["body"].decode("utf-8")

def parse_rows(html):
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # Each book on a catalog page is one <article class="product_pod"> card.
    for card in soup.select("article.product_pod"):
        title = card.select_one("h3 a")["title"].strip()
        price = card.select_one(".price_color").text.strip()
        # The rating is stored as the second CSS class, e.g. "star-rating Three".
        rating = card.select_one(".star-rating")["class"][1]
        in_stock = "In stock" in card.select_one(".availability").text
        rows.append({
            "title": title,
            "price": price,
            "rating": rating,
            "in_stock": in_stock,
        })
    return rows

def build_dataset():
    dataset = []
    for url in pages:
        try:
            dataset.extend(parse_rows(fetch_html(url)))
        except RuntimeError as err:
            print(f"Skipping page: {err}")
    return dataset

data = build_dataset()
print(f"Collected {len(data)} rows")

The two wait options keep the same code working when you point it at a client-rendered target: ajax_wait holds for asynchronous content and page_wait pauses a fixed number of milliseconds so late elements appear before capture. The status-code check turns a failed fetch into a clear error instead of a silently empty row, which matters when a run spans thousands of pages and you need to trust what landed in the dataset.

Saving the rows for training

A dataset is only useful in a format your training code can read. CSV suits tabular models and quick inspection in a spreadsheet; JSON suits nested records and most Python pipelines. Write both so the same collection feeds either path.

python

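def save_csv(rows, path="training_data.csv"):
    # With no rows there is nothing to write and no header to infer.
    if not rows:
        return
    # Explicit UTF-8 keeps characters such as "£" intact on every platform.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)

def save_json(rows, path="training_data.json"):
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False writes "£" literally instead of as a \u00a3 escape.
        json.dump(rows, f, indent=2, ensure_ascii=False)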

save_csv(data)
save_json(data)
print("Saved training_data.csv and training_data.json")

Run the two snippets together and you get consistent, structured rows ready for the cleaning step. The structure is deliberately boring: one record per item, consistent keys, no surprises. That is what training code wants.

json

[
  {
    "title": "A Light in the Attic",
    "price": "£51.77",
    "rating": "Three",
    "in_stock": true
  }
]

Collect, then clean

Raw scraped rows are not training-ready on their own. Before a model touches them you normalize types (turn that price string into a number), drop duplicates, handle missing fields, and balance categories so the set is representative. The collection step here gets you reliable raw data; the next step is shaping it, which we cover in structuring and cleaning web-scraped data for AI and ML.
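
As a rough illustration of that shaping step, the sketch below normalizes the rows produced by the collector. The helper names parse_price and clean_rows and the RATING_WORDS mapping are assumptions made for this example. It also assumes prices use a dot as the decimal separator, and it simply drops rows with missing fields, which may not suit every dataset.

```python
import re
from collections import Counter

# Assumption: ratings arrive as the English words used in the scraped CSS class.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def parse_price(text):
    """Strip currency symbols and separators; return a float, or None if unparseable."""
    digits = re.sub(r"[^\d.]", "", text or "")
    try:
        return float(digits)
    except ValueError:
        return None


def clean_rows(rows):
    """Normalize types, skip rows with missing fields, and drop duplicates."""
    cleaned, seen = [], set()
    for row in rows:
        title = (row.get("title") or "").strip()
        price = parse_price(row.get("price"))
        rating = RATING_WORDS.get(row.get("rating"))
        if not title or price is None or rating is None:
            continue  # missing or unparseable field: skip rather than guess
        key = (title, price)
        if key in seen:
            continue  # the same item was collected twice
        seen.add(key)
        cleaned.append({
            "title": title,
            "price": price,
            "rating": rating,
            "in_stock": bool(row.get("in_stock")),
        })
    return cleaned


raw = [
    {"title": "A Light in the Attic", "price": "£51.77", "rating": "Three", "in_stock": True},
    {"title": "A Light in the Attic", "price": "£51.77", "rating": "Three", "in_stock": True},
    {"title": "Untitled", "price": "", "rating": "Two", "in_stock": False},
]
cleaned = clean_rows(raw)
print(cleaned)
# A quick look at class balance before deciding whether to resample.
print(Counter(row["rating"] for row in cleaned))
```

The rating counts at the end are only a diagnostic; whether to rebalance by resampling or by collecting more pages depends on the model you are training.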

From one script to a real pipeline

The example above is a demo. A production dataset for artificial intelligence and machine learning runs over thousands or millions of URLs, refreshes on a schedule, and cannot afford to stall. Three moves get you there.

Go asynchronous for volume. Fetching pages one at a time caps your throughput at the speed of a single round trip. For large jobs, push URLs to the async Crawler, which queues the work and pushes finished pages to a webhook as they complete, so you collect at scale without managing a fetch loop yourself.

Use pre-parsed output where it exists. When you collect from well-known sites repeatedly (a major retailer, a job board), the Crawling API returns structured JSON for supported targets directly, so you skip writing and maintaining selectors. For odd or one-off layouts, the BeautifulSoup approach above stays the flexible fallback. See web scraping for machine learning for more on matching the collection method to the model.

Treat freshness as a job, not a one-off. Because models drift as the world changes, schedule recurring crawls and append new rows rather than scraping once and forgetting. A dataset that updates weekly keeps your model current with current prices, current language, and current behavior.
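
The first and third moves can be sketched locally. The example below uses a thread pool as a stand-in for concurrent fetching (it is not the async Crawler, which queues work on the service side) and appends only unseen rows to a CSV, stamping each with its collection time. The names fetch_rows, crawl_concurrently, and append_new_rows are hypothetical, the example.com URLs are placeholders, and fetch_rows returns canned data so the sketch runs offline.

```python
import csv
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timezone


def fetch_rows(url):
    """Hypothetical stand-in for a real fetch-and-parse step; returns canned rows."""
    page = url.rsplit("/", 1)[-1]
    return [{"title": f"Book from {page}", "price": "£10.00"}]


def crawl_concurrently(urls, max_workers=4):
    """Fetch several URLs in parallel threads and flatten the per-page results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        batches = list(pool.map(fetch_rows, urls))
    return [row for batch in batches for row in batch]


def append_new_rows(rows, path, key_fields=("title", "price")):
    """Append rows not yet in the CSV (a changed price counts as new); return the count."""
    has_data = os.path.exists(path) and os.path.getsize(path) > 0
    seen = set()
    if has_data:
        with open(path, newline="", encoding="utf-8") as f:
            seen = {tuple(rec[k] for k in key_fields) for rec in csv.DictReader(f)}
    stamp = datetime.now(timezone.utc).isoformat()
    fresh = []
    for row in rows:
        key = tuple(row[k] for k in key_fields)
        if key not in seen:
            seen.add(key)
            fresh.append({**row, "collected_at": stamp})
    if not fresh:
        return 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(fresh[0].keys()))
        if not has_data:
            writer.writeheader()  # first run: write the header once
        writer.writerows(fresh)
    return len(fresh)


urls = [f"https://example.com/catalogue/page-{n}.html" for n in range(1, 4)]
path = os.path.join(tempfile.mkdtemp(), "training_data.csv")
print("first run added:", append_new_rows(crawl_concurrently(urls), path))
print("second run added:", append_new_rows(crawl_concurrently(urls), path))
```

Running the last two lines from a scheduler (cron, a workflow tool, or similar) would turn this into a recurring refresh. Keying on title and price together means a price change is recorded as a new observation instead of being discarded as a duplicate.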

If you are wiring an AI agent to pull live web context at inference time rather than only at training time, the Web MCP exposes the same fetching and parsing through the Model Context Protocol so a model can request pages directly. For very high volume or custom routing and SLAs, the enterprise tier covers dedicated throughput.

Recap

Key takeaways

Data is the product. In artificial intelligence and machine learning the algorithm is mostly fixed; the dataset is the variable that decides whether the model is sharp or useless.
Volume, freshness, and breadth are collection problems. You cannot optimize your way out of a dataset that is too small, too old, or too narrow, so the crawling layer deserves real attention.
The web is the dominant source. Curated datasets and manual labeling have their place, but the open web is the largest, freshest, and most diverse body of training data, gathered through web scraping.
Reliable collection is the hard part. Anti-bot defenses, client-side rendering, and scale defeat naive scrapers; the Crawling API and Smart AI Proxy handle blocking, rendering, and IP rotation for you.
Collect, then shape. Save clean rows to CSV or JSON, then normalize types, drop duplicates, and balance categories before training.
Scale with async and schedules. Move large jobs to the async Crawler, reuse pre-parsed output where it exists, and re-crawl on a schedule to keep the dataset current.

Frequently Asked Questions (FAQs)

What is the difference between artificial intelligence and machine learning?

Artificial intelligence is the broad goal of building systems that perform tasks we associate with human intelligence. Machine learning is the dominant method for reaching that goal: instead of hand-coding rules, you train an algorithm on a large set of examples and let it learn the patterns. So machine learning is a subset of artificial intelligence, and nearly all of today's notable AI systems are built with it.

Why does web scraping matter for AI and ML?

Models learn from examples, and they need a large, fresh, and representative set of them. The open web is the biggest and most current source of that data, covering almost every domain you might want to model. Web scraping is how you turn those pages into training rows, which is why data collection is one of the most important steps in any serious AI or ML project.

How much data do I need to train a model?

It depends on the model and task: a simple classifier may learn from thousands of rows, while large language models train on billions to trillions of tokens. The general rule is that more parameters need more examples to avoid overfitting, and broader coverage produces better generalization. That appetite is exactly why teams turn to web-scale collection rather than small curated sets.

Can I just use public datasets instead of scraping?

Public datasets are a good free starting point, but they have two limits: everyone else trains on them too, so they rarely give you an edge, and they are often out of date. If your application reasons about the current world or a niche domain, you will need to collect current, task-specific data from the web yourself, often alongside the public sets.

How do I collect web data without getting blocked?

A naive scraper from one datacenter IP gets flagged quickly by anti-bot defenses. The Crawling API fetches pages through rotating residential IPs and renders JavaScript in a real browser, so collection runs keep filling without tripping CAPTCHAs and rate limits. If you would rather route your own client's traffic, the Smart AI Proxy gives you the same rotation as a drop-in endpoint. The full playbook is in how to scrape websites without getting blocked.

What do I do with the data after I scrape it?

Raw scraped rows are not training-ready. You normalize data types, remove duplicates, handle missing fields, and balance categories so the set is representative, then split it into training and evaluation portions. Saving clean rows to CSV or JSON first, as in the example above, gives you a stable base to clean and shape for the model.
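
The final split can be as simple as the sketch below. The function name split_rows, the 80/20 ratio, and the fixed seed are illustrative choices, not recommendations for every task.

```python
import random


def split_rows(rows, eval_fraction=0.2, seed=42):
    """Shuffle a copy of the rows reproducibly, then hold out a fraction for evaluation."""
    shuffled = list(rows)                  # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed: same split on every run
    n_eval = round(len(shuffled) * eval_fraction)
    cut = len(shuffled) - n_eval
    return shuffled[:cut], shuffled[cut:]


rows = [{"id": i} for i in range(10)]
train, evaluation = split_rows(rows)
print(len(train), "training rows,", len(evaluation), "evaluation rows")
```

Keeping the seed fixed means the evaluation rows stay the same between experiments, so a change in score reflects the model rather than a reshuffled split.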

Ola Zeaiter

Content Marketer · Crawlbase

Content marketer who covered proxies, scraping tooling, and how teams choose a data stack on the Crawlbase blog.
