A machine-learning model is only as good as the data behind it, and most of the data worth learning from lives on the public web rather than in a tidy CSV someone hands you. Product listings, prices, reviews, job posts, news, and social chatter are all generated continuously and at volume, which is exactly the kind of fresh, real-world signal a model needs. This guide shows you how to use web scraping for machine learning: why web data powers ML, how to collect it reliably at scale, how to label and structure it, and how to feed it into a training pipeline, with runnable Python at every step.

The walkthrough is scoped to public data: pages anyone can view without logging in. The Python collects HTML through the Crawlbase Crawling API, turns it into a pandas dataframe, cleans and labels the rows, and runs a basic feature-prep pass so the result is ready for a model. The goal is a repeatable collection step you can rerun on a schedule, because a dataset that goes stale is a model that quietly gets worse.

Why web data powers machine learning

Supervised models learn from examples, and the web is the largest source of examples there is. Three properties make it especially valuable for ML. It is diverse: scraping many sites gives a model the variety it needs to generalize instead of memorizing one source's quirks. It is fresh: re-running a collector keeps your training set aligned with how the world looks now, which matters most in fast-moving domains like pricing, demand, and sentiment. And it is abundant: where a hand-labeled set might run to a few thousand rows, a scraper can assemble hundreds of thousands of public records.

The catch is reliability. A one-off script that works on your laptop today is not a data source you can build a model on. Sites render content client-side, rotate their markup, and block automated traffic, so the collection layer has to be robust before anything downstream matters. That is the part that breaks ML projects in practice, and where this guide spends most of its time.

Where scraped data fits in an ML pipeline

Scraped web data shows up at several points in a project, and it helps to be clear about which one you are solving for.

  • Training data. The scraped rows become the dataset your model learns from directly, whether that is supervised, unsupervised, or semi-supervised.
  • Feature engineering. Fields you extract (text length, sentiment, price deltas, category counts) become input features that lift the predictive power of a model trained on other data.
  • Data augmentation. When a hand-labeled set is too small, scraped records expand its size and diversity so the model sees more of the space.
  • Evaluation. A freshly scraped slice held out from training is a realistic test set for checking how a model behaves on current, in-the-wild data.

The rest of this guide builds a small but complete collect-to-features pipeline you can adapt to any of those uses. For how training itself works once the data is ready, AI model training explained is a good companion read.

Why a plain fetch is not enough at scale

Collecting one page with requests is easy. Collecting a hundred thousand pages, reliably, from sites that defend against bots is where most homegrown collectors fall over. Two problems show up fast. First, many pages render their content in the browser with JavaScript, so the raw HTML you fetch is an empty shell. Second, commercial sites flag automated traffic quickly: datacenter IPs and machine-like request patterns get blocked long before you have enough rows to train on.

You can solve both yourself with a headless browser plus a pool of rotating residential proxies, but keeping that fleet healthy is most of the engineering. The Crawling API folds it into one call: you send a URL, it renders the page behind a trusted IP, rotates addresses server-side, and returns finished HTML. If a target serves clean static markup and you only want parsed fields, the Crawling API returns structured JSON directly; for raw transport with rotation you control, the Smart AI Proxy is the lower-level option. This guide uses the Crawling API because dataset collection usually spans mixed, defended sites.

Normal token vs JS token

Crawlbase offers two token types. The normal token fetches static HTML; the JavaScript (JS) token renders the page in a real browser first. If your target is client-side rendered, use the JS token; otherwise, use the normal token, which is faster and cheaper. Pick per source, not once for the whole job.
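One way to make the per-source choice explicit is a small lookup that maps a URL's host to a token. This is a minimal sketch, not part of the Crawlbase client: the host set is a hypothetical list you would maintain yourself, and CRAWLBASE_TOKEN is an assumed name for the environment variable holding the normal token.

```python
import os
from urllib.parse import urlparse

# Hypothetical: hosts you have confirmed render their content client-side.
JS_RENDERED_HOSTS = {"www.example.com"}

def token_for(url, env=os.environ, js_hosts=JS_RENDERED_HOSTS):
    """Return the token value to use for this URL, chosen by host."""
    host = urlparse(url).hostname or ""
    # CRAWLBASE_TOKEN is an assumed variable name for the normal token.
    name = "CRAWLBASE_JS_TOKEN" if host in js_hosts else "CRAWLBASE_TOKEN"
    return env[name]
```

Passing the environment in as an argument keeps the function easy to test with a plain dictionary.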

Set up the project

You need Python 3 and pip installed. Confirm both, then create a project and install the libraries the pipeline uses.

bash
python --version
pip --version

mkdir ml-dataset && cd ml-dataset
python -m venv .venv && source .venv/bin/activate
pip install crawlbase beautifulsoup4 pandas scikit-learn

Four dependencies do the work: crawlbase is the client for the Crawling API, beautifulsoup4 parses the returned HTML, pandas holds the dataset as a dataframe, and scikit-learn handles the feature-prep at the end. You also need a Crawlbase account and a token, which you get from the dashboard after signing up. Keep it in an environment variable rather than hardcoding it.

Step 1: Collect pages via the Crawling API

Start with the collection layer, because everything downstream depends on it returning clean HTML. The Python client wraps the API in one get call. Two options matter for defended, client-side sites: ajax_wait tells the API to wait for asynchronous content, and page_wait holds for a fixed number of milliseconds after load so late-rendering content appears. The collector checks the status code so a blocked page never silently becomes a blank row in your dataset.

python
import os
import time
from crawlbase import CrawlingAPI

# JS token renders the page in a real browser before returning HTML
api = CrawlingAPI({"token": os.environ["CRAWLBASE_JS_TOKEN"]})

options = {
    "ajax_wait": "true",
    "page_wait": 5000,
}

def fetch_html(url):
    response = api.get(url, options)
    if response["status_code"] != 200:
        raise RuntimeError(f"fetch failed: {response['status_code']}")
    return response["body"].decode("utf-8")

def collect(urls):
    pages = []
    for url in urls:
        try:
            pages.append({"url": url, "html": fetch_html(url)})
        except RuntimeError as err:
            print(f"skipping {url}: {err}")
        time.sleep(1)  # pace requests so you stay unblocked
    return pages

The Crawling API rotates IPs and renders the page for you, so the collector stays small. The time.sleep between requests is deliberate: pacing keeps a long run healthy. For a dataset of any real size you will want thousands of URLs, retry logic, and concurrency, which is its own topic covered in large-scale web scraping.
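As a starting point for the retry logic mentioned above, a wrapper with exponential backoff might look like the sketch below. The function name, the attempt count, and the delays are illustrative assumptions; it only relies on the fetch function raising RuntimeError on a failed fetch, as fetch_html does.

```python
import time

def fetch_with_retry(fetch, url, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url), retrying on RuntimeError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except RuntimeError:
            if attempt == attempts - 1:
                raise  # out of retries: let the caller log and skip the URL
            # wait 2s, then 4s with the defaults, doubling on each failure
            sleep(base_delay * 2 ** attempt)
```

Inside collect, you would call fetch_with_retry(fetch_html, url) in place of fetch_html(url); the existing except branch still catches the final failure and skips the URL.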

Step 2: Parse pages into structured records

Raw HTML is not a dataset. The next step turns each page into a flat record with the fields you want to learn from. This example treats a product listing page as the source and pulls out name, price, rating, and review text, but the shape applies anywhere: pick the fields, map each to a selector, return a dictionary. A small helper makes a missing element an empty string rather than a crash.

python
from bs4 import BeautifulSoup

def text_or_empty(node, selector):
    el = node.select_one(selector)
    return el.get_text(strip=True) if el else ""

def parse_products(page):
    soup = BeautifulSoup(page["html"], "html.parser")
    rows = []
    for card in soup.select(".product-card"):
        rows.append({
            "name": text_or_empty(card, ".title"),
            "price": text_or_empty(card, ".price"),
            "rating": text_or_empty(card, ".rating"),
            "review": text_or_empty(card, ".review-snippet"),
            "source": page["url"],
        })
    return rows

Treat the selectors above as a starting template, not a contract: class names and data attributes change without notice, so when extraction returns empty fields, re-inspect the live page in your browser's dev tools and update them. That is normal maintenance for any production scraper.
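Because a stale selector fails silently (the helper returns an empty string rather than raising), it can help to measure how often each field comes back empty after a run. The sketch below is one way to do that; the function names and the 50 percent threshold are assumptions to tune for your own sources.

```python
def empty_rates(rows, fields=("name", "price", "rating", "review")):
    """Share of rows where each field came back as an empty string."""
    if not rows:
        # no cards matched at all, so treat every field as fully empty
        return {field: 1.0 for field in fields}
    return {
        field: sum(1 for row in rows if not row.get(field)) / len(rows)
        for field in fields
    }

def drifted_fields(rows, threshold=0.5):
    """Fields empty in more than `threshold` of rows: likely a stale selector."""
    return [field for field, rate in empty_rates(rows).items() if rate > threshold]
```

Running drifted_fields(parse_products(page)) after each collection gives you an early warning before empty columns reach the dataframe.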

Step 3: Build a pandas dataframe

With a list of records, pandas gives you a dataframe in one line and a toolkit for everything after. Collect, parse, and load all the rows, then look at what you have before you trust it. The dedup and dropna steps matter more than they look: a dataset full of duplicate or half-empty rows teaches a model the wrong thing.

python
import pandas as pd

urls = [
    "https://www.example.com/category/page/1",
    "https://www.example.com/category/page/2",
]

records = []
for page in collect(urls):
    records.extend(parse_products(page))

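# name the columns so an empty scrape still yields a well-formed dataframe
df = pd.DataFrame(records, columns=["name", "price", "rating", "review", "source"])
# the parser emits "" for missing elements; turn those into real missing values
df = df.replace("", pd.NA)
df = df.drop_duplicates(subset=["name", "source"])
df = df.dropna(subset=["name"])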

print(df.shape)
print(df.head())
df.to_csv("dataset_raw.csv", index=False)

Writing dataset_raw.csv at this stage gives you a checkpoint: collection is slow and rate-limited, so you never want to re-scrape just because a later cleaning step had a bug. Load the CSV for the rest of the pipeline and keep the collector as a separate, occasional job.

Step 4: Clean and label the rows

Scraped fields arrive as messy strings: a price is "$118", a rating is "4.5 out of 5", a review is free text. A model needs numbers and a target column, so this step normalizes the raw fields and derives a label. Here the label is a simple sentiment proxy from the rating, which turns an unlabeled scrape into a supervised classification dataset.

python
import re
import pandas as pd

df = pd.read_csv("dataset_raw.csv")

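def to_float(value):
    # drop thousands separators first so "$1,299.00" parses as 1299.0, not 1.0
    # (assumes "." is the decimal mark and "," the thousands separator)
    match = re.search(r"(\d+(?:\.\d+)?)", str(value).replace(",", ""))
    return float(match.group(1)) if match else None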

df["price"] = df["price"].apply(to_float)
df["rating"] = df["rating"].apply(to_float)
df["review"] = df["review"].fillna("").str.strip()

# derive a supervised label from the rating
df = df.dropna(subset=["rating"])
df["label"] = (df["rating"] >= 4.0).astype(int)

print(df["label"].value_counts())
df.to_csv("dataset_clean.csv", index=False)

Checking value_counts on the label is not optional. Scraped data is rarely balanced, and a target that is 95 percent one class will produce a model that looks accurate while learning nothing. If the split is lopsided, rebalance before training, by resampling or by weighting the classes. For a deeper treatment of normalizing scraped fields for ML, see structure and clean web scraped data for AI and ML.
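If you go the weighting route, the weights can be computed directly from the label counts. The sketch below uses the n_samples / (n_classes * class_count) heuristic, which is the formula scikit-learn documents for class_weight="balanced"; the function name is illustrative.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}
```

With a 95/5 split, the rare class gets a weight of 10.0 and the common class roughly 0.53, so errors on the rare class cost far more during training. Many scikit-learn estimators accept a dictionary like this through their class_weight argument.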

Step 5: Prepare features for a model

The last step turns the clean dataframe into the numeric matrix a model trains on. Text needs vectorizing and numeric columns benefit from scaling, so a scikit-learn ColumnTransformer applies the right transform to each column in one pass. The output is a feature matrix X and a label vector y, split into train and test sets, ready to hand to any estimator.

python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

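df = pd.read_csv("dataset_clean.csv").fillna({"review": ""})
# StandardScaler passes NaN through, and most estimators reject it,
# so drop rows whose price could not be parsed
df = df.dropna(subset=["price"])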

features = df[["review", "price"]]
y = df["label"]

pre = ColumnTransformer([
    ("text", TfidfVectorizer(max_features=5000), "review"),
    ("num", StandardScaler(), ["price"]),
])

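# split first, then fit the transforms on the training rows only,
# so the test split stays unseen and the accuracy read stays honest
f_train, f_test, y_train, y_test = train_test_split(
    features, y, test_size=0.2, random_state=42, stratify=y
)
X_train = pre.fit_transform(f_train)
X_test = pre.transform(f_test)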

print(f"train: {X_train.shape}, test: {X_test.shape}")

From here, X_train and y_train drop straight into any scikit-learn estimator's fit method, and the held-out test split gives you an honest accuracy read. The stratify=y argument keeps the class balance consistent across the split, which matters most when your label is skewed. The collection-to-features chain is the reusable part: swap the selectors and the labeling rule and the same five steps build a dataset for a different problem.
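When the label is skewed, it helps to have a floor to compare any accuracy number against. The sketch below computes the accuracy of always predicting the most common training label; the function name is illustrative, and it works on any sequence of labels, including y_train and y_test from the split above.

```python
from collections import Counter

def majority_baseline_accuracy(y_train, y_test):
    """Accuracy of always predicting the most common training label."""
    majority_label, _ = Counter(y_train).most_common(1)[0]
    y_test = list(y_test)
    return sum(1 for label in y_test if label == majority_label) / len(y_test)
```

A model that does not clearly beat this number on the test split has probably learned little beyond the class balance.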

Keeping the collection layer healthy

A dataset you can rebuild on demand is worth far more than a one-time dump, so the collector needs to keep working as targets change. A few habits keep a long run healthy.

  • Pace and rotate. Spread requests out and route through rotating residential IPs so no single address trips a rate limit. The Crawling API handles rotation for you; if you build your own stack, this is the part to get right.
  • Read the status codes. A run that starts returning challenges is telling you the current rate or IP tier is no longer enough. Treat that as signal and back off, rather than retrying into a block.
  • Checkpoint raw HTML. Save what you fetch before you parse it, so a parser bug never costs you a re-scrape.
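The checkpointing habit can be sketched with the standard library alone. The raw_html directory name and the hash-based file naming here are assumptions; any stable mapping from URL to file path would work.

```python
import hashlib
from pathlib import Path

def html_path(url, root="raw_html"):
    """Stable file path for a URL, derived from a hash of the URL."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return Path(root) / f"{digest}.html"

def save_html(url, html, root="raw_html"):
    """Write fetched HTML to disk before any parsing happens."""
    path = html_path(url, root)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return path

def load_html(url, root="raw_html"):
    """Return the saved HTML for a URL, or None if it was never fetched."""
    path = html_path(url, root)
    return path.read_text(encoding="utf-8") if path.exists() else None
```

Calling save_html right after a successful fetch, and checking load_html before fetching, lets you rerun the parser as often as you like without spending another request.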

For the full playbook, see how to scrape websites without getting blocked. And once the dataset exists, AI data extraction and how it works covers turning messy pages into structured fields more automatically.

The honest part: ethics and legality

Building an ML dataset carries the same responsibilities as any scrape, and whether it is allowed depends on each site's terms of service, your jurisdiction, and what you do with the data. Collect only public data, respect each site's robots.txt and stated rate expectations, and keep request volume low enough that you are not straining anyone's servers.
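Respecting robots.txt can be automated with Python's standard urllib.robotparser module. The sketch below filters a URL list against rules you have already downloaded as text; the helper names are illustrative, and the "*" user agent is an assumption you would replace with whatever agent string applies to your collector.

```python
from urllib.robotparser import RobotFileParser

def build_robots(robots_txt):
    """Parse the text of a robots.txt file into a reusable rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

def allowed_urls(urls, parser, user_agent="*"):
    """Keep only the URLs the parsed rules allow this user agent to fetch."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]
```

In a live run, RobotFileParser can also load the file itself through its set_url and read methods; parsing text you already hold simply keeps the check offline and repeatable.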

Two points matter more for ML specifically. Never collect personal data or anything tied to identifiable individuals, and be careful that a derived label or feature does not reconstruct it. And remember that a model inherits the biases of its training data: a set scraped from one region, language, or platform produces a model that generalizes poorly outside it. For commercial reuse, get permission or an official data agreement rather than assuming silence is consent.

Key takeaways

  • Web data is diverse, fresh, and abundant. Those three properties are exactly what a model needs to generalize, which is why scraping powers so many ML datasets.
  • Reliability is the hard part, not parsing. Render client-side pages, rotate IPs, and pace requests, or the collector falls over before you have enough rows.
  • The Crawling API folds rendering and rotation into one call. Use the JS token for client-side pages and the normal token for static ones, chosen per source.
  • Clean and label before training. Normalize messy strings to numbers, derive a target column, and always check the class balance.
  • Feature prep makes the dataset model-ready. Vectorize text and scale numerics with a single scikit-learn ColumnTransformer, and split into train and test sets.
  • Stay on public data. Respect ToS and robots.txt, avoid personal data, and watch for bias your training set bakes in.

Frequently Asked Questions (FAQs)

Is web scraping used in machine learning?

Yes, extensively. The ability to collect large amounts of public data from many sources lets you build training sets that are bigger and more diverse than hand-labeled data alone, which is exactly what helps a model generalize. Scraping also keeps a dataset fresh, so models in fast-moving domains like pricing or sentiment stay aligned with current conditions rather than learning from stale snapshots.

How do I collect web data for a machine-learning dataset at scale?

The bottleneck is reliability, not parsing. Many pages render client-side and most commercial sites block automated traffic, so you need rendering and a trusted, rotating IP pool to fetch thousands of pages without getting cut off. The Crawling API handles both in one call: send a URL, get back finished HTML, and parse it into records. Pace your requests, checkpoint the raw HTML, and keep the collector as a separate job from the rest of the pipeline.

Do I need the normal token or the JS token?

It depends on the source. The normal token fetches static HTML and is faster and cheaper, so use it when the page already contains the data you want. The JS token renders the page in a real browser first, which you need for client-side-rendered sites where a plain fetch returns an empty shell. Choose per source rather than picking one for the whole job.

How do I turn scraped pages into a labeled dataset?

Parse each page into flat records, load them into a pandas dataframe, then clean and label. Normalize messy fields to numbers (strip currency symbols, extract ratings), drop duplicates and empty rows, and derive a target column from a field you trust, for example mapping a high rating to a positive label. Always check the class balance before training, because scraped data is rarely balanced.

How do I prepare scraped data as features for a model?

Convert each column into a numeric form the model can read. Vectorize text fields with something like TF-IDF, scale numeric columns so no single feature dominates, and apply both in one pass with a scikit-learn ColumnTransformer. Split into train and test sets, stratifying on the label so the class balance is preserved, and fit the transformer on the training split only so the test set stays a fair check. The resulting feature matrix is ready to fit any estimator.

Is it legal to scrape web data for machine learning?

It depends on each site's terms of service, your jurisdiction, and your purpose. Keep strictly to public data, respect robots.txt and rate expectations, and never collect personal data or anything tied to identifiable individuals. Be mindful that a model inherits the biases of its training set. For commercial reuse, get permission or an official data agreement rather than relying on a scraper.
