Structure and Clean Web Data for AI

Raw web-scraped data almost never lands in a state a model can use. Pull a few thousand product pages, listings, or articles and you get duplicated rows, prices stored as strings, dates in five formats, blank cells, and text studded with HTML entities and stray whitespace. Feed that straight into a training job and the model learns the noise as readily as the signal. The work that turns a scrape into a dataset is data cleaning and structuring, and it is where most of the accuracy of an AI or ML pipeline is actually won or lost.

This guide is a hands-on walkthrough to structure and clean web data for AI: load a raw scrape, deduplicate it, normalize types and formats, handle missing values, do basic text cleaning and tokenization, design a schema your downstream code can rely on, and validate the result before it ever reaches a model. Every step has runnable Python you can paste into a notebook. At the end we note where Crawlbase can return clean, structured output up front so you do less of this by hand.

Why cleaning and structuring decide model quality

Models do not reason about your data the way you do. A duplicated row is extra weight on one example. A price stored as "$1,299.00" is a string the model cannot compare to 1299.0. A date written as "03/04/2025" in some rows and "2025-04-03" in others becomes two unrelated tokens. None of these throw an error, which is exactly why they are dangerous: the pipeline runs, the metrics look plausible, and the model is quietly learning from a corrupted view of the world.

Cleaning fixes the obvious damage (duplicates, missing values, inconsistent formats) and structuring imposes a contract: every column has one type, one unit, and one meaning. That contract is what lets the same dataset feed a classifier today and a different model next quarter without surprises. The same discipline applies whether you are doing web scraping for machine learning on a few thousand rows or running a large-scale web scraping job into the millions.

Start with a realistic raw scrape

To make the steps concrete, assume you scraped a set of ecommerce product listings and wrote them to raw_products.csv. A real scrape is messy, so the file below is too: duplicated rows, currency symbols and thousands separators in price, mixed date formats, blank cells, and review text with HTML entities and ragged whitespace.

python

import pandas as pd

df = pd.read_csv("raw_products.csv")

# A first look at the damage before touching anything
print(df.shape)
print(df.dtypes)
print(df.isna().sum())
print(df.head())

The dtypes output is the tell. If price comes back as object instead of a numeric type, pandas could not parse it, which means there is non-numeric junk in the column. Running isna().sum() early tells you which columns have missing values and how bad it is, so you can decide what to fix before you write a single transformation.
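
If you want to follow along without a scrape of your own, a small stand-in frame reproduces the same kinds of damage. This is only a sketch: the column names match the ones used in this guide, but every value below is invented for illustration.

```python
import pandas as pd

# Hypothetical rows, invented to mimic the mess in a real scrape
raw = pd.DataFrame({
    "product_id": ["A1", "A1", "B2", "C3"],
    "price": ["$1,299.00", "$1,299.00", "49.99", None],
    "listed_on": ["03/04/2025", "03/04/2025", "2025-04-03", "April 5, 2025"],
    "category": ["Electronics", "Electronics", "electronics ", "ELECTRONICS"],
})

print(raw.dtypes)              # price is a text dtype, not a numeric one
print(raw.isna().sum())        # missing values per column
print(raw.duplicated().sum())  # count of fully identical repeat rows
```

Running the same three checks on your real file should show the same pattern: text where you expected numbers, a few nulls, and repeated rows.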

Deduplicate first

Deduplication comes before everything else because duplicates inflate every later statistic. Exact duplicates are the easy case: identical rows from a scraper that revisited the same URL or paginated over an overlapping window.

python

# Drop fully identical rows
df = df.drop_duplicates()

# Deduplicate on a business key, keeping the most recent capture
df = (df.sort_values("scraped_at")
        .drop_duplicates(subset=["product_id"], keep="last"))

The second pattern matters more in practice. Two captures of the same product_id are not identical rows once the price changed, but you usually want one record per product, not two. Sorting by capture time and keeping "last" gives you the freshest version. Pick whichever key actually identifies an entity in your domain (a product ID, a URL, a SKU) rather than trusting full-row equality.
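
A tiny worked example makes the difference between the two calls visible. The rows and scraped_at values here are invented; the point is that full-row equality misses a repeat capture whose price changed, while the business key catches it.

```python
import pandas as pd

# Invented captures: product A1 was scraped twice at different prices
captures = pd.DataFrame({
    "product_id": ["A1", "A1", "B2"],
    "price": [1299.0, 1199.0, 49.99],
    "scraped_at": pd.to_datetime(["2025-04-01", "2025-04-03", "2025-04-02"]),
})

# Full-row equality sees three different rows, so nothing is dropped
exact = captures.drop_duplicates()

# The business key collapses A1 to its most recent capture
latest = (captures.sort_values("scraped_at")
                  .drop_duplicates(subset=["product_id"], keep="last"))

print(len(exact), len(latest))
print(latest.set_index("product_id")["price"].to_dict())
```

Note that scraped_at is a real datetime here. If the column is still text, parse it before sorting, since string ordering only matches time ordering when every timestamp uses one consistent ISO 8601 layout.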

Deduplicate before you impute

Order is not cosmetic here. If you fill missing values first and deduplicate second, your imputed averages are computed over inflated counts and skewed toward whichever entities were duplicated most. Always drop duplicates before any statistic (mean, median, mode) that later steps depend on.
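
To see the skew in numbers, compare the median computed before and after deduplication. The ratings below are made up, with one product captured four times.

```python
import pandas as pd

# Invented data: A1 appears four times, B2 and C3 once each
ratings = pd.DataFrame({
    "product_id": ["A1", "A1", "A1", "A1", "B2", "C3"],
    "rating": [1.0, 1.0, 1.0, 1.0, 4.0, 5.0],
})

# Median over the raw rows is dominated by the duplicated product
median_inflated = ratings["rating"].median()

# Median over one row per product reflects the actual catalogue
deduped = ratings.drop_duplicates(subset=["product_id"])
median_deduped = deduped["rating"].median()

print(median_inflated, median_deduped)
```

Any missing rating filled from the first number would inherit the bias of the most-duplicated product; the second number is the one worth imputing with.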

Normalize types and formats

With one row per entity, make each column a single, predictable type. This is the step that turns "$1,299.00" into 1299.0 and five date formats into one. Currency strings need their symbols and separators stripped before they will parse as numbers; dates need a single parser with errors="coerce" so unparseable junk becomes NaT instead of crashing the run.

python

# Strip currency symbols and separators, then parse to float
# (assumes "." is the decimal mark; the pattern also removes minus signs)
df["price"] = (df["price"]
    .astype("string")
    .str.replace(r"[^\d.]", "", regex=True))
df["price"] = pd.to_numeric(df["price"], errors="coerce")

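# Parse mixed date formats into one datetime type.
# format="mixed" (pandas 2.0+) infers the format per value; without it,
# pandas 2.x infers one format from the first value and coerces the rest to NaT.
# Ambiguous dates like 03/04/2025 are read month-first unless dayfirst=True.
df["listed_on"] = pd.to_datetime(
    df["listed_on"], format="mixed", errors="coerce"
)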

# Normalize a categorical column: trim and lowercase
df["category"] = df["category"].str.strip().str.lower()

Normalizing categoricals is the quiet win here. Scrapes routinely produce "Electronics", "electronics ", and "ELECTRONICS" as three distinct values for one category. Trimming and lowercasing collapses them into one, which means cleaner group-by results and one feature instead of three after encoding. Do the same standardization on units: if some prices are in dollars and others in cents, convert to a single unit now, while you still remember which is which.
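
The unit conversion can be sketched as follows. It assumes the scrape recorded which unit each price is in; the price_unit column and its values are hypothetical, so substitute whatever signal your source actually provides.

```python
import pandas as pd

listings = pd.DataFrame({
    "price": [1299.0, 4999.0, 19.5],
    "price_unit": ["usd", "cents", "usd"],   # hypothetical column
})

# How many of each unit make up one dollar
units_per_dollar = {"usd": 1, "cents": 100}

divisor = listings["price_unit"].map(units_per_dollar)

# A unit missing from the mapping yields NaN, which is safer than guessing
listings["price_usd"] = listings["price"] / divisor

print(listings)
```

Leaving unknown units as NaN means they flow into the missing-value step rather than silently passing through at the wrong scale.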

Handle missing values deliberately

There is no universal rule for missing data, only trade-offs. Dropping rows is safe when the gaps are rare and the row is useless without the field. Imputation keeps the row but invents a value, so it only makes sense when the column is needed and a reasonable estimate exists. Decide per column rather than applying one blanket call to the whole frame.

python

# Drop rows missing a field the record cannot exist without
df = df.dropna(subset=["product_id", "price"])

# Impute a numeric column with its median (robust to outliers)
df["rating"] = df["rating"].fillna(df["rating"].median())

# Fill a categorical with an explicit sentinel, not a guess
df["brand"] = df["brand"].fillna("unknown")

Median beats mean for imputing numeric columns because a few extreme outliers (a mis-scraped price of 99999) drag the mean but barely move the median. For categoricals, an explicit "unknown" sentinel is honest: it tells the model the value was absent instead of pretending it belonged to whichever category was most common. Never let a downstream encoder silently treat NaN as a real category.
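
A quick check on invented numbers shows how far one mis-scraped value moves each statistic.

```python
import pandas as pd

# Four plausible prices and one mis-scraped outlier (values are invented)
prices = pd.Series([19.0, 21.0, 20.0, 22.0, 99999.0])

mean_price = prices.mean()      # dragged far above every typical value
median_price = prices.median()  # stays with the bulk of the data

print(mean_price, median_price)
```

Filling gaps with the mean here would assign a price around a thousand times larger than anything typical in the column; the median stays at 21.0.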

Clean text and tokenize

If your dataset carries free text (product descriptions, reviews, article bodies) it needs its own pass. Scraped text arrives with HTML entities (&amp;, &#39;), leftover tags, URLs, and inconsistent whitespace. Clean it before tokenizing or the tokens will be full of garbage. Tokenization here means splitting text into the units a model consumes; the example below does whitespace tokenization, which is enough to illustrate the cleaning that has to come first.

python

import re
import html

def clean_text(value):
    if pd.isna(value):
        return ""
    text = html.unescape(str(value))   # &amp; -> &
    text = re.sub(r"<[^>]+>", " ", text)   # strip tags
    text = re.sub(r"http\S+", " ", text)   # strip URLs
    text = re.sub(r"\s+", " ", text)         # collapse whitespace
    return text.strip().lower()

df["review_clean"] = df["review"].apply(clean_text)
df["tokens"] = df["review_clean"].str.split()

The order inside clean_text is deliberate: unescape entities first so &amp; becomes & before any regex runs, strip tags and URLs next, then collapse whitespace last so the gaps left by the previous removals close up. Lowercasing at the end keeps token counts honest ("Fast" and "fast" become one token). For real NLP work you would swap the final split for a proper tokenizer, but the cleaning above is the part that scraped data always needs.
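
Running the function on a couple of invented strings shows why the order matters. The block repeats clean_text so it runs on its own; the second sample contains an escaped tag, which only gets stripped because unescaping happens first.

```python
import html
import re

import pandas as pd

def clean_text(value):
    if pd.isna(value):
        return ""
    text = html.unescape(str(value))        # entities -> characters
    text = re.sub(r"<[^>]+>", " ", text)    # strip tags
    text = re.sub(r"http\S+", " ", text)    # strip URLs
    text = re.sub(r"\s+", " ", text)        # collapse whitespace
    return text.strip().lower()

# Invented samples: a real tag plus URL, and an escaped tag
plain = clean_text("  Fast &amp; light!<br>Love it.  More at https://example.com/p/1 ")
escaped = clean_text("&lt;b&gt;Fast&lt;/b&gt; &amp; light")

print(plain)
print(escaped)
print(plain.split())
```

The token list also shows the limit of whitespace splitting: punctuation stays glued to words ("light!", "it."), which is one reason to move to a dedicated tokenizer for real NLP work.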

Crawlbase Crawling API

Most of the entity-stripping above exists because raw HTML scrapes are noisy. The Crawling API can return clean, structured output (including a markdown view of the page) so the text arrives without tags and boilerplate, cutting the cleaning step down to type normalization. Point it at a public page on the free tier and compare the output to a raw fetch.


Design a schema and enforce it

Up to here the cleaning has been reactive. A schema makes it a contract: a declared set of columns, each with one type, that every batch must satisfy. Encoding the schema as types you cast to (and as assertions you check) means the next scrape either conforms or fails loudly, instead of drifting silently into the same mess you just cleaned.

python

# A schema is just a column -> dtype contract
schema = {
    "product_id": "string",
    "category":   "category",
    "price":      "float64",
    "rating":     "float64",
    "brand":      "string",
    "listed_on":  "datetime64[ns]",
}

# Keep only schema columns, in order, and cast each one
df = df[list(schema.keys())].astype(schema)

Selecting list(schema.keys()) drops any stray columns the scraper added and fixes column order, so every export has the same shape. The astype(schema) call casts each column and will raise if a value cannot be coerced, which is the behavior you want: better a loud failure now than a corrupt column discovered after a training run. Using the category dtype for low-cardinality fields like category also shrinks memory and speeds up group-bys on large frames.
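
One practical wrinkle: selecting df[list(schema.keys())] raises a KeyError when a schema column is absent from the batch. A small pre-check can report drift in both directions before the cast. This is a sketch; check_columns is a hypothetical helper name, and the batch below is invented.

```python
import pandas as pd

def check_columns(frame, schema):
    """Return (missing, extra) column names relative to the schema."""
    missing = [col for col in schema if col not in frame.columns]
    extra = [col for col in frame.columns if col not in schema]
    return missing, extra

# Invented batch: the scraper lost "rating" and gained "promo_badge"
batch = pd.DataFrame({
    "product_id": ["A1"],
    "price": [9.99],
    "promo_badge": ["new"],
})
expected = {"product_id": "string", "price": "float64", "rating": "float64"}

missing, extra = check_columns(batch, expected)
print("missing:", missing)
print("extra:", extra)
```

Whether an extra column should fail the batch or just be logged is a policy choice; a missing column is almost always worth stopping for.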

Validate before export

Validation is the gate between "looks clean" and "is clean." A handful of assertions catch the failures that silently poison a model: surviving duplicates, out-of-range numbers, nulls in columns that should be complete. Run them on every batch and stop the pipeline when one fails.

python

def validate(frame):
    assert frame["product_id"].is_unique, "duplicate product_id"
    assert frame["price"].between(0, 100000).all(), "price out of range"
    assert frame["rating"].between(0, 5).all(), "rating out of range"
    assert frame[["product_id", "price"]].notna().all().all(), "unexpected nulls"
    return frame

df = validate(df)

Range checks earn their keep: a rating of 50 on a 0 to 5 scale or a negative price is almost always a parsing bug from an earlier step, and the assertion surfaces it before the data reaches a model. If you outgrow hand-written asserts, a schema-validation library such as Pandera or Great Expectations expresses the same rules declaratively, but the asserts above are enough to make a pipeline trustworthy.
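
One caveat with bare asserts: Python skips assert statements when run with the -O flag, and the first failure hides any others. A variant that collects every violation is sketched below; find_problems is a hypothetical name, and the rules simply mirror the ones in validate.

```python
import pandas as pd

def find_problems(frame):
    """Return a list of rule violations; an empty list means the batch passed."""
    problems = []
    if not frame["product_id"].is_unique:
        problems.append("duplicate product_id")
    if not frame["price"].between(0, 100000).all():
        problems.append("price out of range")
    if not frame["rating"].between(0, 5).all():
        problems.append("rating out of range")
    if frame[["product_id", "price"]].isna().any().any():
        problems.append("unexpected nulls")
    return problems

# Invented batch with three separate faults
bad = pd.DataFrame({
    "product_id": ["A1", "A1", "B2"],
    "price": [19.99, -5.0, 49.99],
    "rating": [4.5, 50.0, 3.0],
})

problems = find_problems(bad)
print(problems)
```

In a pipeline you would raise an exception (for example a ValueError listing the problems) whenever the list is non-empty, so the batch still fails loudly.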

Export the clean dataset

With the frame deduplicated, normalized, imputed, schema-cast, and validated, write it out in a format that preserves your types. CSV is portable but stringly typed; Parquet keeps dtypes, compresses well, and loads faster, which matters once you are doing AI model training over the result.

python

# Parquet preserves dtypes and is fast to reload
# (needs a Parquet engine installed: pyarrow or fastparquet)
df.to_parquet("clean_products.parquet", index=False)

# CSV if you need maximum portability
df.to_csv("clean_products.csv", index=False)

That file is now a dataset, not a scrape: one row per entity, one type per column, no surviving duplicates, missing values handled on purpose, and every value inside its declared range. From here the path to a model is the familiar one of feature engineering and a train/test split, and the data underneath it will not surprise you.
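
The point about CSV losing types can be checked without writing any files. The sketch below round-trips a small invented frame through an in-memory CSV, then shows one way to re-declare the types on load.

```python
import io

import pandas as pd

# Invented two-row frame with the same kinds of dtypes as the schema
clean = pd.DataFrame({
    "product_id": pd.array(["A1", "B2"], dtype="string"),
    "category": pd.Categorical(["electronics", "toys"]),
    "price": [1299.0, 49.99],
    "listed_on": pd.to_datetime(["2025-04-03", "2025-04-05"]),
})

buffer = io.StringIO()
clean.to_csv(buffer, index=False)

# A plain reload: category and datetime come back as text
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Reload again, this time declaring the types up front
buffer.seek(0)
restored = pd.read_csv(
    buffer,
    dtype={"product_id": "string", "category": "category"},
    parse_dates=["listed_on"],
)

print(reloaded.dtypes)
print(restored.dtypes)
```

If CSV is the format you have to ship, keeping the schema next to the file and applying it on every load is what stops the types from drifting.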

Let the source return cleaner data

The fastest cleaning step is the one you skip because the data arrived clean. A lot of the work above (entity unescaping, tag stripping, whitespace collapsing) exists only because you scraped raw HTML. The Crawling API can return a clean, markdown-style view of a page so the text comes without tags and boilerplate. It also auto-parses many popular sites into structured JSON fields, which removes the selector-writing and most of the type-guessing before pandas ever sees the data. When you need rotating residential IPs without managing a pool, the Smart AI Proxy covers that side.

None of this removes the need to deduplicate, validate, and enforce a schema (those are properties of your dataset, not the page), but it does shrink the noisy first half of the job. For where this dataset goes next, see how AI data extraction works and, for high-volume capture, the patterns in ecommerce web scraping.

Recap

Key takeaways

Deduplicate first. Drop duplicates before any statistic, or imputed averages are computed over inflated, skewed counts.
Normalize types and formats. Strip currency symbols to floats, parse mixed dates with errors="coerce", and trim plus lowercase categoricals.
Handle missing values per column. Drop rows missing an essential field, impute numerics with the median, and fill categoricals with an explicit "unknown".
Clean text before tokenizing. Unescape entities, strip tags and URLs, then collapse whitespace, in that order.
Define a schema and validate. Cast every column to one declared type and assert uniqueness, ranges, and non-null on each batch so bad data fails loudly.
Cleaner input means less work. Crawlbase can return clean or markdown output and auto-parsed JSON, shrinking the cleaning step before pandas runs.

Frequently Asked Questions (FAQs)

Why is data cleaning so important before training an AI model?

Because models learn whatever is in the data, including the mistakes. Duplicates over-weight some examples, string-typed prices cannot be compared, mixed date formats fragment into unrelated values, and missing cells get misread by encoders. None of these throw errors, so the pipeline runs and the model quietly trains on a corrupted view. Cleaning removes that damage so the model learns the signal instead of the noise.

Should I deduplicate or handle missing values first?

Deduplicate first. If you impute before dropping duplicates, every average you fill with is computed over inflated counts and skewed toward whichever entities were duplicated most. Drop exact duplicates and collapse on a business key (keeping the freshest capture), then compute the medians and modes you use for imputation.

How do I decide between dropping rows and imputing missing values?

Decide per column. Drop the row when the missing field is one the record cannot exist without, such as a primary key or the target value, and the gaps are rare. Impute when the column is needed downstream and a defensible estimate exists: median for numerics because it resists outliers, and an explicit "unknown" sentinel for categoricals so the absence is recorded rather than guessed.

What is the minimum text cleaning scraped data needs?

Unescape HTML entities, strip leftover tags and URLs, and collapse runs of whitespace into single spaces, in that order, then lowercase. Scraped text routinely contains &amp;, stray markup, and ragged spacing that would otherwise become noisy tokens. That pass is enough before whitespace tokenization; for production NLP you would swap in a dedicated tokenizer afterward.

Why bother enforcing a schema if the data is already clean?

Because the next batch will not be clean unless something forces it to be. A schema declares one type per column and casts every batch to it, so a scrape that drifts (a new column, a price that suddenly will not parse) fails loudly instead of silently reintroducing the mess you just removed. It turns cleaning from a one-off into a repeatable contract.

Can Crawlbase reduce how much cleaning I have to do?

Yes, for the noisy first half. The Crawling API can return a clean, markdown-style view of a page so text arrives without tags or boilerplate, and the Scraper API auto-parses many popular sites into structured JSON, which removes selector writing and most type guessing. You still deduplicate, validate, and enforce a schema yourself, because those are properties of your dataset rather than the source page.

Hassan Rehan

Software Engineer · Crawlbase

Software engineer at Crawlbase writing hands-on guides on rotating proxies, scraping, and the practical details of wiring proxies into real code.
