Scraping gets you rows; it does not get you answers. The data that comes off a crawler is almost always messy: prices stored as strings with currency symbols, dates in three different formats, half-filled columns, and the same product listed twice because two pages linked to it. Before any of it is useful you have to clean it, and the standard tool for that job in Python is pandas.
This guide takes a realistic scraped dataset, a CSV of product listings, and walks it all the way from raw rows to insight using pandas. You load it into a DataFrame, inspect it, handle missing values and duplicates, fix the column types, parse prices and dates into real numbers and timestamps, then filter, group, aggregate, sort, and export the result. Every snippet is real and copy-pasteable. The only assumption is that the input was collected responsibly from public sources in the first place.
What you will build
A short, repeatable cleaning and analysis pipeline that turns a raw scrape into a tidy dataset and a few summary tables. By the end you will have touched every field on a typical product row:
- product_name: free text, often with stray casing and whitespace.
- category: the grouping column you will aggregate on.
-
price: scraped as a string like
"$1,299.00"that must become a float. - rating: a numeric column with missing values to handle.
- in_stock: a boolean-ish flag stored inconsistently as text.
- scraped_at: a date string to parse into a real timestamp.
The same workflow applies to any tabular scrape, listings, reviews, job posts, or real-estate rows. If your crawler already writes clean JSON or CSV, you skip straight to the analysis half. For a deeper look at the two formats and when to pick each, see JSON vs CSV: the main differences.
Why scraped data needs cleaning first
Raw scraped data is rarely analysis-ready. The extraction layer pulls whatever text sits inside a selector, so a price comes back as "$1,299.00" rather than the number 1299.0, a rating cell can be blank when the page never rendered it, and the same listing can appear twice when pagination overlaps. Feed that straight into a calculation and you get errors or, worse, quietly wrong answers.
Cleaning fixes three classes of problem before any analysis runs: missing values from incomplete extraction, duplicate rows from overlapping crawls, and inconsistent formats across price, date, and text columns. Get those right and the rest of pandas, sorting, grouping, aggregation, becomes a few one-liners. If you are building this discipline into a larger pipeline, the patterns in structure and clean web-scraped data for AI and ML generalise well beyond a single CSV.
Prerequisites
You do not need much to follow along.
Basic Python. You should be comfortable running a script and reading a few lines of code. Knowing what a list and a dictionary are is enough; pandas does the heavy lifting.
Python 3.8 or later. Check your version with python --version. If you do not have it, install it from python.org.
A scraped CSV or JSON file. Any tabular export will do. The examples below assume a file named products.csv with the columns listed above, but the methods are the same whatever your columns are called.
Set up the project
Create a virtual environment so the project stays isolated, then install pandas. That is the only required dependency; matplotlib is optional and only needed for the quick plot at the end.
python --version python -m venv pandas_env source pandas_env/bin/activate pip install pandas matplotlib
On Windows, activate the environment with pandas_env\Scripts\activate instead of the source line. pandas gives you the DataFrame and every cleaning method below; matplotlib is what pandas plots through when you call .plot() at the end.
Step 1: Load the data into a DataFrame
Everything in pandas starts with a DataFrame, a table-like structure of rows and columns, much like a spreadsheet. A single column of it is a Series. Reading a scraped CSV into a DataFrame is one call, and JSON is just as easy with read_json.
import pandas as pd # Load the scraped product listings df = pd.read_csv("products.csv") # JSON works the same way: # df = pd.read_json("products.json") # Show the first five rows print(df.head())
The head() method prints the first five rows, your first confirmation that the file loaded and the columns landed where you expect. If your scrape produced one JSON object per line, pass lines=True to read_json. From here on, every operation works on the df object in memory; nothing touches the original file until you export at the end.
The cleaner your products.csv is before it reaches pandas, the less of this guide you actually need. The Scraper API auto-parses supported pages into structured JSON fields, prices, titles, ratings, and stock already split into typed values, so the rows you load are close to analysis-ready instead of a wall of raw strings. Less time fixing dtypes, more time finding insight.
Step 2: Inspect the dataset
Before you change anything, look at what you have. Three methods tell you almost everything: head() shows sample rows, info() reports column types and how many non-null values each holds, and describe() gives summary statistics for the numeric columns.
# Shape: (rows, columns) print(df.shape) # Column names, dtypes, and non-null counts df.info() # Summary stats for numeric columns print(df.describe())
Read info() carefully: it is where you spot trouble early. If price shows up as object rather than a number, it is still a string and arithmetic on it will fail. If rating has fewer non-null values than the row count, you have missing data to handle. This single call tells you which of the next steps you actually need.
Step 3: Handle missing values
Missing cells come from incomplete extraction, a field that did not render, a selector that matched nothing on one page. First count them per column, then decide whether to drop the affected rows or fill them with a sensible default.
# Count missing values per column print(df.isnull().sum()) # Drop rows missing a product name or price (can't analyse those) df = df.dropna(subset=["product_name", "price"]) # Fill missing ratings with the column average instead of dropping df["rating"] = df["rating"].fillna(df["rating"].mean())
The choice between dropping and filling is judgement, not a rule. Drop rows when the missing field is essential and cannot be guessed, a listing with no price is useless for price analysis. Fill when the column is secondary and a reasonable stand-in beats losing the whole row; here a missing rating becomes the column mean so those products still count in category aggregates. Using subset on dropna keeps the deletion targeted instead of throwing away any row with a single blank cell.
Step 4: Remove duplicate rows
Overlapping crawls and re-runs leave duplicate rows, and they quietly inflate counts and averages. Find them, then drop them, keeping the first occurrence.
# How many fully duplicated rows are there? print(df.duplicated().sum()) # Drop exact duplicate rows, keep the first df = df.drop_duplicates(keep="first") # Or de-duplicate on a unique key, like a product URL or ID df = df.drop_duplicates(subset=["product_name", "category"])
A bare drop_duplicates() only removes rows that are identical across every column, which misses near-duplicates where one field differs. When your scrape has a natural key, a product ID or a canonical URL, de-duplicate on that with subset so two captures of the same item collapse into one even if the scraped timestamp differs. Run this before any counting so your totals reflect distinct products, not crawl artifacts.
Step 5: Fix data types and parse prices
This is the step that turns scraped strings into values you can compute on. A scraped price arrives as text like "$1,299.00"; you strip the currency symbol and thousands separator, then cast to float. The same idea standardises the stock flag and tidies text columns.
# "$1,299.00" -> 1299.0 df["price"] = ( df["price"] .str.replace(r"[$,]", "", regex=True) .astype("float") ) # Tidy text: strip whitespace and normalise case df["product_name"] = df["product_name"].str.strip() df["category"] = df["category"].str.lower() # "Yes"/"No" text flag -> real booleans df["in_stock"] = df["in_stock"].str.lower().map({"yes": True, "no": False})
The .str accessor lets you run string operations across an entire column at once, so the regex [$,] strips both the dollar sign and the comma from every price before astype("float") turns the column numeric. Lower-casing category matters more than it looks: if half your rows say "Phones" and half say "phones", grouping treats them as two categories until you normalise. Mapping the stock flag to real booleans lets you filter on it cleanly in the next step.
After any type conversion, run df.info() again and confirm the column reads as float64, bool, or datetime64 rather than object. An object dtype on a column you think is numeric is the single most common reason a later calculation throws or silently returns garbage.
Step 6: Parse dates
Date strings are useless for time-based analysis until pandas understands them as timestamps. to_datetime converts the column, and once it is a real datetime you can sort by it, filter ranges, and pull out parts like the month.
# Parse the scraped timestamp into a real datetime df["scraped_at"] = pd.to_datetime(df["scraped_at"], errors="coerce") # Now date parts are available df["scrape_month"] = df["scraped_at"].dt.to_period("M")
Passing errors="coerce" is the safe default on scraped data: any value pandas cannot parse becomes NaT (the datetime equivalent of null) instead of raising and stopping the whole script. If every row shares one known format you can pass format="%Y-%m-%d" to make parsing faster and stricter. With a real datetime column, the .dt accessor exposes year, month, day, and helpers like to_period for monthly bucketing.
Step 7: Filter and select
With the data clean and typed, you can ask questions of it. Boolean indexing filters rows by condition, and you can combine conditions to narrow further.
# Only in-stock products over $100 premium = df[(df["in_stock"]) & (df["price"] > 100)] # Just the columns you care about premium = premium[["product_name", "category", "price", "rating"]] print(premium.head())
Each condition inside the brackets produces a boolean Series, and & combines them row by row, so wrap each condition in parentheses to keep the operator precedence right. Selecting a list of columns with double brackets returns a narrower DataFrame, handy when a scrape has thirty columns and you only want four for a particular question.
Step 8: Group, aggregate, and sort
The real payoff is summarising. groupby splits the rows by a key, then you aggregate each group, average price per category, product count per category, the top rating in each, and sort the result to read it at a glance.
# Average price per category, highest first avg_price = ( df.groupby("category")["price"] .mean() .sort_values(ascending=False) ) # Several aggregates at once summary = df.groupby("category").agg( product_count=("product_name", "count"), avg_price=("price", "mean"), max_price=("price", "max"), avg_rating=("rating", "mean"), ) print(avg_price) print(summary)
The single-metric form, groupby("category")["price"].mean(), returns one number per category and reads almost like English. When you want several metrics in one table, named aggregation with .agg() is cleaner: each keyword becomes an output column, and the (column, function) tuple says which field to summarise and how. Chaining sort_values on the end orders the result so the most and least expensive categories sit at the top and bottom.
Step 9: Apply statistics
For a quick numeric read on a single column, the built-in statistical methods cover most of what you need without leaving pandas.
# Full statistical summary of the price column print(df["price"].describe()) # Individual measures median_price = df["price"].median() std_price = df["price"].std() print(f"Median price: {median_price:.2f}") print(f"Std deviation: {std_price:.2f}")
The describe() call on a single column gives you count, mean, standard deviation, min, max, and the quartiles in one shot, a fast way to spot outliers such as a scraping error that left a price at zero or in the millions. median() is often more honest than the mean on price data because a few very expensive items skew the average upward, and comparing the two tells you how lopsided the distribution is.
Step 10: Visualise and export
A chart makes a distribution obvious, and pandas plots directly through matplotlib. When the analysis is done, write the cleaned data back out so the next stage, a notebook, a dashboard, a teammate, can pick it up.
import matplotlib.pyplot as plt # Average price per category as a bar chart avg_price.plot(kind="bar", title="Average price per category") plt.ylabel("Price") plt.tight_layout() plt.show() # Export the cleaned dataset df.to_csv("products_clean.csv", index=False) summary.to_csv("category_summary.csv") # JSON export works the same way df.to_json("products_clean.json", orient="records", indent=2)
Passing index=False to to_csv keeps pandas from writing the row numbers as a stray first column, which is almost always what you want for a clean export. For JSON, orient="records" produces a list of one object per row, the shape most other tools expect. With that, the raw scrape has become a tidy file and a summary table you can hand off or chart.
What the output looks like
After running the pipeline, the per-category summary reads as a small, clean table. Exported to JSON with orient="records", each row looks like this:
[ { "category": "laptops", "product_count": 128, "avg_price": 942.55, "max_price": 2499.00, "avg_rating": 4.31 }, { "category": "phones", "product_count": 204, "avg_price": 611.20, "max_price": 1399.00, "avg_rating": 4.12 } ]
Those numbers are illustrative, but the shape is exactly what the code produces: one record per category, each carrying the counts and aggregates you defined in .agg(). That structured output is what feeds the next stage, whether that is a price-tracking dashboard, a notebook, or a model.
Scaling beyond one file
A single CSV is the starting point, not the destination. As your scrapes grow, a few habits keep the same pipeline working at volume.
-
Combine many files. When a crawl writes one CSV per run, read them all and stack with
pd.concat([...])into a single DataFrame before cleaning, then de-duplicate so overlapping runs do not double-count. -
Process large files in chunks. If a file is too big for memory, pass
chunksizetoread_csvand iterate, cleaning and aggregating each chunk so you never hold the whole dataset at once. -
Make cleaning a function. Wrap Steps 3 through 6 in a single
clean(df)function so every new batch gets the same treatment, and your pipeline stays reproducible instead of a pile of one-off cells. - Pick the right input format. Nested data, reviews with replies, products with variant lists, often fits JSON better than CSV. The right call depends on your data shape, covered in JSON vs CSV.
If pandas starts feeling limiting on a column you scrape often, the wider ecosystem of Python scraping and data libraries covers the tools that pick up where a single DataFrame leaves off.
Key takeaways
- Clean before you analyse. Scraped data arrives with missing cells, duplicate rows, and string-typed numbers; fix those first or every later calculation is suspect.
- Inspect with head, info, and describe. Those three methods tell you the shape, the dtypes, and the missing-value counts, so you know exactly which cleaning steps you need.
-
Parse prices and dates into real types. Strip currency symbols and cast to float, convert date strings with
to_datetime, and confirm withinfo()that no numeric column is stillobject. - groupby plus agg is where insight lives. Split by a key, aggregate several metrics at once with named aggregation, and sort the result to read it at a glance.
- Cleaner input means less cleaning. The more structured your data is when it reaches pandas, the shorter this whole pipeline gets.
Frequently Asked Questions (FAQs)
How do I install pandas in Python?
Run pip install pandas from your terminal, ideally inside a virtual environment so it does not clash with other projects. Then import it in your script with import pandas as pd, the conventional alias used in nearly all pandas code. If you also want the quick charts shown here, install matplotlib alongside it with pip install pandas matplotlib.
What is the difference between a DataFrame and a Series in pandas?
A DataFrame is the whole table, rows and columns together, much like a spreadsheet or a database table. A Series is a single column of that table, a one-dimensional labelled array. When you select one column with df["price"] you get a Series; most cleaning methods like fillna and str.replace operate on a Series, and you assign the result back to the column.
How do I convert a scraped price string into a number?
Use the .str accessor to strip the non-numeric characters, then cast the column. For a value like "$1,299.00" the call is df["price"].str.replace(r"[$,]", "", regex=True).astype("float"), which removes the dollar sign and comma and turns the column into a float you can compute on. Always recheck with df.info() that the column is no longer an object dtype afterward.
Should I drop or fill missing values in scraped data?
It depends on how essential the column is. Drop the row when the missing field is critical and cannot be reasonably guessed, a listing with no price is useless for price analysis. Fill the value when the column is secondary and a stand-in beats losing the whole row, for example replacing a missing rating with the column mean so the product still counts in category aggregates. Use dropna(subset=[...]) to target specific columns rather than deleting any row with a single blank.
How do I summarise scraped data by category?
Use groupby followed by an aggregation. For one metric, df.groupby("category")["price"].mean() gives the average price per category. For several at once, use named aggregation, df.groupby("category").agg(count=("product_name", "count"), avg_price=("price", "mean")), where each keyword becomes an output column. Chain sort_values on the end to order the result.
Can I clean and analyse scraped data automatically?
Yes. Wrap your cleaning steps in a function and call it on each new batch, then run the whole script on a schedule so every fresh scrape is processed the same way. Keeping the logic in one reproducible function, rather than scattered notebook cells, is what makes the pipeline reliable as your data grows.
Crawl any site at scale, without fighting infrastructure.
Crawlbase handles proxies, fingerprints, and CAPTCHAs so your team ships data pipelines instead of maintaining crawl plumbing. 1,000 requests free, no card required.
