Amazon is one of the largest public catalogs of product data on the web. Every product page exposes a title, a price, a star rating, and an availability status, and that data feeds competitor price tracking, market research, assortment analysis, and demand monitoring. Saving one product by hand is trivial. Saving hundreds or thousands of them is where you need a scraper, and where Amazon's defenses start to push back.

This guide shows you how to scrape Amazon product data in Ruby. You build a small, runnable script that fetches a rendered product page through the Crawling API, parses it with Nokogiri, and pulls a clean record: product title, price, rating, and availability. The walkthrough stays scoped to public product data, and the legality section near the end is not boilerplate, so read it before you point this at any real volume.

What you will build

A Ruby script that takes an Amazon product URL, retrieves the page through the Crawling API, and extracts a structured record with Nokogiri. We pull these fields from the product page:

  • Title: the product name shown in the page heading, for example "Echo Dot (5th Gen)".
  • Price: the listed price in the buy box.
  • Rating: the average star rating, when the product has one.
  • Availability: the stock status, for example "In Stock" or "Currently unavailable".

Why a plain request fails on Amazon

If you request an Amazon product URL with a bare HTTP client, you rarely get the page a shopper sees. Two things work against you. First, Amazon flags automated traffic quickly: datacenter IPs and request patterns that do not look like a real browser get challenged with a CAPTCHA, served a robot check, or blocked outright before they reach the product markup. Second, parts of the page render or change based on the visitor's region, session, and device, so a naive fetch often returns a stripped or inconsistent version of the listing.

So a working Amazon scraper needs an IP the platform reads as a real shopper and, where the page leans on rendering, a browser that actually runs the page. You can assemble that yourself with a pool of rotating residential proxies and a headless browser, but stitching those together and keeping them healthy is most of the work. The Crawling API folds both into a single call: you send it the URL, it fetches the page behind a trusted residential IP and handles the CAPTCHA layer, and it returns finished HTML for you to parse.

Normal token vs JS token

Crawlbase offers two token types. The normal token fetches the page HTML and is the right default for an Amazon product page. The JavaScript (JS) token additionally renders the page in a real browser first, which costs more credits and is worth reserving for content that only appears after client-side scripts run. Start with the normal token here and switch to the JS token only if a field you need comes back empty.
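As a rough illustration of that fallback rule, the sketch below checks a parsed record for empty fields so a script can decide whether a JS-token retry is worth the extra credits. The helper name missing_fields and the REQUIRED_FIELDS list are assumptions made for this example, not part of the Crawlbase client, and the record is assumed to be a hash keyed by the four fields listed under "What you will build".

```ruby
# Hypothetical helper: list the required fields that came back nil or blank.
REQUIRED_FIELDS = %i[title price rating availability].freeze

def missing_fields(record, required = REQUIRED_FIELDS)
  required.select do |field|
    value = record[field]
    value.nil? || value.to_s.strip.empty?
  end
end

# Example record with two gaps (sample values, not real scraped data).
record = { title: 'Echo Dot (5th Gen)', price: nil, rating: '4.7 out of 5 stars', availability: '' }
gaps = missing_fields(record)
puts "Consider a JS-token retry, missing: #{gaps.join(', ')}" unless gaps.empty?
```

Some products legitimately have no rating, so you may want to pass a shorter required list for catalogs that include unrated items.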

Prerequisites

You need a few things in place before writing any code. None of them take long.

Basic Ruby. You should be comfortable writing and running a Ruby script and installing gems. If you are new to the language, the official Ruby documentation and any beginner course will get you to the level this tutorial assumes.

Ruby 2.7 or later. Confirm your version with ruby --version. If you do not have it, install it from ruby-lang.org or through a version manager like rbenv or rvm.

A Crawlbase account and token. Sign up, open your dashboard, and copy your normal request token from the account page. Treat the token like a password: it authenticates your requests, so keep it out of version control. Crawlbase gives you 1,000 free requests to start, with no card required, and you only pay for successful requests after that.
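One common way to keep the token out of version control is to read it from an environment variable at startup. This is a minimal sketch: the variable name CRAWLBASE_TOKEN and the helper crawlbase_token are assumptions chosen for this example, not something the Crawlbase client requires.

```ruby
# Read the token from the environment instead of hardcoding it in the script.
# CRAWLBASE_TOKEN is an assumed variable name; pick whatever fits your setup.
def crawlbase_token(env = ENV)
  token = env.fetch('CRAWLBASE_TOKEN', '').strip
  raise ArgumentError, 'Set CRAWLBASE_TOKEN before running the scraper' if token.empty?

  token
end

# Usage in the scraper (requires the crawlbase gem):
#   api = Crawlbase::API.new(token: crawlbase_token)
```

Export the variable in your shell, or load it from a file that is listed in .gitignore, and the script fails fast with a clear message when it is missing.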

Set up the project

Create a file named amazon_scraper.rb for your code, then install the two gems the scraper needs: the official Crawlbase client and Nokogiri for parsing.

bash
ruby --version

gem install crawlbase
gem install nokogiri

Two dependencies do the work: crawlbase is the official client for the Crawling API, and nokogiri parses the returned HTML so you can pull each field out of the page by CSS selector. If you prefer a Gemfile, add gem 'crawlbase' and gem 'nokogiri' and run bundle install instead.

Understanding the Amazon product page

An Amazon product page organizes the listing into stable, identifiable elements. The product title sits in a heading with the id productTitle. The price appears in the buy box, the average rating in the reviews summary, and the stock status in an availability block. Amazon's markup is large and varies by category, but these core fields carry consistent ids and attributes you can target.

Before writing selectors, open a product page in your browser, right-click the title or price, and choose Inspect. Note the id or class on each field you want. Amazon ships several price layouts depending on the deal type, so it helps to look at a couple of products before settling on selectors. Ids like #productTitle are the most durable anchors; class-based price spans change more often, so plan to verify them against a live page.

Step 1: Fetch the product page

Start by getting the page HTML. Require the Crawlbase gem, initialize the API with your token, point it at a product URL, and request it. Check the status code before you parse so failures stay loud instead of silent.

ruby
require 'crawlbase'

api = Crawlbase::API.new(token: 'YOUR_CRAWLBASE_TOKEN')
url = 'https://www.amazon.com/dp/B09B8V1LZ3'

response = api.get(url)

if response.status_code == 200
  html = response.body
  puts html[0, 500] # first 500 characters
else
  puts "Request failed: #{response.status_code}"
end

The Crawlbase::API.new call sets up a client tied to your token, and api.get(url) fetches the page behind a trusted IP. The response carries a status_code and a body; checking the code before reading the body means a block or an error surfaces immediately rather than producing a confusing parse failure later. Run this and you should see real Amazon product markup in the first 500 characters, not a robot-check page. That confirms the fetch works before you write a single selector.
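A 200 status does not by itself prove the body is a product page, so it can help to check the HTML before parsing. The sketch below is a heuristic under stated assumptions: it treats the presence of id="productTitle" as the sign of a product page and the word "captcha" as the sign of a challenge page. Both markers, and the helper name classify_body, are illustrative and should be checked against real responses.

```ruby
# Hypothetical heuristic: label a response body before handing it to a parser.
# The marker strings are assumptions; verify them against live pages.
def classify_body(html)
  return :empty if html.nil? || html.strip.empty?
  return :product if html.include?('id="productTitle"')
  return :challenge if html.match?(/captcha/i)

  :unknown
end

puts classify_body('<span id="productTitle">Echo Dot</span>')
puts classify_body('<form>Type the characters you see (captcha)</form>')
```

Anything other than :product is worth logging with the URL so you can inspect what actually came back.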

Crawlbase Amazon Scraper

That single api.get(url) call is doing the hard part for you. Amazon needs a request that arrives from an IP it reads as a real shopper and clears the CAPTCHA layer, in one shot. The Crawling API fetches the page behind rotating residential IPs and handles the bot checks server-side, so you skip running a headless browser fleet and a proxy pool yourself. Point it at a product URL on the free tier first.

Step 2: Parse the fields with Nokogiri

With the page HTML in hand, load it into Nokogiri and pull each field by its selector. Amazon exposes the title through the #productTitle id, the price through the buy box price span, the rating through the reviews summary, and the stock status through the availability block. A small helper that returns nil when an element is missing keeps extraction from crashing on a field that a given listing does not carry.

ruby
require 'nokogiri'

def text_at(doc, selector)
  node = doc.at_css(selector)
  node ? node.text.strip : nil
end

def parse_product(html)
  doc = Nokogiri::HTML(html)

  title = text_at(doc, '#productTitle')
  price = text_at(doc, '.a-price .a-offscreen')
  rating = text_at(doc, '#acrPopover .a-icon-alt')
  availability = text_at(doc, '#availability')

  {
    title: title,
    price: price,
    rating: rating,
    availability: availability
  }
end

The text_at helper queries one element and returns its trimmed text, or nil when the element is absent, so a missing field never raises on a .text call against nothing. For the price, .a-price .a-offscreen targets the screen-reader copy of the price, which is the cleanest single value across Amazon's various price layouts. The rating reads from the #acrPopover tooltip text (a string like "4.6 out of 5 stars"), and availability comes from the #availability block. The legacy version of this tutorial used the older #priceblock_ourprice id; Amazon has since moved to the .a-price structure, which is why verifying selectors against a live page matters.
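The parser returns strings such as "$49.99" and "4.6 out of 5 stars". If you plan to sort or compare records, a small normalization step helps. This is a sketch that assumes US-style number formatting (comma as thousands separator, dot as decimal point); the helper names parse_price and parse_rating are made up for this example.

```ruby
# Hypothetical helpers: turn the scraped strings into numbers.
# Assumes US-style formatting such as "$1,299.00".
def parse_price(text)
  return nil if text.nil?

  match = text.delete(',').match(/\d+(?:\.\d+)?/)
  match ? match[0].to_f : nil
end

# Expects the tooltip format "4.6 out of 5 stars".
def parse_rating(text)
  return nil if text.nil?

  match = text.match(/\A\s*(\d+(?:\.\d+)?)\s+out of\s+5/)
  match ? match[1].to_f : nil
end

puts parse_price('$1,299.00')
puts parse_rating('4.6 out of 5 stars')
```

Floats are fine for display and rough comparison; for money arithmetic, integer cents or a decimal type is the safer choice. Other marketplaces format prices differently, so the regex would need adjusting there.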

Selectors drift

Amazon ships multiple price and availability layouts depending on the product, the deal type, and the region, and it adjusts class names over time. Treat the selectors above as a starting template, not a contract. When a field comes back as nil, open the live product page in your browser's dev tools, find the current id or class, and update the selector. Periodic selector maintenance is normal for any production scraper, not a sign something is broken.
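One way to soften selector drift is to try several selectors in order and take the first that yields text. The sketch below assumes doc is any object that responds to at_css and returns a node with a text method, such as a Nokogiri document. The helper name first_text is hypothetical, and the selector list in the usage comment only reuses the two price selectors mentioned in this guide.

```ruby
# Hypothetical helper: return the trimmed text of the first selector that matches,
# or nil when none of them produce any text.
def first_text(doc, selectors)
  selectors.each do |selector|
    node = doc.at_css(selector)
    next if node.nil?

    text = node.text.strip
    return text unless text.empty?
  end
  nil
end

# Usage with a Nokogiri document (requires the nokogiri gem):
#   doc = Nokogiri::HTML(html)
#   price = first_text(doc, ['.a-price .a-offscreen', '#priceblock_ourprice'])
```

Put the selector you trust most first; the later entries only run when the earlier ones miss, so adding a fallback costs nothing on pages that still match.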

Step 3: Put it together

Now wire the fetch and the parse into one runnable script. Fetch the product page, hand the HTML to the parser, and print the structured record.

ruby
require 'crawlbase'
require 'nokogiri'
require 'json' # JSON.pretty_generate is not loaded by default

api = Crawlbase::API.new(token: 'YOUR_CRAWLBASE_TOKEN')

def text_at(doc, selector)
  node = doc.at_css(selector)
  node ? node.text.strip : nil
end

def parse_product(html)
  doc = Nokogiri::HTML(html)
  {
    title: text_at(doc, '#productTitle'),
    price: text_at(doc, '.a-price .a-offscreen'),
    rating: text_at(doc, '#acrPopover .a-icon-alt'),
    availability: text_at(doc, '#availability')
  }
end

url = 'https://www.amazon.com/dp/B09B8V1LZ3'
response = api.get(url)

if response.status_code == 200
  product = parse_product(response.body)
  product[:url] = url
  puts JSON.pretty_generate(product)
else
  puts "Request failed: #{response.status_code}"
end

This is the whole scraper. It loads its dependencies, builds the client, fetches the page, parses the four fields, and prints them. JSON.pretty_generate comes from the json library in Ruby's standard library, which is not loaded by default, so make sure require 'json' sits at the top of the script. Swap the url for any Amazon product and the same parser handles it, since the field selectors are page-structure based, not product specific.
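Because the parser is tied to page structure and not to one product, a convenient input is the product ID from the /dp/ path. The sketch below assumes that ID is an ASIN, Amazon's 10-character alphanumeric product identifier, and that you are targeting amazon.com. The helper names product_url and asin_from_url are hypothetical, and only the /dp/ URL shape used in this guide is handled.

```ruby
# Assumed ASIN shape: 10 uppercase letters or digits.
ASIN_PATTERN = /\A[A-Z0-9]{10}\z/.freeze

# Build the short product URL used throughout this guide.
def product_url(asin)
  raise ArgumentError, "Not a valid ASIN: #{asin.inspect}" unless asin.to_s.match?(ASIN_PATTERN)

  "https://www.amazon.com/dp/#{asin}"
end

# Pull the ASIN out of a longer product URL; returns nil when no /dp/ segment is found.
def asin_from_url(url)
  url[%r{/dp/([A-Z0-9]{10})(?:[/?]|\z)}, 1]
end

puts product_url('B09B8V1LZ3')
puts asin_from_url('https://www.amazon.com/Some-Product-Name/dp/B09B8V1LZ3/ref=sr_1_1')
```

Normalizing every input to the short /dp/ form also gives you a stable key for deduplicating records.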

What the output looks like

Run the script with ruby amazon_scraper.rb and you get a clean record, ready to write to JSON, CSV, or a database.

json
{
  "title": "Echo Dot (5th Gen, 2022 release) | Smart speaker with Alexa | Charcoal",
  "price": "$49.99",
  "rating": "4.7 out of 5 stars",
  "availability": "In Stock",
  "url": "https://www.amazon.com/dp/B09B8V1LZ3"
}

From here you can store the record, append it to a CSV, or feed it into a pricing dashboard. Because the parser returns a plain Ruby hash, writing it to any format is a one-liner with the standard library.
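As one example of that, here is a sketch that turns a list of records into CSV text with Ruby's csv library. The column order and the helper name records_to_csv are choices made for this example. On Ruby 3.4 and later, csv ships as a bundled gem, so a Bundler project may need gem 'csv' in its Gemfile.

```ruby
require 'csv'

# Column order for the export; matches the keys the parser returns plus :url.
CSV_FIELDS = %i[title price rating availability url].freeze

# Hypothetical helper: serialize records (an array of hashes) to a CSV string.
def records_to_csv(records)
  CSV.generate do |csv|
    csv << CSV_FIELDS.map(&:to_s)
    records.each { |record| csv << CSV_FIELDS.map { |field| record[field] } }
  end
end

# Sample record for illustration only.
sample = [{ title: 'Echo Dot (5th Gen)', price: '$49.99', rating: nil,
            availability: 'In Stock', url: 'https://www.amazon.com/dp/B09B8V1LZ3' }]
puts records_to_csv(sample)
```

A nil field becomes an empty cell, and titles that contain commas or quotes are escaped by the library, which is the main reason to use it over joining strings by hand. Write the result with File.write('products.csv', records_to_csv(results)).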

Scaling to many products

One product is a demo; a real job runs across a list of products. The cleanest way to scale is to keep your URLs in an array, loop over them, and collect each parsed record. A short pause between requests paces the run so you are not hammering Amazon in a tight loop.

ruby
# Reuses api and parse_product from the Step 3 script; JSON.pretty_generate needs require 'json'.
urls = [
  'https://www.amazon.com/dp/B09B8V1LZ3',
  'https://www.amazon.com/dp/B07FZ8S74R',
  'https://www.amazon.com/dp/B08N5WRWNW'
]

results = []

urls.each do |url|
  response = api.get(url)
  next unless response.status_code == 200

  product = parse_product(response.body)
  product[:url] = url
  results << product
  puts "Scraped: #{product[:title]}"

  sleep 2
end

puts JSON.pretty_generate(results)

The next unless guard skips any URL that did not return a clean 200 so one bad response does not stop the run, and sleep 2 between requests keeps the pace civil. If you have thousands of URLs and want them processed concurrently without managing your own queue, the asynchronous Crawler is built for that. For collecting product URLs in the first place, the companion guide on scraping Amazon product data and the ecommerce web scraping overview both pick up where this leaves off.
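Skipping a failed URL keeps the run moving, but transient failures are often worth a second try. The sketch below retries with an exponentially growing pause. It assumes only what this guide already relies on, a response object with a status_code method; the helper name fetch_with_retry, the attempt count, and the delays are illustrative. The actual request is passed in as a block so the helper stays independent of any client.

```ruby
# Hypothetical helper: call the block until it returns a 200 response or attempts run out.
# Pauses base_delay, then 2x, 4x, ... seconds between attempts. Returns nil on total failure.
def fetch_with_retry(url, attempts: 3, base_delay: 2, sleeper: ->(seconds) { sleep(seconds) })
  attempts.times do |attempt|
    response = yield(url)
    return response if response.status_code == 200

    # No pause after the final attempt.
    sleeper.call(base_delay * (2**attempt)) if attempt < attempts - 1
  end
  nil
end

# Usage inside the loop (requires the crawlbase client from earlier):
#   response = fetch_with_retry(url) { |u| api.get(u) }
#   next if response.nil?
```

The sleeper parameter exists so the pause can be swapped out in tests; in normal use you can ignore it. Collecting the URLs that return nil into a separate list gives you a clean retry queue for a later run.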

Staying unblocked

Even with a trusted IP handling the fetch, Amazon watches for scraper-shaped traffic. A few habits keep a run healthy, and they apply to any hard commercial target.

  • Pace your requests. Spread requests out with a delay between products instead of crawling a list at full speed. The sleep 2 in the loop is a floor, not a ceiling.
  • Lean on rotation. A pool of residential IPs spreads requests across many real-user addresses so no single one trips a rate limit. The Crawling API handles this for you; if you roll your own stack, this is the part to get right.
  • Read the status codes. A run that starts returning challenges or errors is telling you the current rate or IP tier is no longer enough. Treat that as a signal to back off, not noise to ignore.
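The first and third habits can be combined into a delay that reacts to what the run is seeing. This is a sketch of one possible policy, not a recommendation from Crawlbase or Amazon: double the pause after any non-200 response, halve it after a success, and clamp it between a floor and a ceiling. The helper name next_delay and the 2 and 60 second bounds are assumptions.

```ruby
# Hypothetical pacing policy: back off on failures, recover slowly on successes.
def next_delay(current, status_code, floor: 2, ceiling: 60)
  if status_code == 200
    [current / 2.0, floor].max
  else
    [current * 2, ceiling].min
  end
end

# Simulated run: two failures followed by two successes.
delay = 2
[503, 429, 200, 200].each do |code|
  delay = next_delay(delay, code)
  puts "status #{code} -> next pause #{delay}s"
end
```

In the scraping loop you would call sleep(delay) after each request and feed the latest status code back in, so a burst of errors slows the run down on its own.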

For the broader playbook, see how to scrape websites without getting blocked. If you would rather work in another language, the web scraping with Java guide covers the same approach with a different toolchain.

Is scraping Amazon legal?

Whether scraping Amazon is allowed depends on Amazon's terms of service, your jurisdiction, and what you do with the data. Amazon's Conditions of Use restrict automated access, so scraping can run against those terms regardless of how careful your tooling is. None of the code here changes that; it just makes the technical part work. Read Amazon's Conditions of Use and its robots.txt, and treat both as the boundary for what you collect.

A few lines are worth holding to. Collect only public data: the product titles, prices, ratings, and availability that anyone can see on a product page without an account. Keep your request volume low enough that you are not straining Amazon's servers. Avoid personal data, including anything tied to identifiable shoppers, reviewers, or sellers beyond what is publicly listed, and do not redistribute copyrighted media such as product images or review text as if it were your own. If you plan to reuse the data commercially, get permission or an official agreement rather than assuming silence is consent.

This guide is deliberately scoped to public product pages because that is the line that keeps the work defensible. It does not cover anything behind a login, account or order data, payment or checkout flows, or any attempt to bypass authentication. For licensed or bulk access, Amazon offers official APIs through its Product Advertising and Selling Partner programs, and that is the right tool when you need large volumes, guaranteed structure, or commercial rights. If your project needs more than public listings, an official API or a data agreement is the correct path, not a cleverer scraper.

Recap

Key takeaways

  • Amazon blocks naive requests. A bare HTTP client gets CAPTCHAs and robot checks, so you need a request that arrives from a trusted IP and clears those defenses.
  • The Crawling API handles the hard part. One api.get(url) call fetches the page behind rotating residential IPs and the CAPTCHA layer, so you skip running your own proxy pool and browser fleet.
  • Nokogiri does the extraction. Load the HTML and map title, price, rating, and availability to current selectors like #productTitle and .a-price .a-offscreen, and expect those selectors to drift.
  • Scale with a loop and a delay. Iterate a list of product URLs, guard on the status code, pace requests with sleep, and reach for the async Crawler at high volume.
  • Stay on public data. Respect Amazon's terms and robots.txt, prefer an official Amazon API for licensed or bulk data, and never touch accounts, orders, or personal information.

Frequently Asked Questions (FAQs)

Why does a plain Ruby request fail on Amazon?

Amazon flags automated traffic fast. A bare Net::HTTP or open-uri request from a datacenter IP usually gets a CAPTCHA, a robot-check page, or an outright block instead of the product markup. To get real data you need a request that arrives from an IP Amazon reads as a real shopper and clears the bot checks, which is what the Crawling API handles for you.

Do I need the normal token or the JS token for Amazon?

Start with the normal token. For a standard Amazon product page, the normal token returns the title, price, rating, and availability you need, and it costs fewer credits. Switch to the JavaScript (JS) token only if a specific field you want renders client-side and comes back empty with the normal token.

Which selectors should I use for the price?

Use .a-price .a-offscreen, which targets the screen-reader copy of the price and is the most consistent single value across Amazon's price layouts. The older #priceblock_ourprice id from past tutorials no longer matches most pages. Because Amazon ships several price layouts, verify the selector against the live product before a large run.

How do I scrape many Amazon products at once?

Keep your product URLs in an array, loop over them, parse each page with the same function, and collect the records. Guard on the status code so one bad response does not stop the run, and add a short sleep between requests to pace it. For thousands of URLs, the asynchronous Crawler processes them concurrently without you managing a queue.

Can I scrape order, account, or other login-only data from Amazon?

This guide does not cover login-walled data. Order history, account details, and checkout flows sit behind authentication, so they are not public data, and scraping them or bypassing the login runs against Amazon's terms. For sanctioned access to richer data, the correct route is an official Amazon API or a partner agreement.

Why use Nokogiri instead of a regex to parse the HTML?

Nokogiri parses the page into a proper DOM, so you select fields by CSS selector or XPath and get reliable results even when the markup around them shifts. Regular expressions against HTML break the moment Amazon reorders attributes or nests an element differently. For any real scraper, a parser like Nokogiri is both more robust and easier to maintain.
