IMDb is one of the largest public catalogs of film and television on the open web, holding factual metadata on millions of titles: the name of a film, the year it came out, its aggregate user rating, its genres, how long it runs, and who directed it. Researchers studying release trends, hobbyists building a personal film database, and developers prototyping a recommendation feature all reach for the same public title pages, where that metadata sits in a fairly predictable layout.
This guide shows you how to scrape IMDb movie data with JavaScript and Node.js using Cheerio. You build a small, runnable scraper that fetches a public IMDb title page through the Crawling API, parses the movie title, year, IMDb rating, genre, runtime, and director, and exports the result as JSON and CSV. The whole walkthrough stays scoped to public factual film metadata, and the legality section near the end is not boilerplate, so read it before you point this at any real volume.
What you will build
A Node.js script that takes a public IMDb title URL, retrieves the rendered HTML through the Crawling API, and extracts a structured record for that film. We use The Shawshank Redemption as the running example and pull these factual fields per title:
- Title: the primary movie title shown in the page hero, for example "The Shawshank Redemption".
- Year: the release year listed next to the title.
- Rating: the aggregate IMDb user rating out of 10.
- Genre: the genre chips IMDb assigns to the title, such as "Drama".
- Runtime: the listed duration of the film.
- Director: the credited director of the film.
Why a plain request fails on IMDb
If you request an IMDb title URL with a bare HTTP client, you rarely get the metadata you expect. Two things work against you. First, IMDb renders much of the title page in the browser with JavaScript, so the initial HTML is a thin shell until the page's scripts run and populate the rating, the credits, and the detail rows. Second, IMDb watches for automated traffic: datacenter IPs and request patterns that do not look like a real browser get rate-limited or challenged before they ever reach the rendered page.
So a working IMDb scraper needs two things in one request: a browser that actually renders the page, and an IP the platform reads as a real visitor. You can assemble that yourself with a headless browser plus a pool of rotating residential proxies, but stitching those together and keeping them healthy is most of the work. The Crawling API folds both into a single call: you send it the URL, it renders the page behind a trusted IP, and it returns finished HTML for you to parse with Cheerio.
The Crawling API gives you two tokens: a normal one and a JavaScript one. IMDb populates the rating and credits in the browser, so use your JavaScript token for every request in this guide. The normal token returns the unrendered shell and your selectors will come back empty.
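One way to catch a wrong token early is to check the returned HTML for markers that only appear once the page has rendered. The sketch below assumes the two data-testid values used later in this guide are reasonable markers; `looksRendered` and `RENDER_MARKERS` are hypothetical names for this example, not part of the crawlbase client, and the sample fragments are hand-written, not real IMDb markup.

```javascript
// Hypothetical guard: returns true only when the HTML contains the markers
// that this guide's selectors depend on. The marker list is an assumption
// based on the data-testid values used in Step 2.
const RENDER_MARKERS = [
  'data-testid="hero__pageTitle"',
  'data-testid="hero-rating-bar__aggregate-rating__score"',
];

function looksRendered(html) {
  if (typeof html !== 'string' || html.length === 0) return false;
  return RENDER_MARKERS.every((marker) => html.includes(marker));
}

// Two tiny hand-written fragments standing in for a shell and a rendered page
const shell = '<html><body><div id="__next"></div></body></html>';
const rendered =
  '<h1 data-testid="hero__pageTitle">x</h1>' +
  '<div data-testid="hero-rating-bar__aggregate-rating__score">9.3</div>';

console.log(looksRendered(shell)); // false
console.log(looksRendered(rendered)); // true
```

A title that has no rating yet would fail the second marker, so trim the list to whatever your own run actually needs.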
Prerequisites
You need a few things in place before writing any code. None of them take long.
Basic JavaScript and Node.js. You should be comfortable writing and running a Node script and installing packages with npm. If you are new to Node, the official docs and any beginner course will get you to the level this tutorial assumes. For a fuller walkthrough, our guide to building a web scraper with Node.js covers the basics.
Node.js 16 or later. Confirm your version with node --version. If you do not have it, install it from the Node.js website or through a version manager like nvm.
A Crawlbase account and token. Sign up, open your dashboard, and copy your JavaScript token from the account docs page. The free tier gives you 1,000 requests with no card, and you only pay for successful requests. Treat the token like a password: it authenticates your requests, so keep it out of version control.
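A common way to keep the token out of version control is to read it from an environment variable. This is a minimal sketch: `getToken` is a hypothetical helper, and `CRAWLBASE_JS_TOKEN` is a variable name chosen for this example, not one the crawlbase client reads on its own.

```javascript
// Hypothetical helper: read the token from an environment variable so it
// never lands in source control. CRAWLBASE_JS_TOKEN is a name chosen for
// this example; the env parameter exists only to make the helper testable.
function getToken(env = process.env) {
  const token = (env.CRAWLBASE_JS_TOKEN || '').trim();
  if (!token) {
    throw new Error('Set CRAWLBASE_JS_TOKEN before running the scraper');
  }
  return token;
}

// Usage: pass the result where the guide writes 'YOUR_CRAWLBASE_TOKEN'
// const api = new CrawlingAPI({ token: getToken() });
```

With this in place you would start the script as `CRAWLBASE_JS_TOKEN=your_token node scraper.js`, and a missing token fails immediately instead of producing a confusing API error.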
Set up the project
Create a project folder, initialize it, and install the two libraries the scraper needs.
    node --version
    mkdir imdb-scraper && cd imdb-scraper
    npm init -y
    npm install crawlbase cheerio
Two dependencies do the work: crawlbase is the official Node client for the Crawling API, and cheerio parses the returned HTML with a jQuery-style API so you can pull out individual fields by CSS selector. Create a file named scraper.js in this folder and add the code from the steps below.
Step 1: Fetch the rendered title page
Start by getting the finished page. Import the CrawlingAPI class, initialize it with your JavaScript token, and request a public IMDb title URL. For this example we use The Shawshank Redemption at https://www.imdb.com/title/tt0111161/. Checking the status code before you parse keeps failures loud instead of silent.
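    const { CrawlingAPI } = require('crawlbase');

    // Use your JavaScript token so the page is rendered before it is returned
    const api = new CrawlingAPI({ token: 'YOUR_CRAWLBASE_TOKEN' });
    const imdbPageURL = 'https://www.imdb.com/title/tt0111161/';

    api
      .get(imdbPageURL)
      .then((response) => {
        if (response.statusCode === 200) {
          // Print the start of the body to confirm real title markup came back
          console.log(response.body.slice(0, 500));
        } else {
          // Report non-200 responses instead of dropping them silently
          console.error(`Request failed: ${response.statusCode}`);
        }
      })
      .catch((error) => console.error('API request error:', error));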
Run the script with node scraper.js and you should see real IMDb title markup at the top of the body, not a stripped-down shell. That confirms rendering works before you write a single selector. The Crawling API uses the JavaScript token you supplied to render the page in a real browser, so the rating and credits are present in the HTML you get back.
That first request just returned a fully rendered IMDb title page without a headless browser or a proxy on your side. The Crawling API runs the page in a real browser, rotates through residential IPs server-side, and handles the challenges IMDb throws at automated traffic, so you get finished HTML from one call. Point it at a public title on the free tier first, then add your parser.
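Before spending a request, it can help to confirm that the input really is a public title URL and to strip tracking parameters from it. The sketch below is one possible check, assuming title pages live at `/title/tt<digits>/` on an imdb.com host, as in the example URL above; `toTitleUrl` is a hypothetical helper, and it deliberately rejects other path shapes such as locale-prefixed URLs.

```javascript
// Hypothetical helper: pull the title id (tt + digits) out of an IMDb URL
// and rebuild a canonical title URL, or return null for anything else.
function toTitleUrl(input) {
  let parsed;
  try {
    parsed = new URL(input);
  } catch {
    return null; // not a parseable URL
  }
  // Accept imdb.com and its subdomains only
  const isImdbHost = /(^|\.)imdb\.com$/.test(parsed.hostname);
  if (!isImdbHost) return null;
  // Assumes the path starts with /title/tt<digits>
  const match = parsed.pathname.match(/^\/title\/(tt\d+)(\/|$)/);
  if (!match) return null;
  return `https://www.imdb.com/title/${match[1]}/`;
}

console.log(toTitleUrl('https://www.imdb.com/title/tt0111161/?ref_=nv_sr_1'));
// https://www.imdb.com/title/tt0111161/
console.log(toTitleUrl('https://www.imdb.com/name/nm0001104/')); // null
```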
Step 2: Parse the movie fields with Cheerio
With rendered HTML in hand, load it into Cheerio and read each field by its selector. IMDb marks up most of the metadata you want with stable data-testid attributes, which are friendlier to target than the generated class names. We pull the title and year from the page hero, the rating from the aggregate-rating block, the genre from the chip list, and the runtime and director from the title's detail rows. Reading each field defensively keeps one missing value from crashing the run.
    const cheerio = require('cheerio');

    function parseMovieFromHTML(html) {
      const $ = cheerio.load(html);
      const getText = (selector) => $(selector).first().text().trim();

      // Read every chip in a labelled metadata row, joined into one string
      const getRowItems = (selector) =>
        $(selector)
          .map((_, el) => $(el).text().trim())
          .get()
          .join(', ');

      const title = getText(
        '[data-testid="hero__pageTitle"] .hero__primary-text',
      );

      // The first metadata link under the hero title is the release year
      const year = getText(
        '[data-testid="hero__pageTitle"] + ul li:first-child a',
      );

      const rating = getText(
        '[data-testid="hero-rating-bar__aggregate-rating__score"] span',
      );

      const genre = getRowItems(
        '.ipc-chip-list--baseAlt .ipc-chip__text',
      );

      const runtime = getRowItems(
        '[data-testid="title-techspec_runtime"] .ipc-metadata-list-item__content-container',
      );

      const director = getRowItems(
        'li:contains("Director") a.ipc-metadata-list-item__list-content-item--link:first',
      );

      return { title, year, rating, genre, runtime, director };
    }
A few details keep this faithful to the page. The title comes from the [data-testid="hero__pageTitle"] .hero__primary-text hero element, and the year is the first metadata link directly after it. The aggregate IMDb rating lives in [data-testid="hero-rating-bar__aggregate-rating__score"], the genre chips in the .ipc-chip-list--baseAlt .ipc-chip__text list, and the runtime in the title-techspec_runtime detail row. The director is read from the credits row that contains the label "Director", taking the first linked name. Joining the row items into one string keeps the output flat and easy to store.
IMDb's class names (the ipc-* and hashed suffixes) are generated and change without notice; the data-testid attributes are more stable but not guaranteed. Treat the selectors as a starting template, not a contract. When a field comes back empty, re-inspect the live page in your browser's dev tools and update the selector. Periodic selector maintenance is normal for any production scraper, not a sign something is broken.
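Selector drift is easier to spot when the script reports it instead of quietly writing blank values. A minimal sketch of such a check follows; `findEmptyFields` and `EXPECTED_FIELDS` are hypothetical names, and the sample record is made up to show two blank fields.

```javascript
// Fields the parser in this guide returns
const EXPECTED_FIELDS = ['title', 'year', 'rating', 'genre', 'runtime', 'director'];

// Hypothetical check: list the fields that are missing or blank in a record
function findEmptyFields(record) {
  return EXPECTED_FIELDS.filter((field) => {
    const value = record[field];
    return typeof value !== 'string' || value.trim() === '';
  });
}

// Made-up record with two blank fields, as if two selectors had drifted
const sample = {
  title: 'The Shawshank Redemption',
  year: '1994',
  rating: '',
  genre: 'Drama',
  runtime: '2h 22m',
  director: '',
};

const empty = findEmptyFields(sample);
if (empty.length > 0) {
  console.warn(`Selectors may have drifted for: ${empty.join(', ')}`);
}
```

Calling this on the result of parseMovieFromHTML gives you a short list of fields to re-inspect in dev tools.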
Step 3: Assemble the full script with JSON and CSV export
Now wire the fetch and the parse into one runnable script, then write the record to disk as both JSON and CSV. A plain script keeps the moving parts down; you can wrap it in an endpoint later if you want one.
    const fs = require('fs');
    const { CrawlingAPI } = require('crawlbase');
    const cheerio = require('cheerio');

    const api = new CrawlingAPI({ token: 'YOUR_CRAWLBASE_TOKEN' });

    // Fetch the rendered page; return null on any non-200 response
    async function crawl(pageUrl) {
      const response = await api.get(pageUrl);
      if (response.statusCode === 200) return response.body;
      console.error(`Request failed: ${response.statusCode}`);
      return null;
    }

    // Build a header line and one data line for a single record
    function toCsv(row) {
      const headers = [
        'title',
        'year',
        'rating',
        'genre',
        'runtime',
        'director',
      ];
      // Quote every field, double embedded quotes, and treat missing values as empty
      const escape = (value) => `"${String(value ?? '').replace(/"/g, '""')}"`;
      const values = headers.map((h) => escape(row[h]));
      return [headers.join(','), values.join(',')].join('\n');
    }

    async function main() {
      const url = 'https://www.imdb.com/title/tt0111161/';
      const html = await crawl(url);
      if (!html) return;

      const movie = parseMovieFromHTML(html);
      fs.writeFileSync('movie.json', JSON.stringify(movie, null, 2));
      fs.writeFileSync('movie.csv', toCsv(movie));
      console.log(`Saved ${movie.title} to JSON and CSV`);
    }

    // Surface request or file errors instead of leaving an unhandled rejection
    main().catch((error) => console.error('Scrape failed:', error));
Paste the parseMovieFromHTML function from Step 2 into the same file so main can call it. Run it with node scraper.js and you get two files: movie.json with the full structured record and movie.csv ready to open in a spreadsheet. The toCsv helper quotes every field and doubles any embedded quotes, which matters because titles and genre lists frequently contain commas.
What the output looks like
The JSON file holds one object with the title, year, IMDb rating, genre, runtime, and director.
    {
      "title": "The Shawshank Redemption",
      "year": "1994",
      "rating": "9.3",
      "genre": "Drama",
      "runtime": "2h 22m",
      "director": "Frank Darabont"
    }
The CSV mirrors the same record with a header line, so it drops straight into Excel, Google Sheets, or any data pipeline that reads delimited files.
    title,year,rating,genre,runtime,director
    "The Shawshank Redemption","1994","9.3","Drama","2h 22m","Frank Darabont"
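Both files store every field as a string, which is fine for storage but awkward for sorting or averaging. If you need typed values, a post-processing step along these lines could work. It is a sketch that assumes the runtime looks like "2h 22m", "2h", or "45m"; `runtimeToMinutes`, `toNumberOrNull`, and `normalizeMovie` are hypothetical helpers that the guide's script does not call.

```javascript
// Hypothetical conversion of "2h 22m", "2h", or "45m" to minutes.
// Returns null for any other shape rather than guessing.
function runtimeToMinutes(runtime) {
  const match = /^(?:(\d+)\s*h)?\s*(?:(\d+)\s*m)?$/.exec(String(runtime).trim());
  if (!match || (match[1] === undefined && match[2] === undefined)) return null;
  return Number(match[1] || 0) * 60 + Number(match[2] || 0);
}

// Parse a numeric string, or return null when it is blank or not a number
function toNumberOrNull(value) {
  const n = Number.parseFloat(value);
  return Number.isNaN(n) ? null : n;
}

// Keep the original fields and add typed versions next to them
function normalizeMovie(movie) {
  return {
    ...movie,
    year: toNumberOrNull(movie.year),
    rating: toNumberOrNull(movie.rating),
    runtimeMinutes: runtimeToMinutes(movie.runtime),
    genres: movie.genre ? movie.genre.split(',').map((g) => g.trim()) : [],
  };
}

const normalized = normalizeMovie({
  title: 'The Shawshank Redemption',
  year: '1994',
  rating: '9.3',
  genre: 'Drama',
  runtime: '2h 22m',
  director: 'Frank Darabont',
});
console.log(normalized.runtimeMinutes); // 142
```

Returning null for unrecognized values keeps a layout change visible in the data instead of turning it into a misleading zero.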
Scale to many titles
One title page is a demo; a real job collects metadata across a list of films. Because IMDb film title pages share the same hero and detail-row structure, the parser you already wrote should carry over without changes; if a field comes back empty for a particular title, re-check its selector. Keep a list of title URLs, fetch each through the Crawling API, parse it with the same function, and collect the records. Pace the requests with a short delay so you stay under IMDb's rate limits.
    async function scrapeTitles(urls) {
      const movies = [];

      for (const url of urls) {
        const html = await crawl(url);
        if (!html) continue;

        const movie = parseMovieFromHTML(html);
        movies.push(movie);
        console.log(`Parsed ${movie.title || url}`);

        // Pace requests so you stay under the rate limit
        await new Promise((r) => setTimeout(r, 2000));
      }

      return movies;
    }
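Once scrapeTitles returns an array, the records still need to be written out, and the single-row toCsv from Step 3 does not cover that. The sketch below extends the same quoting rule to many rows; `toCsvRows`, `csvField`, and `CSV_HEADERS` are hypothetical names, and the second sample record is made up to show how embedded quotes, commas, and missing fields come out.

```javascript
const CSV_HEADERS = ['title', 'year', 'rating', 'genre', 'runtime', 'director'];

// Quote every field and double embedded quotes, as toCsv does in Step 3.
// Missing values become empty strings rather than the text "undefined".
const csvField = (value) => `"${String(value ?? '').replace(/"/g, '""')}"`;

// Hypothetical multi-row variant of toCsv: one header line, one line per record
function toCsvRows(rows) {
  const lines = rows.map((row) =>
    CSV_HEADERS.map((h) => csvField(row[h])).join(','),
  );
  return [CSV_HEADERS.join(','), ...lines].join('\n');
}

const csv = toCsvRows([
  {
    title: 'The Shawshank Redemption',
    year: '1994',
    rating: '9.3',
    genre: 'Drama',
    runtime: '2h 22m',
    director: 'Frank Darabont',
  },
  // Made-up record: embedded quotes, commas, and two missing fields
  { title: 'Example "Quoted", Title', year: '2000', rating: '', genre: 'Drama, Crime' },
]);

console.log(csv);
```

In the full script you would pass the array from scrapeTitles to toCsvRows and write the result with fs.writeFileSync, the same way Step 3 writes movie.csv.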
For a larger backlog of titles you do not want to wait on synchronously, the async Crawler lets you push URLs and collect results without holding a connection open per request. For more on rendered, JavaScript-heavy pages like these, see our guide to crawling JavaScript websites.
Staying unblocked
Even with rendering handled, IMDb watches for scraper-shaped traffic. A few habits keep a run healthy, and they apply to any large public site.
- Pace your requests. Introduce a delay between fetches rather than hammering the site in a tight loop. Spreading requests out is the single biggest factor in staying under IMDb's rate limits.
- Lean on rotation. A pool of residential IPs spreads requests across many real-user addresses so no single one trips a limit. The Crawling API handles this for you; if you roll your own stack, this is the part to get right.
- Read the status codes. A run that starts returning challenges or non-200 responses is telling you the current rate or IP tier is no longer enough. Treat that as signal to back off, not noise to ignore.
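Backing off can be as simple as doubling the wait after each failed attempt. The sketch below shows one such schedule, assuming a fetch function that resolves to HTML or null, like crawl in Step 3; `backoffDelay`, `sleep`, and `fetchWithBackoff` are hypothetical helpers, and the base delay, cap, and attempt count are illustrative values rather than limits published by IMDb or Crawlbase.

```javascript
// Hypothetical backoff schedule: double the wait after each failed attempt,
// capped at maxMs. The 2000 ms base matches the delay used in scrapeTitles.
function backoffDelay(attempt, baseMs = 2000, maxMs = 60000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retry a fetch function that resolves to HTML or null (like crawl in Step 3).
// fetchPage is passed in, so this sketch does not depend on any client library.
async function fetchWithBackoff(fetchPage, url, maxAttempts = 4) {
  for (let attempt = 0; attempt < maxAttempts; attempt += 1) {
    const html = await fetchPage(url);
    if (html) return html;
    // Wait before the next attempt, but not after the last one
    if (attempt < maxAttempts - 1) await sleep(backoffDelay(attempt));
  }
  return null; // give up so the caller can skip this URL
}

console.log([0, 1, 2, 3].map((n) => backoffDelay(n))); // [ 2000, 4000, 8000, 16000 ]
```

Inside scrapeTitles you could swap `await crawl(url)` for `await fetchWithBackoff(crawl, url)` so a run of failures slows the loop down instead of repeating at the same pace.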
For the broader playbook, see how to scrape websites without getting blocked. If you want similar metadata from other entertainment sources, the same fetch-then-parse pattern carries straight over to scraping Rotten Tomatoes and Goodreads ratings.
Is it legal to scrape IMDb?
Whether scraping IMDb is allowed depends on IMDb's Conditions of Use, your jurisdiction, and what you do with the data. IMDb's terms restrict automated access and the reuse of its content, so scraping can run against those terms regardless of how careful your tooling is. None of the code here changes that; it just makes the technical part work. Read IMDb's Conditions of Use and its robots.txt, respect any rate expectations they state, and treat both as the boundary for what you collect. Limited collection of public factual fields for personal research is a very different thing from extensive or commercial-scale extraction, which IMDb does not permit without explicit permission.
This guide is deliberately scoped to public factual film metadata: the title, release year, aggregate user rating, genre, runtime, and credited director that anyone can see on a public title page without logging in. That is factual catalog data, not personal data, and it is the safe scope to stay within. What it does not cover is the copyrighted material on the same pages. Plot synopses, user reviews, editorial text, posters, and stills are protected content. Do not redistribute reviews, synopses, or images wholesale, and do not republish them as if they were yours. Keep your use to the small set of factual fields, and keep the volume modest.
If your project needs more than a handful of public fields, the sanctioned route is the right one, not a cleverer scraper. IMDb publishes official, licensable datasets for non-commercial use and runs commercial data licensing through IMDb and its parent for production needs. Those are the correct tools when you want large volumes, guaranteed structure, or the right to reuse the data commercially, and they come with clear usage and attribution terms. When you are unsure whether a use is allowed, get a data agreement rather than assuming silence is consent.
Key takeaways
- IMDb renders metadata client-side. A plain request returns a thin shell, so you must render the page behind a trusted IP, using the JavaScript token, before you parse it.
- The Crawling API does both in one call. It renders the page in a real browser and rotates residential IPs, returning finished HTML you parse with Cheerio.
- Cheerio extracts the fields. Target the hero title, the aggregate-rating block, the genre chips, and the runtime and director detail rows, preferring the data-testid attributes and expecting generated class names to drift.
- Scale and export. Reuse the same parser across a list of title URLs, pace your requests, and write structured records to both JSON and CSV.
- Stay on public factual data. Collect only title, year, rating, genre, runtime, and director; never redistribute reviews, synopses, or images; respect the Conditions of Use and robots.txt; and prefer IMDb's official dataset or licensed feed for volume or commercial use.
Frequently Asked Questions (FAQs)
Can I build an IMDb scraper in a language other than JavaScript?
Yes. This guide uses JavaScript with Cheerio, but the same approach works in any language. The Crawling API has libraries and SDKs for several languages, so you fetch the rendered HTML the same way and parse it with whatever HTML parser your stack prefers, such as BeautifulSoup in Python. The selectors and fields stay the same; only the parsing syntax changes.
Why does a plain request return incomplete data from IMDb?
Because IMDb populates much of the title page in the browser with JavaScript and watches for automated traffic. A raw HTTP request from a datacenter IP usually returns a thin shell without the rating and credits, or a challenge page. To get a complete page you have to render it behind a trusted IP, which is what the Crawling API handles for you when you use the JavaScript token.
My selectors return empty values. What changed?
Almost certainly IMDb's markup. Its generated ipc-* class names change without notice, so selectors that worked last month can break. Prefer the more stable data-testid attributes where they exist, re-inspect a live page in your browser's dev tools, update the selectors in parseMovieFromHTML, and you are back in business. Periodic selector maintenance is normal for any production scraper.
Does IMDb have an official API or dataset?
IMDb does not offer a general-purpose public API, but it does publish official datasets you can download for personal and non-commercial use, and it licenses data commercially through IMDb and its parent. For production needs, large volumes, or commercial reuse, the licensed dataset or feed is the correct, sanctioned route. This public-data scraper is best for research, prototyping, and smaller-scale analysis where an official agreement is not warranted.
Can I scrape reviews, plot synopses, and posters too?
That is out of scope for this guide. Reviews, synopses, editorial text, posters, and stills are copyrighted content, and redistributing them wholesale can infringe that copyright even though you can see them on a public page. Keep your collection to the factual fields covered here (title, year, rating, genre, runtime, and director), and use IMDb's official dataset or a license if you need the protected material.
Will I get blocked while scraping IMDb?
You can, if you send too many requests too fast from one address. The Crawling API reduces that risk by rotating through residential IPs for you, but you should still pace your requests, add delays between fetches, and watch the status codes so you can back off when challenges appear. Those habits matter on any large public site.
