Stack Overflow is one of the largest public knowledge bases for developers, and every question listing page carries structured signals worth collecting: the question title, the tags it was filed under, its vote count, how many answers it has, how many people viewed it, and a link to the full thread. Aggregated across a tag, that data tells you which topics are heating up, which problems go unanswered, and how a technology's questions trend over time.
This guide shows you how to scrape Stack Overflow questions with JavaScript and Node.js using cheerio. You build a small, runnable scraper that fetches a public question listing page through the Crawling API, parses one record per question, handles pagination across a tag, and exports the results to JSON and CSV. The whole walkthrough stays scoped to public listing data, and the legality section near the end is not boilerplate, so read it before you point this at any real volume.
What you will build
A Node.js script that takes a public Stack Overflow tag URL, retrieves the page HTML through the Crawling API, and extracts a structured record for each question in the listing. We use the javascript tag as the running example and pull these fields per question:
- Title: the question text, for example "How do I return the response from an asynchronous call?".
- Tags: the list of tags the question was filed under, like "javascript, async-await, promise".
- Votes: the net vote count shown on the summary card.
- Answers: the number of answers, with "0 answers" when none exist yet.
- Views: the view count as displayed on the card.
- Link: the absolute URL to the individual question page.
Why a plain request can fall short on Stack Overflow
Stack Overflow serves a fair amount of the listing markup server-side, so a bare HTTP request gets you further here than on a heavily client-rendered site. The problem is consistency at volume. Stack Overflow watches for automated traffic, and a datacenter IP making rapid, repetitive requests gets rate-limited or served a challenge page instead of question markup. When that happens your parser sees an unexpected layout and the run quietly degrades.
So a reliable Stack Overflow scraper needs an IP the site reads as a real visitor and, on pages that lean on scripts, a browser that renders before you parse. You can assemble that yourself with a pool of rotating residential proxies and a headless browser, but keeping that stack healthy is most of the work. The Crawling API folds both into a single call: you send it the URL, it fetches behind a trusted IP (and renders the page when you pass the JavaScript token), and it returns finished HTML for you to parse.
Crawlbase offers two token types. The normal token fetches static HTML and is enough for the server-rendered Stack Overflow listing used here. The JavaScript (JS) token renders the page in a real browser first, which you reach for when a target loads its content client-side. Start with the normal token for these listing pages; switch to the JS token if a page you target comes back missing fields.
Prerequisites
You need a few things in place before writing any code. None of them take long.
Node.js 16 or later. Confirm your version with node --version. If you do not have it, install it from the Node.js website or through a version manager like nvm.
Basic JavaScript and Node.js. You should be comfortable writing and running a Node script and installing packages with npm. If you are new to Node, the official docs and any beginner course will get you to the level this tutorial assumes. For a fuller walkthrough, see our guide on how to build a web scraper with Node.js.
A Crawlbase account and token. Sign up, open your dashboard, and copy your normal requests token from the account docs page. Treat the token like a password: it authenticates your requests, so keep it out of version control.
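One way to keep the token out of version control is to read it from an environment variable instead of pasting it into the script. The sketch below assumes a variable named CRAWLBASE_TOKEN; that name and the loadToken helper are choices made for this example, not something the Crawlbase client requires.

```javascript
// Read the Crawling API token from the environment instead of hard-coding it.
// CRAWLBASE_TOKEN is a hypothetical variable name chosen for this example.
function loadToken(env = process.env) {
  const token = (env.CRAWLBASE_TOKEN || '').trim();
  if (!token) {
    throw new Error('CRAWLBASE_TOKEN is not set; export it before running the scraper.');
  }
  return token;
}

// Usage with the client introduced later in this guide:
// const api = new CrawlingAPI({ token: loadToken() });
```

Failing fast on a missing token gives a clear error at startup rather than a confusing failed request later.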
Set up the project
Create a project folder, initialize it, and install the two libraries the scraper needs.
```bash
node --version
mkdir stackoverflow-scraper && cd stackoverflow-scraper
npm init -y
npm install crawlbase cheerio
```
Two dependencies do the work: crawlbase is the official Node client for the Crawling API, and cheerio parses the returned HTML with a jQuery-style API so you can pull out individual fields by CSS selector. If selectors are new to you, the primer on XPath and CSS selectors is a good companion.
Step 1: Fetch the question listing page
Start by getting the page. Import the CrawlingAPI class, initialize it with your token, and request the tag URL. Checking the status code before you parse keeps failures loud instead of silent.
```javascript
const { CrawlingAPI } = require('crawlbase');

const api = new CrawlingAPI({ token: 'YOUR_CRAWLBASE_TOKEN' });

// Fetch a page through the Crawling API; return its HTML, or null on failure.
async function crawl(pageUrl) {
  const response = await api.get(pageUrl);
  if (response.statusCode === 200) {
    return response.body;
  }
  console.error(`Request failed: ${response.statusCode}`);
  return null;
}

const tagUrl = 'https://stackoverflow.com/questions/tagged/javascript';

// Print the first 500 characters as a quick sanity check.
crawl(tagUrl).then((html) => {
  console.log(html ? html.slice(0, 500) : 'No HTML returned');
});
```
The tag URL follows a fixed shape: https://stackoverflow.com/questions/tagged/<tag>. Swap javascript for any tag you want to study, like python or node.js. Save the code as scraper.js, run it with node scraper.js, and you should see real question markup, not a challenge page. That confirms the request works before you write a single selector.
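Tags such as c++ or c# contain characters that are not safe to drop into a URL path as-is. A minimal sketch of a URL builder that percent-encodes the tag is shown here; buildTagUrl is a hypothetical helper name, and it assumes Stack Overflow accepts the percent-encoded form of a tag in the path.

```javascript
// Build a tag listing URL, percent-encoding the tag so characters such as
// "+" or "#" survive in the path. buildTagUrl is a hypothetical helper name.
function buildTagUrl(tag) {
  const cleaned = String(tag).trim();
  if (!cleaned) throw new Error('A tag name is required');
  return `https://stackoverflow.com/questions/tagged/${encodeURIComponent(cleaned)}`;
}

console.log(buildTagUrl('node.js')); // https://stackoverflow.com/questions/tagged/node.js
console.log(buildTagUrl('c++'));     // https://stackoverflow.com/questions/tagged/c%2B%2B
```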
That single api.get call is doing more than a plain request would: it fetches the tag page behind a trusted IP and rotates through residential addresses server-side, so Stack Overflow reads your traffic as a real visitor instead of a scraper to throttle. You skip running a headless browser fleet and a proxy pool yourself, and when a target needs rendering you just add the JavaScript token. Point it at a public tag page on the free tier first.
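Eyeballing the first 500 characters works once, but a script can make the same check on every response. The sketch below is a rough heuristic, not a guarantee: it looks for the questions container id and the js-post-summary class that this guide's selectors depend on, and both markers are assumptions about the current markup. looksLikeListing is a hypothetical helper name.

```javascript
// Markers this guide's selectors rely on; treat them as assumptions about
// the current Stack Overflow markup, not a stable contract.
const LISTING_MARKERS = [/id=["']questions["']/, /js-post-summary/];

// Return true when the HTML appears to be a question listing rather than a
// challenge or error page.
function looksLikeListing(html) {
  if (typeof html !== 'string' || html.length === 0) return false;
  return LISTING_MARKERS.every((marker) => marker.test(html));
}

// Usage: skip parsing and log loudly when the check fails.
// const html = await crawl(tagUrl);
// if (!looksLikeListing(html)) console.error('Unexpected page layout');
```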
Step 2: Parse each question with cheerio
With the HTML in hand, load it into cheerio and walk the question cards. Stack Overflow lays out each question in a repeating .js-post-summary block inside #questions, so you select every summary, then read the title, tags, votes, answers, views, and link from inside it. The .replace(/\s+/g, ' ').trim() chain collapses the whitespace Stack Overflow pads its markup with into clean single-spaced text.
```javascript
const cheerio = require('cheerio');

// Collapse runs of whitespace into single spaces and trim the ends.
const clean = (text) => text.replace(/\s+/g, ' ').trim();

function parseQuestions(html) {
  const $ = cheerio.load(html);
  const questions = [];

  // Each question card is a .js-post-summary block inside #questions.
  $('#questions .js-post-summary').each((_, element) => {
    const el = $(element);

    const title = clean(el.find('.s-post-summary--content-title').text());
    const link = el.find('.s-link').attr('href') || '';

    // Votes are the first stats item, views the last.
    const votes = clean(
      el.find('.js-post-summary-stats .s-post-summary--stats-item:first-child').text()
    );
    const answers =
      clean(el.find('.js-post-summary-stats .has-answers').text()) || '0 answers';
    const views = clean(
      el.find('.js-post-summary-stats .s-post-summary--stats-item:last-child').text()
    );

    const tags = el
      .find('.js-post-tag-list-item')
      .map((__, tag) => clean($(tag).text()))
      .get()
      .filter(Boolean);

    questions.push({
      title,
      tags,
      votes,
      answers,
      views,
      // Make relative paths absolute.
      link: link.includes('https://') ? link : `https://stackoverflow.com${link}`,
    });
  });

  return questions;
}
```
A few details keep this faithful to the page. Votes and views both live under .js-post-summary-stats as .s-post-summary--stats-item entries, so the first matches votes and the last matches views. The answers count carries a .has-answers class only when a question has answers, which is why it falls back to '0 answers' when the selector returns empty. Tags come from every .js-post-tag-list-item, mapped into an array so you keep them structured. The link is read from the anchor's href and made absolute, since Stack Overflow returns a relative path like /questions/123/....
Stack Overflow's class names (js-post-summary, s-post-summary--content-title, js-post-tag-list-item, and the rest) can change without notice. Treat the selectors above as a starting template, not a contract. When a field comes back empty, re-inspect the live page in your browser's dev tools and update the selector. Periodic selector maintenance is normal for any production scraper, not a sign something is broken.
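Selector drift is easier to catch when it shows up as a number. The sketch below counts how many parsed records have an empty value for each field; countEmptyFields is a hypothetical helper, and its default field list skips answers and link because the parser in this guide always fills those with a fallback.

```javascript
// Count how many records have an empty value for each field, so selector
// drift shows up as a count instead of a silent gap. Hypothetical helper.
function countEmptyFields(records, fields = ['title', 'tags', 'votes', 'views']) {
  const empty = Object.fromEntries(fields.map((field) => [field, 0]));
  for (const record of records) {
    for (const field of fields) {
      const value = record[field];
      // Arrays are empty when they have no items; anything else when it
      // is missing or whitespace-only.
      const isEmpty = Array.isArray(value)
        ? value.length === 0
        : !String(value ?? '').trim();
      if (isEmpty) empty[field] += 1;
    }
  }
  return empty;
}

// Usage: warn when a field is empty on every card of a page.
// const stats = countEmptyFields(parseQuestions(html));
```

A field that is empty on every record of a page is a strong hint that its selector needs updating; a few empty values are more likely ordinary variation between cards.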
Step 3: Put it together and export
Now wire the fetch and the parse into one runnable script, then write the records to both JSON and CSV. JSON keeps the nested tag array intact for programmatic use; CSV flattens each record into a row for spreadsheets, joining the tags with a separator.
```javascript
const { CrawlingAPI } = require('crawlbase');
const cheerio = require('cheerio');
const fs = require('fs');

const api = new CrawlingAPI({ token: 'YOUR_CRAWLBASE_TOKEN' });

// Collapse runs of whitespace into single spaces and trim the ends.
const clean = (text) => text.replace(/\s+/g, ' ').trim();

// Fetch a page through the Crawling API; return its HTML, or null on failure.
async function crawl(pageUrl) {
  const response = await api.get(pageUrl);
  if (response.statusCode === 200) return response.body;
  console.error(`Request failed: ${response.statusCode}`);
  return null;
}

// Turn listing HTML into one record per question card.
function parseQuestions(html) {
  const $ = cheerio.load(html);
  const questions = [];

  $('#questions .js-post-summary').each((_, element) => {
    const el = $(element);
    const link = el.find('.s-link').attr('href') || '';

    questions.push({
      title: clean(el.find('.s-post-summary--content-title').text()),
      tags: el
        .find('.js-post-tag-list-item')
        .map((__, tag) => clean($(tag).text()))
        .get()
        .filter(Boolean),
      votes: clean(
        el.find('.js-post-summary-stats .s-post-summary--stats-item:first-child').text()
      ),
      answers:
        clean(el.find('.js-post-summary-stats .has-answers').text()) || '0 answers',
      views: clean(
        el.find('.js-post-summary-stats .s-post-summary--stats-item:last-child').text()
      ),
      link: link.includes('https://') ? link : `https://stackoverflow.com${link}`,
    });
  });

  return questions;
}

// Serialize records to CSV: quote every cell and double any embedded quotes.
function toCsv(rows) {
  const headers = ['title', 'tags', 'votes', 'answers', 'views', 'link'];
  const escape = (value) => `"${String(value).replace(/"/g, '""')}"`;
  const lines = [headers.join(',')];

  for (const row of rows) {
    lines.push(
      [
        escape(row.title),
        escape(row.tags.join('|')),
        escape(row.votes),
        escape(row.answers),
        escape(row.views),
        escape(row.link),
      ].join(',')
    );
  }

  return lines.join('\n');
}

async function main() {
  const tagUrl = 'https://stackoverflow.com/questions/tagged/javascript';
  const html = await crawl(tagUrl);
  if (!html) return;

  const questions = parseQuestions(html);
  fs.writeFileSync('questions.json', JSON.stringify(questions, null, 2));
  fs.writeFileSync('questions.csv', toCsv(questions));
  console.log(`Saved ${questions.length} questions to questions.json and questions.csv`);
}

main();
```
Run the full script with node scraper.js. It fetches the tag page, parses every question card, and writes both questions.json and questions.csv to your project folder. The CSV escapes quotes and joins the tag array with a pipe so a question with multiple tags stays in a single cell.
What the output looks like
The JSON file holds one object per question, with the tags kept as a structured array, ready to load into an analysis script or a database.
```json
[
  {
    "title": "How do I return the response from an asynchronous call?",
    "tags": ["javascript", "ajax", "asynchronous", "promise"],
    "votes": "8632 votes",
    "answers": "42 answers",
    "views": "2.1m views",
    "link": "https://stackoverflow.com/questions/14220321/how-do-i-return-the-response-from-an-asynchronous-call"
  },
  {
    "title": "What does \"use strict\" do in JavaScript?",
    "tags": ["javascript", "syntax", "jslint", "use-strict"],
    "votes": "9201 votes",
    "answers": "32 answers",
    "views": "1.0m views",
    "link": "https://stackoverflow.com/questions/1335851/what-does-use-strict-do-in-javascript"
  }
]
```
The CSV mirror of the same data is one header row plus one row per question, with the tags joined into a single pipe-delimited cell.
```
title,tags,votes,answers,views,link
"How do I return the response from an asynchronous call?","javascript|ajax|asynchronous|promise","8632 votes","42 answers","2.1m views","https://stackoverflow.com/questions/14220321/..."
"What does ""use strict"" do in JavaScript?","javascript|syntax|jslint|use-strict","9201 votes","32 answers","1.0m views","https://stackoverflow.com/questions/1335851/..."
```
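The votes, answers, and views fields are stored as display strings, which is faithful to the page but awkward to sort or sum. A minimal sketch of a converter follows; parseCount is a hypothetical helper, and it assumes the only abbreviations are "k" for thousands and "m" for millions, as in the sample output. Abbreviated values are approximate because the page has already rounded them.

```javascript
// Convert display strings such as "8632 votes" or "2.1m views" to numbers.
// Assumes "k" means thousand and "m" means million, attached directly to
// the digits. Returns null when the text contains no number.
function parseCount(text) {
  const match = /(-?\d[\d,]*(?:\.\d+)?)([km])?/i.exec(String(text));
  if (!match) return null;
  const value = Number(match[1].replace(/,/g, ''));
  const suffix = (match[2] || '').toLowerCase();
  const multiplier = suffix === 'm' ? 1e6 : suffix === 'k' ? 1e3 : 1;
  return Math.round(value * multiplier);
}

console.log(parseCount('8632 votes')); // 8632
console.log(parseCount('2.1m views')); // 2100000
console.log(parseCount('0 answers'));  // 0
```

Keeping the original strings in the export and deriving numbers in a separate step means a change in the site's formatting does not silently corrupt stored data.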
Loop through tag pages
One page of questions is a demo; a real job walks the pagination. Stack Overflow exposes the page number through the page query parameter, so you can build each page URL in a loop, fetch it through the Crawling API, parse it with the same function, and collect the rows. Because every listing page shares the same card structure, the parser you already wrote works across all of them without changes.
```javascript
// Fetch and parse the first `totalPages` listing pages for a tag.
async function scrapeTag(tag, totalPages) {
  const all = [];

  for (let page = 1; page <= totalPages; page++) {
    const url = `https://stackoverflow.com/questions/tagged/${tag}?tab=newest&page=${page}`;
    const html = await crawl(url);
    if (html) all.push(...parseQuestions(html));
  }

  return all;
}

scrapeTag('javascript', 3).then((rows) => {
  console.log(`Collected ${rows.length} questions`);
});
```
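A pagination loop can also pause between requests, stop when a page yields nothing, and drop duplicates, since a newest-first listing can shift while you walk it. The sketch below is one way to do that under stated assumptions: scrapePaced is a hypothetical helper, the 2-second default delay is an arbitrary starting point rather than a documented limit, and the fetch and parse functions are passed in so you can supply the crawl and parseQuestions functions from this guide.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Walk listing pages with a pause between requests, stop at the first page
// that yields no records, and drop duplicate questions by link.
async function scrapePaced({ buildUrl, fetchPage, parsePage, maxPages, delayMs = 2000 }) {
  const seen = new Set();
  const rows = [];

  for (let page = 1; page <= maxPages; page++) {
    const html = await fetchPage(buildUrl(page));
    const pageRows = html ? parsePage(html) : [];
    if (pageRows.length === 0) break; // end of listing, failed fetch, or unexpected layout

    for (const row of pageRows) {
      if (seen.has(row.link)) continue; // already collected on an earlier page
      seen.add(row.link);
      rows.push(row);
    }

    if (page < maxPages) await sleep(delayMs);
  }

  return rows;
}

// Usage with the functions from this guide:
// scrapePaced({
//   buildUrl: (page) =>
//     `https://stackoverflow.com/questions/tagged/javascript?tab=newest&page=${page}`,
//   fetchPage: crawl,
//   parsePage: parseQuestions,
//   maxPages: 3,
// }).then((rows) => console.log(`Collected ${rows.length} questions`));
```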
To enrich each row with the full question body, accepted answer, or comment thread, take the link from each card and fetch that individual question page through the same crawl function, then write a small parser for the question layout. The pattern is identical: fetch, then parse. For more on rendering-heavy targets, see how to crawl JavaScript websites.
Staying unblocked
Even with a trusted IP handled for you, Stack Overflow watches for scraper-shaped traffic. A few habits keep a run healthy, and they apply to any site you scrape at volume.
- Pace your requests. Hammering pages in a tight loop is the fastest way to get throttled. Spread requests out and vary your tags instead of crawling one path at full speed.
- Lean on rotation. A pool of residential IPs spreads requests across many real-user addresses so no single one trips a rate limit. The Crawling API handles this for you; if you roll your own stack, this is the part to get right.
- Read the status codes. A run that starts returning challenges or errors is telling you the current rate or IP tier is no longer enough. Treat that as signal to back off, not noise to ignore.
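Backing off can be as simple as retrying a failed fetch with a growing delay and giving up after a few attempts. The following is a minimal sketch, assuming the fetch function resolves to null on a non-200 status the way the crawl function in this guide does; fetchWithBackoff, backoffDelay, and the default timings are illustrative choices, not recommended values.

```javascript
const wait = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Delay before retry number `attempt` (0-based): base, 2x base, 4x base, ...
// capped at maxMs.
function backoffDelay(attempt, baseMs = 1000, maxMs = 30000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Call fetchPage(url) until it returns HTML or the retries run out.
// fetchPage is expected to resolve to null on failure.
async function fetchWithBackoff(fetchPage, url, { retries = 3, baseMs = 1000 } = {}) {
  for (let attempt = 0; attempt <= retries; attempt++) {
    const html = await fetchPage(url);
    if (html) return html;
    if (attempt < retries) await wait(backoffDelay(attempt, baseMs));
  }
  return null; // still failing: stop and investigate rather than push harder
}

// Usage with the crawl function from this guide:
// const html = await fetchWithBackoff(crawl, tagUrl);
```

Returning null after the last attempt, instead of retrying forever, keeps a blocked run from turning into a sustained stream of requests.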
For the broader playbook, see how to scrape websites without getting blocked. If you want to compare parsing stacks beyond cheerio, the rundown of top open source scraping libraries is a useful map. And if you are collecting developer-community data more broadly, the same fetch-then-parse pattern carries over to scraping GitHub repositories and profiles.
Is it legal to scrape Stack Overflow?
Whether scraping Stack Overflow is allowed depends on its terms of service, your jurisdiction, and what you do with the data. Stack Overflow, part of the Stack Exchange network, publishes a public network terms of service and an Acceptable Use Policy that restrict automated access, so scraping can run against those terms regardless of how careful your tooling is. None of the code here changes that; it just makes the technical part work. Read the Stack Exchange terms and the site's robots.txt, and treat both as the boundary for what you collect and how fast you collect it.
Before you write a scraper at all, check whether the sanctioned path covers your need, because for Stack Overflow it often does. Stack Exchange offers an official Stack Exchange API that returns questions, tags, votes, answers, and views as clean JSON, and it publishes periodic data dumps of the full public content under a Creative Commons license. For research, analysis, or anything at volume, the API and the data dumps are the right tools: they are structured, rate-limited on terms you agree to, and they keep you on the right side of the network's policies. Reach for scraping only for small, public, one-off needs the API does not serve.
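For comparison, here is a rough sketch of the same listing pulled from the Stack Exchange API. It assumes API version 2.3 and its /questions endpoint, maps the API's score, answer_count, and view_count fields onto the record shape used in this guide, and relies on the global fetch available in Node.js 18 or later. The function names are hypothetical, and anonymous requests are subject to the API's own quota, so check the official documentation before relying on it.

```javascript
// Build a Stack Exchange API v2.3 request for questions under a tag.
function buildApiUrl(tag, page = 1, pageSize = 50) {
  const url = new URL('https://api.stackexchange.com/2.3/questions');
  url.search = new URLSearchParams({
    site: 'stackoverflow',
    tagged: tag,
    order: 'desc',
    sort: 'creation',
    page: String(page),
    pagesize: String(pageSize),
  }).toString();
  return url.toString();
}

// Map one API question object onto the record shape used in this guide.
// Counts arrive as numbers here, not display strings.
function toRecord(item) {
  return {
    title: item.title,
    tags: item.tags,
    votes: item.score,
    answers: item.answer_count,
    views: item.view_count,
    link: item.link,
  };
}

// Requires Node.js 18+ for the global fetch. Defined but not called here.
async function fetchQuestions(tag, page = 1) {
  const response = await fetch(buildApiUrl(tag, page));
  if (!response.ok) throw new Error(`API request failed: ${response.status}`);
  const data = await response.json();
  return data.items.map(toRecord);
}

// Usage:
// fetchQuestions('javascript').then((rows) => console.log(rows.length));
```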
Keep the work scoped to public, non-personal data. Question titles, tags, and the aggregate vote, answer, and view counts used in this guide are public listing signals. User content is a different matter: usernames, reputation, profile details, and the text people write are personal data, and republishing an individual's content or tying it to their identity can trigger obligations under privacy laws like the GDPR and CCPA, including a lawful basis to process and honoring deletion requests. This guide does not cover anything behind a login, private inbox messages, or building profiles of identifiable users. Aggregate where you can, and prefer the official API or data dumps whenever your project touches user-level data.
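Aggregation works directly on the records this guide produces, with no user-level fields involved. As an illustration, the sketch below counts how often each tag appears across a set of scraped questions; countTags is a hypothetical helper name.

```javascript
// Count how often each tag appears across scraped records and return
// [tag, count] pairs sorted by count (descending), then alphabetically.
function countTags(records) {
  const counts = new Map();
  for (const record of records) {
    for (const tag of record.tags || []) {
      counts.set(tag, (counts.get(tag) || 0) + 1);
    }
  }
  return [...counts.entries()].sort(
    (a, b) => b[1] - a[1] || a[0].localeCompare(b[0])
  );
}

// Usage: print the five most common tags.
// console.log(countTags(questions).slice(0, 5));
```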
Key takeaways
- Fetch behind a trusted IP. Stack Overflow throttles scraper-shaped traffic, so the Crawling API fetches each tag page from a rotating residential IP and returns clean HTML to parse.
- cheerio does the extraction. Select every .js-post-summary card inside #questions, then map title, tags, votes, answers, views, and the link to current selectors, and expect those selectors to drift.
- Keep tags structured. Read every .js-post-tag-list-item into an array so a question's tags stay queryable in JSON and collapse to one cell in CSV.
- Scale by looping pages. The page parameter walks a tag's listing, and the same parser works across every page with sensible pacing.
- Prefer the official path. The Stack Exchange API and CC-licensed data dumps are the sanctioned route for volume; stay on public data, respect the ToS and robots.txt, and avoid user-level personal data.
Frequently Asked Questions (FAQs)
Do I need the normal token or the JS token for Stack Overflow?
The normal token is enough for the question listing pages in this guide, because Stack Overflow serves that markup server-side. Reach for the JS token when a target page loads its content client-side and comes back missing fields. Start with the normal token here, and switch only if a page you scrape returns empty selectors.
What fields can I extract from a Stack Overflow question listing?
From each summary card you can pull the question title, the tags it was filed under, the net vote count, the number of answers, the view count, and the link to the full question. This guide maps each of those to a CSS selector and assembles them into one record per question, exported to JSON and CSV.
My selectors return empty values. What changed?
Almost certainly Stack Overflow's markup. Its js-post-summary card classes, s-post-summary--content-title title wrapper, and js-post-tag-list-item tag markers can change without notice. Re-inspect a live page in your browser's dev tools and update the selectors. Periodic selector maintenance is normal for any production scraper.
Should I use the Stack Exchange API or scrape the site?
If you need volume, guaranteed structure, or the full public content, use the official Stack Exchange API or its Creative Commons data dumps. They are built for that and keep you on the right side of the network's terms. Scraping public listing pages with the approach in this guide fits small, public-data needs the API does not serve, as long as you respect the ToS, robots.txt, and rate limits.
Can I scrape user profiles or reputation from Stack Overflow?
This guide deliberately does not. Usernames, reputation, and the content people write are personal data, and building profiles of identifiable users can trigger obligations under privacy laws like the GDPR and CCPA. Stay on public listing signals such as titles, tags, and aggregate counts, aggregate where you can, and use the official API if your project genuinely needs user-level data.
How do I avoid getting blocked while scraping Stack Overflow?
Keep your per-IP request rate low, vary your tags instead of looping one path, and route through rotating residential IPs so no single address trips a rate limit. The Crawling API manages rotation and a trusted IP pool for you; if you build your own stack, that is the part to invest in. Watch the status codes and back off when you start seeing challenges.

