Google News aggregates headlines from thousands of publishers into one constantly refreshing feed, which makes it a tempting source if you are tracking a topic, watching how a story spreads, or building a dataset of who reported what and when. The catch is that the page is built for browsers, not scripts: it renders client-side and challenges automated traffic, so a bare HTTP request hands you a near-empty shell instead of the headlines you came for.
This guide shows you how to scrape Google News with JavaScript. You will build a small, runnable Node script that fetches the rendered page through the Crawlbase Smart Proxy (also called the Smart AI Proxy), then parses headlines, publishers, publish times, and authors with Cheerio and writes them out as JSON. Everything here is scoped to public results that anyone can see without logging in, and the legality section near the end is worth reading before you point this at any real volume.
Why scrape Google News
A single visit to Google News tells you what the feed looks like at one moment. The value of scraping it is turning that moving stream into structured rows you can store, query, and compare over time. A few concrete uses:
- Trend and topic tracking. Watch which stories surface for a query and how coverage shifts across the day, useful for research, journalism, and SEO planning.
- Competitor and brand monitoring. Collect mentions of a company, product, or executive across many outlets in one pass instead of checking sites by hand.
- Market and financial signal. Aggregate headlines around a sector or ticker to feed a dashboard or a model.
- Content curation. Pull fresh, relevant headlines into a newsletter or internal feed without manually trawling sources.
In every case the output has the same shape: rows of headline, publisher, publish time, and author (plus the article link, if you add it) that you can store and query. That is the data this scraper produces.
What data you can extract
Before writing a selector, it helps to know which fields live on a Google News results page and are worth pulling. Each article card in the feed exposes a predictable set of values:
- Headline. The article title as it appears in the feed.
- Publisher. The source outlet that ran the story (CNN, The Hill, and so on).
- Publish time. A relative timestamp such as "21 minutes ago" or "9 hours ago".
- Author. The byline when the card carries one; many cards do not, so expect blanks.
- Article link. The href on the headline, which points to the source publication.
That set covers most monitoring and research jobs. We will extract headline, publisher, time, and author in the code below; pulling the link is a one-line addition once you have the card in hand.
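Because the publish time arrives as a relative string, comparing rows over time usually means converting it to an absolute timestamp at capture time. The following is a minimal sketch of that conversion, not part of the scraper itself: the function name, the unit table, and the set of phrasings it recognizes are assumptions, and anything it does not recognize returns null so you can keep the raw string.

```javascript
// Sketch: turn "21 minutes ago" into an ISO timestamp relative to capture time.
// The unit table and the function name are illustrative assumptions.
const UNIT_MS = {
  minute: 60 * 1000,
  hour: 60 * 60 * 1000,
  day: 24 * 60 * 60 * 1000,
}

function relativeToIso(text, capturedAt = new Date()) {
  const match = /^(\d+)\s+(minute|hour|day)s?\s+ago$/i.exec(text.trim())
  if (!match) return null // unknown format: keep the raw string instead
  const amount = Number(match[1])
  const unit = match[2].toLowerCase()
  return new Date(capturedAt.getTime() - amount * UNIT_MS[unit]).toISOString()
}

const capturedAt = new Date('2024-03-01T12:00:00Z')
console.log(relativeToIso('21 minutes ago', capturedAt)) // 2024-03-01T11:39:00.000Z
console.log(relativeToIso('9 hours ago', capturedAt))    // 2024-03-01T03:00:00.000Z
console.log(relativeToIso('Yesterday', capturedAt))      // null
```

If the time element on the live page also carries a machine-readable datetime attribute, reading that is likely more reliable than parsing the visible text; check the markup before relying on either.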
Why a plain fetch fails here
Request a Google News URL with a bare HTTP client and you get a 200 response whose body is mostly an empty frame. Two things work against you. First, the feed renders in the browser: the initial HTML is a shell that fills in only after the page's JavaScript runs. Second, Google flags automated traffic quickly, so datacenter IPs and non-browser request patterns get throttled or challenged before they ever reach the rendered content.
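A 200 status alone does not tell you whether you received the shell or the rendered feed, so it can help to check the body. The sketch below is a rough heuristic under one assumption: that article cards are article elements, as in the selectors used later in this guide. The function names and the threshold are made up for illustration.

```javascript
// Heuristic sketch: count <article> opening tags in a response body.
// Assumes article cards are <article> elements; the threshold is arbitrary.
function countArticleTags(html) {
  const matches = html.match(/<article[\s>]/gi)
  return matches ? matches.length : 0
}

function looksLikeEmptyShell(html, minCards = 1) {
  return countArticleTags(html) < minCards
}

const shell = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'
const rendered = '<html><body><article class="card"><a>Headline</a></article></body></html>'

console.log(looksLikeEmptyShell(shell))    // true
console.log(looksLikeEmptyShell(rendered)) // false
```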
So a working Google News scraper needs two things at once: something that renders the page like a real browser, and an IP the platform reads as a genuine visitor. You can build that yourself with a headless browser plus a pool of rotating IP addresses, but assembling that stack and keeping it healthy is most of the effort. The Crawlbase Smart Proxy folds both into a single endpoint: point your normal HTTP client at it, and it routes the request through rotating residential and datacenter IPs and returns the page for you to parse.
The Smart Proxy is a drop-in proxy endpoint: you keep using your existing HTTP client and only change where it connects. Under the hood it forwards to the Crawling API, so you get the same rendering and IP rotation. If you would rather call an API directly and pass options like JavaScript rendering explicitly, use the Crawling API instead; this guide takes the proxy route because it is the smallest change to a normal request.
Prerequisites
You need three things before writing any code:
- Node.js installed. Node runs JavaScript outside the browser and gives you npm for pulling in libraries. Confirm it with node --version.
- Basic JavaScript. Comfort with variables, functions, and async calls is enough to follow along.
- A Crawlbase token. Sign up for a free account, open the Smart Proxy dashboard, and copy the access token from the connection details. That token acts as your proxy username, so it goes into the request itself.
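Since the token works as a credential, one option is to keep it out of the source file and read it from the environment. This is a small sketch of that idea; CRAWLBASE_TOKEN is a hypothetical variable name chosen for the example, not something Crawlbase prescribes.

```javascript
// Sketch: read the token from the environment instead of hardcoding it.
// CRAWLBASE_TOKEN is a hypothetical variable name chosen for this example.
function getToken(env = process.env) {
  const token = (env.CRAWLBASE_TOKEN || '').trim()
  if (!token) {
    throw new Error('Set CRAWLBASE_TOKEN before running the scraper')
  }
  return token
}

// Demonstrated with a stand-in environment object so it runs anywhere.
console.log(getToken({ CRAWLBASE_TOKEN: ' demo-token ' })) // demo-token
```

In a real run you would call getToken() with no arguments and start the script with the variable set in your shell.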
Set up the project
Create a project folder, initialize it, and install the two libraries the scraper needs.
    node --version
    mkdir google-news-scraper && cd google-news-scraper
    npm init -y
    npm install axios cheerio
Two dependencies do the work: axios makes the HTTP request and is easy to point at a proxy, and cheerio parses the returned HTML with a jQuery-like API on the server. Node's built-in https and fs modules cover the proxy agent and writing files, so there is nothing else to install.
Fetch the rendered page through the Smart Proxy
The first job is getting the finished HTML. You configure an HTTPS agent that points at the Smart Proxy host and port, pass your token as the proxy username, and let axios send the request through it. The proxy renders the page and returns real markup instead of the empty shell a direct fetch gives you. Replace YOUR_CRAWLBASE_TOKEN with the token from your dashboard.
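    const axios = require('axios')
    const https = require('https')
    const fs = require('fs')

    const token = 'YOUR_CRAWLBASE_TOKEN'
    const url = 'https://news.google.com/home?hl=en-US&gl=US&ceid=US%3Aen'

    // Smart Proxy connection details: host, port, and the token as proxy username.
    // Caution: Node's core https.Agent does not document a `proxy` option. If
    // response.html comes back as the empty shell, move these settings to axios's
    // own `proxy` request option instead.
    const agent = new https.Agent({
      proxy: {
        host: 'smartproxy.crawlbase.com',
        port: 8012,
        auth: { username: token },
      },
      rejectUnauthorized: false,
    })

    // Fetch the page, save the body for offline selector work, and return it.
    async function fetchGoogleNews(target) {
      const response = await axios.get(target, { httpsAgent: agent })
      fs.writeFileSync('response.html', response.data)
      console.log('Status:', response.status, '- saved response.html')
      return response.data
    }

    fetchGoogleNews(url).catch((err) => console.error('Fetch failed:', err.message))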
A few details are doing real work here. The agent routes every request through smartproxy.crawlbase.com on port 8012, your token rides along as the proxy username, and rejectUnauthorized: false lets the proxy terminate TLS without a certificate mismatch. Saving the body to response.html means you can fetch once and iterate on selectors against the local file instead of burning a request on every change.
Save the fetch code as scraper.js and run it with node scraper.js. You should see a 200 status and a response.html that contains real article markup, not an empty frame. That confirms rendering and routing are working before you write a single selector.
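To fetch once and then iterate on selectors against the saved file, a cache-first loader is one way to avoid spending a request on every change. The sketch below assumes a fetcher argument, which stands in for any async function that returns the page HTML (fetchGoogleNews would fit); loadHtml and the demo are hypothetical helpers, and the demo uses a fake fetcher and a temp file so it makes no network call.

```javascript
const fs = require('fs')
const os = require('os')
const path = require('path')

// Sketch: reuse a saved HTML file when present, otherwise call the fetcher once.
// `fetcher` stands in for any async function that returns the page HTML.
async function loadHtml(cachePath, fetcher) {
  if (fs.existsSync(cachePath)) {
    return fs.readFileSync(cachePath, 'utf8')
  }
  const html = await fetcher()
  fs.writeFileSync(cachePath, html)
  return html
}

// Demo with a fake fetcher and a temp file, so no network call is made.
async function demo() {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'gnews-'))
  const cachePath = path.join(dir, 'response.html')
  let calls = 0
  const fakeFetcher = async () => {
    calls += 1
    return '<article>stub</article>'
  }
  await loadHtml(cachePath, fakeFetcher) // first run fetches and saves
  await loadHtml(cachePath, fakeFetcher) // second run reads the file
  return calls
}

demo().then((calls) => console.log('fetch calls:', calls)) // fetch calls: 1
```

Delete the cached file whenever you want a fresh copy of the feed.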
Parse the results with Cheerio
With the HTML saved, load it into Cheerio and walk the article cards. Each card carries the fields you want, so the job is mapping headline, publisher, time, and author to the right selector inside the card. Inspect a live results page in your browser's dev tools to confirm the current class names, then wire them up.
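    const fs = require('fs')
    const cheerio = require('cheerio')

    // Map one article card (a Cheerio selection) to a plain object.
    function parseArticle(card) {
      return {
        headline: card.find('a.gPFEn').text().trim(),
        publisher: card.find('.vr1PYe').text().trim(),
        time: card.find('time.hvbAAd').text().trim(),
        author: card.find('.bInasb span[aria-hidden="true"]').text().trim(),
      }
    }

    // Load the HTML and parse every article card on the page.
    function extractArticles(html) {
      const $ = cheerio.load(html)
      const articles = []
      $('article.UwIKyb').each((i, el) => {
        articles.push(parseArticle($(el)))
      })
      return articles
    }

    // Iterate on selectors against the file saved by the fetch step.
    console.log(extractArticles(fs.readFileSync('response.html', 'utf8')))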
The pattern is straightforward: select every article card, then for each one pull the headline link, the publisher label, the <time> element, and the author span. The .trim() on each call strips the stray whitespace Google's markup leaves behind. To capture the link as well, read the headline's href with card.find('a.gPFEn').attr('href') inside parseArticle.
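If the href you read turns out to be relative to the Google News page rather than a full URL, you can resolve it against the page address with Node's built-in URL class. This is a sketch under that assumption: toAbsoluteLink is a hypothetical helper and the sample hrefs are invented for illustration, so check what the live markup actually contains.

```javascript
// Sketch: resolve a possibly relative href against the page URL.
// The base URL and the sample hrefs are illustrative assumptions.
const PAGE_URL = 'https://news.google.com/home?hl=en-US&gl=US&ceid=US%3Aen'

function toAbsoluteLink(href, base = PAGE_URL) {
  if (!href) return ''
  try {
    return new URL(href, base).href
  } catch (err) {
    return '' // malformed href: treat it like a missing link
  }
}

console.log(toAbsoluteLink('./articles/abc123?hl=en-US'))
// https://news.google.com/articles/abc123?hl=en-US
console.log(toAbsoluteLink('https://example.com/story'))
// https://example.com/story
console.log(toAbsoluteLink(undefined)) // prints an empty line
```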
Google's class names (the short hashed strings like gPFEn and UwIKyb) change without notice. Treat the selectors above as a starting template, not a contract. When a field starts returning empty strings, re-inspect a live page in your browser's dev tools and update the selector. This is routine maintenance for any production scraper, not a sign anything is broken.
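One way to notice selector drift early is to measure how often each field comes back empty. The sketch below flags a field only when it is empty in every row, since some blanks (authors, for instance) are expected; the function names and the all-rows threshold are assumptions to tune for your own runs.

```javascript
// Sketch: report the share of empty values per field across parsed rows.
// Flagging only fields that are empty in 100% of rows is an assumption; tune it.
function emptyRates(articles) {
  const rates = {}
  if (articles.length === 0) return rates
  for (const field of Object.keys(articles[0])) {
    const empty = articles.filter((a) => !a[field]).length
    rates[field] = empty / articles.length
  }
  return rates
}

function driftedFields(articles) {
  const rates = emptyRates(articles)
  return Object.keys(rates).filter((field) => rates[field] === 1)
}

const sample = [
  { headline: 'Story A', publisher: '', author: 'Jane Doe' },
  { headline: 'Story B', publisher: '', author: '' },
]

console.log(emptyRates(sample))    // { headline: 0, publisher: 1, author: 0.5 }
console.log(driftedFields(sample)) // [ 'publisher' ]
```

An empty articles array is its own signal: it usually means the card selector itself no longer matches.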
The full scraper
Here is the fetch and parse wired into one runnable file. It requests the page through the Smart Proxy, saves the HTML, parses the cards, and prints the structured results. Drop in your token and run it.
    const axios = require('axios')
    const https = require('https')
    const fs = require('fs')
    const cheerio = require('cheerio')

    const token = 'YOUR_CRAWLBASE_TOKEN'
    const url = 'https://news.google.com/home?hl=en-US&gl=US&ceid=US%3Aen'

    // Smart Proxy connection details. Caution: Node's core https.Agent does not
    // document a `proxy` option; if the HTML comes back as an empty shell, set
    // the proxy through axios's own `proxy` request option instead.
    const agent = new https.Agent({
      proxy: {
        host: 'smartproxy.crawlbase.com',
        port: 8012,
        auth: { username: token },
      },
      rejectUnauthorized: false,
    })

    // Fetch the page and keep a local copy of the HTML.
    async function fetchGoogleNews(target) {
      const response = await axios.get(target, { httpsAgent: agent })
      fs.writeFileSync('response.html', response.data)
      return response.data
    }

    // Map one article card (a Cheerio selection) to a plain object.
    function parseArticle(card) {
      return {
        headline: card.find('a.gPFEn').text().trim(),
        publisher: card.find('.vr1PYe').text().trim(),
        time: card.find('time.hvbAAd').text().trim(),
        author: card.find('.bInasb span[aria-hidden="true"]').text().trim(),
      }
    }

    // Parse every article card on the page.
    function extractArticles(html) {
      const $ = cheerio.load(html)
      const articles = []
      $('article.UwIKyb').each((i, el) => articles.push(parseArticle($(el))))
      return articles
    }

    // Fetch, parse, write articles.json, and echo the results.
    async function main() {
      const html = await fetchGoogleNews(url)
      const articles = extractArticles(html)
      fs.writeFileSync('articles.json', JSON.stringify(articles, null, 2))
      console.log(articles)
    }

    main().catch((err) => console.error('Scrape failed:', err.message))
What the output looks like
Run the full script and you get an array of structured article objects, written to articles.json and echoed to the console. A trimmed sample:
    [
      {
        "headline": "Morning Report: Biden, Trump duel over border during separate Texas stops",
        "publisher": "The Hill",
        "time": "21 minutes ago",
        "author": "Alexis Simendinger & Kristina Karisch"
      },
      {
        "headline": "Takeaways from the dueling border visits",
        "publisher": "CNN",
        "time": "9 hours ago",
        "author": ""
      },
      {
        "headline": "Funeral draws crowds to Moscow church despite tight security",
        "publisher": "CBS News",
        "time": "5 minutes ago",
        "author": "Haley Ott"
      }
    ]
Notice the empty author on the second card. Many Google News entries carry no byline, so blanks are normal, not a parsing bug. Filter or default those fields downstream depending on what your job needs.
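Both downstream options, defaulting and filtering, can be sketched in a few lines. The helper names and the 'Unknown' placeholder are arbitrary choices for illustration; the sample rows reuse values from the output above.

```javascript
// Sketch: two ways to handle blank authors downstream.
// The 'Unknown' placeholder is an arbitrary choice for illustration.
function withDefaultAuthor(articles, fallback = 'Unknown') {
  return articles.map((a) => ({ ...a, author: a.author || fallback }))
}

function onlyWithAuthor(articles) {
  return articles.filter((a) => a.author !== '')
}

const rows = [
  { headline: 'Takeaways from the dueling border visits', publisher: 'CNN', author: '' },
  { headline: 'Funeral draws crowds to Moscow church despite tight security', publisher: 'CBS News', author: 'Haley Ott' },
]

console.log(withDefaultAuthor(rows).map((a) => a.author)) // [ 'Unknown', 'Haley Ott' ]
console.log(onlyWithAuthor(rows).length)                  // 1
```

Neither helper mutates the input array, so you can keep the raw rows alongside the cleaned ones.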
Staying unblocked
Even with rendering and rotation handled, Google watches for scraper-shaped traffic. A few habits keep a run healthy, and they apply to any hard target.
- Pace your requests. Hammering the same feed in a tight loop is the fastest way to get throttled. Spread requests out and vary the query or topic you pull.
- Lean on rotation. Spreading requests across many real-user IPs is what keeps a single address from tripping a rate limit. The Smart Proxy handles this for you; if you roll your own, weigh datacenter against residential proxies and get the rotation right.
- Read the status codes. A run that starts returning challenges or errors is telling you the current rate is too high. Back off rather than pushing through.
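The pacing and back-off habits above can be sketched as a retry wrapper with exponential backoff and random jitter. This is an illustration only: the base delay, cap, and attempt count are placeholder values, not limits published by Google or Crawlbase, and withRetry, backoffDelay, and demoRetry are hypothetical names. The demo uses a fake task, so it makes no network calls.

```javascript
// Sketch: retry an async task with exponential backoff and jitter.
// Base delay, cap, and attempt count are placeholder values, not
// recommendations from Google or Crawlbase.
function backoffDelay(attempt, baseMs = 1000, capMs = 30000) {
  const exponential = Math.min(capMs, baseMs * 2 ** attempt)
  return Math.floor(Math.random() * exponential) // random wait below the ceiling
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

async function withRetry(task, { attempts = 4, baseMs = 1000 } = {}) {
  let lastError
  for (let attempt = 0; attempt < attempts; attempt += 1) {
    try {
      return await task(attempt)
    } catch (err) {
      lastError = err
      // Wait before the next try, but not after the final failure.
      if (attempt < attempts - 1) await sleep(backoffDelay(attempt, baseMs))
    }
  }
  throw lastError
}

// Demo: a fake task that fails twice, then succeeds. No network involved.
async function demoRetry() {
  let calls = 0
  const result = await withRetry(async () => {
    calls += 1
    if (calls < 3) throw new Error('throttled')
    return 'ok'
  }, { baseMs: 5 })
  return { result, calls }
}

demoRetry().then((out) => console.log(out)) // { result: 'ok', calls: 3 }
```

In a real run the task would wrap a single page fetch and throw on a challenge or error response, so each failure lengthens the wait instead of pushing through.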
For the broader playbook, see how to scrape websites without getting blocked. If you are scraping news at scale across many topics, large-scale web scraping covers the orchestration side, and building a search engine tool shows what you can assemble once you have a steady feed of structured results.
Is it legal to scrape Google News?
Whether scraping Google News is allowed depends on Google's terms of service, your jurisdiction, and what you do with the data, so treat this as guidance and not legal advice. Google News is an aggregator: the headlines and snippets point to content owned by individual publishers, and Google's terms place limits on automated access. None of the code here changes that. It only makes the technical part work.
A few lines are worth holding to. Collect only public results: the headlines, sources, timestamps, and links anyone can see without an account. Check Google's robots.txt and respect its rules, and keep your request volume low enough that you are not straining anyone's servers. Do not reproduce or redistribute full article text, which belongs to the original publishers and is typically copyrighted. And never collect personal or private data.
This guide is deliberately scoped to public feed data, because that is the line that keeps the work defensible. It does not cover anything behind a login, account or profile data, personalized feeds tied to an identifiable user, or any attempt to bypass authentication. If your project needs more than public headlines, the right move is an official data agreement or a licensed news API, not a cleverer scraper.
Key takeaways
- Google News is client-side rendered. A plain fetch returns a near-empty shell, so you must render the page before you parse it.
- The Smart Proxy gives you rendering and a trusted IP. Point your existing HTTP client at the proxy endpoint with your token as the username and it rotates IPs and renders the page in one step.
- Cheerio does the extraction. Map headline, publisher, time, and author to current selectors, and expect those selectors to drift over time.
- Blank fields are normal. Many cards have no author, so handle empty strings rather than treating them as errors.
- Stay on public data. Respect Google's ToS and robots.txt; no full article text, no personal data, no login-walled feeds.
Frequently Asked Questions (FAQs)
Why does a plain fetch return no headlines from Google News?
Because Google News renders its feed client-side with JavaScript. The initial HTML is a shell that fills in only after the page's scripts run in a browser, so a raw HTTP request returns status 200 with the article fields empty. To get real data you have to render the page first, which is what routing through the Smart Proxy handles for you.
What is the Crawlbase Smart Proxy?
The Smart Proxy (also called the Smart AI Proxy) is a drop-in proxy endpoint that rotates through a large pool of residential and datacenter IPs and renders pages behind the scenes. You keep using your normal HTTP client and only change where it connects, passing your access token as the proxy username. Under the hood it forwards to the Crawling API, so you get rendering and IP rotation without managing either yourself.
Should I use the Smart Proxy or the Crawling API for Google News?
Both reach the same engine, so pick by integration style. The Smart Proxy is the smallest change to existing code: set a proxy host, port, and token and your current request works. The Crawling API is a direct API call where you pass options like JavaScript rendering and wait times explicitly, which is handier when you need fine control. This guide uses the proxy because it requires almost no change to a normal axios request.
Why are some author fields empty in the output?
Many Google News cards simply do not carry a byline, so the author selector returns an empty string for those entries. That is expected, not a parsing failure. Handle it downstream by filtering those rows, defaulting the field, or leaving it blank depending on what your job needs.
My Cheerio selectors return empty strings. What changed?
Almost certainly Google's markup. The hashed class names like gPFEn and UwIKyb change without notice, so selectors that worked last month can break. Re-inspect a live results page in your browser's dev tools and update the selectors. Periodic selector maintenance is normal for any production scraper.
Can I use the Smart Proxy to scrape other sites besides Google News?
Yes. The Smart Proxy is a general endpoint, so the same setup works for most public sites: change the target URL and adjust your selectors to the new page. Its IP rotation and rendering help on a wide range of targets, which is the same approach covered in how to scrape websites without getting blocked.
