Healthline is one of the most-visited health and wellness publishers on the web, with a deep archive of medically reviewed articles spanning nutrition, fitness, conditions, and mental health. Each public article carries a layer of structured metadata that is genuinely useful on its own: a headline, an author byline, a publish or updated date, a category, and a short summary. That metadata powers content research, trend analysis across health topics, and structured catalogs of what a major publisher is covering, all without touching the article prose itself.
This guide shows you how to scrape Healthline for that public article metadata using Python, then export the results to CSV. You will build a small, runnable scraper that fetches a rendered search or listing page through the Crawling API, parses each result with BeautifulSoup, and writes clean rows to disk. The whole walkthrough stays scoped to public metadata. Healthline article bodies are copyrighted editorial content, so we collect structure and summaries for research, never the full text for republishing.
What you will build
A Python script that takes a public Healthline listing or search URL, retrieves the rendered HTML through the Crawling API, and extracts a structured record for every article on the page. We will use a topic search as the running example and pull these public metadata fields per article:
- Article title: the headline of the piece, for example "Antacids Associated with Higher Risk of Migraine".
- URL: the canonical link to the public article page.
- Author: the byline, when Healthline exposes it on the page.
- Publish or updated date: the date the article went up or was last reviewed.
- Category: the section the article sits under, such as health news or nutrition.
- Summary: the short description or standfirst Healthline shows under the headline.
Why a plain request fails on Healthline
If you request a Healthline listing or search URL with a bare HTTP client, you get a response with status 200 and almost none of the article data in the body. Healthline renders its result cards in the browser with JavaScript, so the initial HTML is a shell that only fills in after the page's scripts run. A search results page in particular is assembled client-side from a data feed, which means the titles, links, and summaries you came for are not in the raw markup a plain requests call returns.
So a working Healthline scraper needs two things in one request: a browser that actually renders the page, and an IP the site reads as a real visitor. You can assemble that yourself with a headless browser and a pool of rotating residential proxies, but stitching those together and keeping them healthy is most of the work. The Crawling API folds both into a single call: you send it the URL with a JavaScript token, it renders the page behind a trusted IP, and it returns finished HTML for you to parse.
Crawlbase offers two token types. The normal token fetches static HTML; the JavaScript (JS) token renders the page in a real browser first. Healthline is client-side rendered, so you need the JS token here. Using the normal token returns the same empty shell a plain fetch would, and there is nothing to parse out of it.
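One quick way to see the empty shell for yourself is to count the links in the raw markup. The sketch below is an illustration, not part of the scraper: LinkCounter and count_links are hypothetical helpers, the two HTML strings are made-up stand-ins for a shell and a rendered page, and it uses only the standard library's html.parser so it runs before anything is installed.

```python
from html.parser import HTMLParser


class LinkCounter(HTMLParser):
    """Counts <a> tags that carry a non-empty href attribute."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a" and any(name == "href" and value for name, value in attrs):
            self.count += 1


def count_links(html):
    """Return how many <a href=...> tags appear in the given HTML string."""
    parser = LinkCounter()
    parser.feed(html)
    parser.close()
    return parser.count


# Made-up samples: a JavaScript shell versus a page with rendered cards.
shell_html = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'
rendered_html = (
    '<html><body><div id="root">'
    '<a href="/health-news/example-one">Example one</a>'
    '<a href="/nutrition/example-two">Example two</a>'
    "</div></body></html>"
)

print(count_links(shell_html))
print(count_links(rendered_html))
```

If a response you fetched yourself reports close to zero links, you are most likely looking at the unrendered shell rather than a parsing bug.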
Prerequisites
You need a few things in place before writing any code. None of them take long.
Basic Python. You should be comfortable writing and running a Python script and installing packages with pip. If you are new to parsing HTML, our primer on how to use BeautifulSoup in Python covers the selector basics this tutorial leans on.
Python 3.8 or later. Confirm your version with python --version. If you do not have it, install it from python.org or through a distribution like Anaconda.
A Crawlbase account and JS token. Sign up, open your dashboard, and copy your JavaScript (JS) token from the account docs page. Treat the token like a password: it authenticates your requests, so keep it out of version control. The free tier includes 1,000 requests, enough to follow this guide end to end.
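One common way to keep the token out of version control is to read it from an environment variable. This is a minimal sketch under assumptions: load_token is a hypothetical helper, and the variable name CRAWLBASE_JS_TOKEN is our own choice, not something Crawlbase prescribes.

```python
import os


def load_token(env=None, name="CRAWLBASE_JS_TOKEN"):
    """Read the JS token from an environment mapping, failing loudly if it is missing."""
    env = os.environ if env is None else env
    token = env.get(name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {name} environment variable before running the scraper.")
    return token


# Demonstration with an in-memory mapping instead of the real environment.
print(load_token({"CRAWLBASE_JS_TOKEN": "example-token"}))
```

In the scraper you could then pass load_token() wherever the examples in this guide show the "YOUR_CRAWLBASE_JS_TOKEN" placeholder.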
Set up the project
Create a project folder and a virtual environment so dependencies stay isolated, then install the three libraries the scraper needs.
mkdir healthline_scraper
cd healthline_scraper
python -m venv healthline_env
source healthline_env/bin/activate
pip install crawlbase beautifulsoup4 pandas
On Windows, activate the environment with healthline_env\Scripts\activate instead of the source line. Three dependencies do the work: crawlbase is the official client for the Crawling API, beautifulsoup4 parses the returned HTML so you can pull out individual fields by CSS selector, and pandas structures the records and writes them to CSV.
Step 1: Fetch the rendered listing page
Start by getting the finished page. Import the CrawlingAPI class, initialize it with your JS token, and request the listing URL. Checking the status code before you parse keeps failures loud instead of silent. Note the two wait options: ajax_wait tells the API to wait for asynchronous content to finish loading, and page_wait holds for a fixed number of milliseconds so late-rendering cards appear before the page is captured.
from crawlbase import CrawlingAPI

# Initialize the client with your JavaScript (JS) token.
api = CrawlingAPI({"token": "YOUR_CRAWLBASE_JS_TOKEN"})


def crawl(page_url):
    # Wait for async content, then hold 5 seconds so late-rendering cards appear.
    options = {"ajax_wait": "true", "page_wait": 5000}
    response = api.get(page_url, options)
    if response["status_code"] == 200:
        return response["body"].decode("utf-8")
    print(f"Request failed: {response['status_code']}")
    return None


if __name__ == "__main__":
    page_url = "https://www.healthline.com/search?q1=migraine"
    html = crawl(page_url)
    # Print a short preview to confirm the cards rendered.
    print(html[:500] if html else "No HTML returned")
Save the code as scraper.py, run it with python scraper.py, and you should see real result-card markup, not the empty shell a plain fetch returns. That confirms rendering works before you write a single selector. Five seconds is a reasonable starting page_wait; raise it if the cards come back empty.
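If you would rather not tune page_wait by hand, one option is to retry with progressively longer waits until the cards show up. The sketch below is an assumption-laden illustration: fetch_until_cards, the has_cards check, and the wait values are all hypothetical, and the fetcher is injected so the example runs without network access.

```python
def fetch_until_cards(fetch, page_url, has_cards, waits=(5000, 8000, 12000)):
    """Try increasing page_wait values until has_cards(html) is true.

    fetch(page_url, page_wait) must return HTML text or None.
    Returns (html, page_wait) for the first attempt with cards, else (None, None).
    """
    for page_wait in waits:
        html = fetch(page_url, page_wait)
        if html and has_cards(html):
            return html, page_wait
    return None, None


# Stand-in fetcher for demonstration: cards only "render" at 8000 ms or more.
def fake_fetch(page_url, page_wait):
    return "<a class='card'>Example</a>" if page_wait >= 8000 else "<div id='root'></div>"


html, used_wait = fetch_until_cards(
    fake_fetch, "https://example.com/search", lambda text: "card" in text
)
print(used_wait)
```

A real fetcher would wrap api.get and pass the page_wait value in its options. Keep in mind that every retry is another request against your quota.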
Healthline needs a fully rendered page behind a trusted IP, in one call, which is exactly what you just confirmed in Step 1. The Crawling API takes a JS token, runs the page in a real browser so the client-side article cards appear, and rotates through residential IPs server-side, so you skip running a headless browser fleet and a proxy pool yourself. Point it at a public search page on the free tier first.
Step 2: Inspect the result-card structure
Before writing selectors, open a Healthline search or listing page in your browser and inspect a result card with the dev tools. Healthline ships hashed, build-generated class names, so the exact strings change over time. At the time of writing, each search result links out through an <a> with a class like css-17zb9f8, and the short description sits in a sibling <div class="css-1evntxy">. The fields you want map roughly like this:
- Article title and URL live on the result link: the link text is the title, the href is the public article URL.
- Summary sits in the description block, the short standfirst Healthline shows under each result.
- Category can be read from the URL path, for example /health-news/ or /nutrition/, which Healthline uses as its section prefix.
Author and date are not always present on the listing card; they live on the article page itself, which Step 4 covers. Because Healthline's class names are hashed and can change on any deploy, treat them as a starting template rather than a contract, and re-inspect a live page when a field comes back empty.
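Since hashed class names are fragile, a possible fallback is to recognize article links by the shape of their URL instead. The sketch below is a heuristic under assumptions, not Healthline's documented structure: is_article_href is a hypothetical helper, and it assumes article paths look like a section prefix followed by at least one more segment.

```python
from urllib.parse import urlparse

# Assumption: article URLs sit on this host under a section prefix, e.g. /health-news/<slug>.
ARTICLE_HOST = "www.healthline.com"


def is_article_href(href):
    """Heuristic: same-site link whose path has a section and at least one more segment."""
    parsed = urlparse(href)
    # Reject links that point at another host; relative links have no netloc.
    if parsed.netloc and parsed.netloc != ARTICLE_HOST:
        return False
    segments = [part for part in parsed.path.split("/") if part]
    return len(segments) >= 2 and segments[0] != "search"


print(is_article_href("https://www.healthline.com/health-news/antacids-increase-migraine-risk"))
print(is_article_href("https://www.healthline.com/search?q1=migraine"))
```

With BeautifulSoup you could iterate soup.find_all("a", href=True) and keep the links that pass this check. It will also match navigation links that happen to have two path segments, so treat it as a backstop for when the class-based selector returns nothing, not a replacement.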
Step 3: Parse the listing and export to CSV
With rendered HTML in hand, load it into BeautifulSoup, select every result link, and pull the title, URL, summary, and category off each one. Wrap the field reads so a missing element returns an empty string instead of crashing the run, then hand the records to pandas to write a CSV.
from bs4 import BeautifulSoup
import pandas as pd


def category_from_url(url):
    # For an absolute URL, index 3 is the first path segment (the section).
    parts = url.split("/")
    return parts[3] if len(parts) > 3 else ""


def parse_listing(html):
    soup = BeautifulSoup(html, "html.parser")
    articles = []
    # Hashed class names: re-inspect a live page if this selector stops matching.
    for link in soup.select('a.css-17zb9f8'):
        url = link.get("href", "")
        if not url:
            continue
        # The description block follows the result link in the markup.
        summary_el = link.find_next("div", class_="css-1evntxy")
        articles.append({
            "title": link.get_text(strip=True),
            "url": url,
            "category": category_from_url(url),
            "summary": summary_el.get_text(strip=True) if summary_el else "",
        })
    return articles


def save_to_csv(data, filename):
    df = pd.DataFrame(data)
    df.to_csv(filename, index=False)
    print(f"Saved {len(data)} rows to {filename}")
The result link is read once for both the title and the href, the category is derived from the URL path so no separate selector is needed, and the summary flows through a guarded read that returns an empty string when the description block is absent. The save_to_csv helper turns the records into a pandas DataFrame and writes it with to_csv, the export the rest of this guide builds on.
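Note that category_from_url indexes into url.split("/"), which only lands on the section when the href is an absolute URL. If a live page turns out to use relative hrefs (worth checking in dev tools; we have not assumed either way), a sketch like the following normalizes them first. The names absolute_url, category_from_path, and BASE_URL are hypothetical additions.

```python
from urllib.parse import urljoin, urlparse

BASE_URL = "https://www.healthline.com"


def absolute_url(href, base=BASE_URL):
    """Resolve a possibly relative href against the site root."""
    return urljoin(base, href)


def category_from_path(url):
    """Return the first path segment of a URL, or an empty string if there is none."""
    segments = [part for part in urlparse(url).path.split("/") if part]
    return segments[0] if segments else ""


print(absolute_url("/health-news/antacids-increase-migraine-risk"))
print(category_from_path("/health-news/antacids-increase-migraine-risk"))
```

Because the category is read from the parsed path rather than a fixed split index, it gives the same answer for relative and absolute links, and the absolute form is what a later per-article fetch needs.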
Healthline's markup changes without notice, and the hashed class names above can be renamed on any deploy. When a field comes back empty across every card, re-inspect a live page in your browser's dev tools and update the selector. Periodic selector maintenance is normal for any production scraper, not a sign something is broken.
Step 4: Enrich with author, date, and category from the article page
The listing gives you the title, URL, and summary. To fill in the author and publish or updated date, fetch each article page and read its public metadata. Healthline exposes the headline in an <h1>, the byline in a block carrying a data-testid="byline" attribute, and the date in a <time> element whose datetime attribute holds a machine-readable timestamp. We read only this metadata, never the article body.
def text_or_empty(soup, selector):
    # Guarded read: a missing element yields "" instead of raising.
    el = soup.select_one(selector)
    return el.get_text(strip=True) if el else ""


def parse_article_meta(html, url):
    soup = BeautifulSoup(html, "html.parser")
    time_el = soup.find("time")
    return {
        "title": text_or_empty(soup, "h1"),
        "url": url,
        "author": text_or_empty(soup, '[data-testid="byline"]'),
        # Prefer the machine-readable datetime attribute over visible text.
        "date": time_el.get("datetime", "") if time_el else "",
        "category": category_from_url(url),
    }
Each field has a fallback so a missing byline or date yields an empty string rather than an exception. The <time> element's datetime attribute is preferable to the visible date text because it is already in a consistent ISO format, which makes sorting and filtering trivial once the data is in CSV. Note what this function deliberately does not collect: the article's paragraphs. We stay on title, author, date, category, and the listing summary.
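If the datetime attribute holds a full timestamp rather than a bare date, you may want to trim it to the day before writing the CSV. The helper below is a hypothetical sketch: iso_date is our own name, and the sample values in the demonstration are made up rather than taken from a Healthline page.

```python
from datetime import datetime


def iso_date(value):
    """Return the YYYY-MM-DD part of an ISO 8601 timestamp, or "" if it cannot be parsed."""
    if not value:
        return ""
    text = value.strip()
    # datetime.fromisoformat only accepts a trailing "Z" from Python 3.11 on.
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    try:
        return datetime.fromisoformat(text).date().isoformat()
    except ValueError:
        return ""


print(iso_date("2024-01-09T14:30:00Z"))
print(iso_date("January 9, 2024"))
```

Returning an empty string for unparseable input keeps the same fail-soft behavior as the other field reads, so one odd date does not stop the run.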
Step 5: Put it together
Now wire the listing scrape, the per-article enrichment, and the CSV export into one runnable script. Fetch the listing, parse it for URLs, fetch each article page for its metadata, and write everything to a single CSV. A short pause between requests keeps the run polite.
import time

from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup
import pandas as pd

api = CrawlingAPI({"token": "YOUR_CRAWLBASE_JS_TOKEN"})
OPTIONS = {"ajax_wait": "true", "page_wait": 5000}


def crawl(page_url):
    response = api.get(page_url, OPTIONS)
    if response["status_code"] == 200:
        return response["body"].decode("utf-8")
    print(f"Request failed: {response['status_code']}")
    return None


def main():
    listing_url = "https://www.healthline.com/search?q1=migraine"
    listing_html = crawl(listing_url)
    if not listing_html:
        return

    listing = parse_listing(listing_html)
    records = []
    for item in listing:
        # One extra request per article to read author and date.
        article_html = crawl(item["url"])
        if article_html:
            meta = parse_article_meta(article_html, item["url"])
            # Carry the listing summary over so every row is complete.
            meta["summary"] = item["summary"]
            records.append(meta)
        time.sleep(2)  # Short pause keeps the run polite.

    save_to_csv(records, "healthline_articles.csv")


if __name__ == "__main__":
    main()
This assumes parse_listing, parse_article_meta, category_from_url, and save_to_csv from the earlier steps are in the same file. The flow is straightforward: one request for the listing, then one request per article to enrich it with author and date, then a single CSV write at the end. The summary from the listing card is merged onto each article's metadata so every row is complete.
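Because each article costs one request on top of the listing, it can help to drop repeated URLs and cap the batch before the enrichment loop starts. The following is a minimal sketch: plan_article_fetches and the default cap of 25 are hypothetical choices, not part of the script above.

```python
def plan_article_fetches(listing, max_articles=25):
    """Drop repeated or empty URLs (keeping first occurrences) and cap the batch size."""
    seen = set()
    planned = []
    for item in listing:
        if len(planned) >= max_articles:
            break
        url = item.get("url", "")
        if not url or url in seen:
            continue
        seen.add(url)
        planned.append(item)
    return planned


# Made-up listing rows with one duplicate and one empty URL.
sample_listing = [
    {"url": "https://www.healthline.com/health-news/a", "summary": "A"},
    {"url": "https://www.healthline.com/health/b", "summary": "B"},
    {"url": "https://www.healthline.com/health-news/a", "summary": "A again"},
    {"url": "", "summary": "no link"},
]
print(len(plan_article_fetches(sample_listing)))
```

In main you could loop over plan_article_fetches(listing) instead of listing, which makes the total request count for a run predictable: one for the listing plus at most max_articles.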
What the output looks like
Run the full script with python scraper.py and you get a CSV with one row per article, each holding only public metadata, ready for analysis in pandas or a spreadsheet.
title,url,author,date,category,summary
"Antacids Associated with Higher Risk of Migraine",https://www.healthline.com/health-news/antacids-increase-migraine-risk,"Nancy Schimelpfening",2024-01-09,health-news,"New research suggests people who take antacids may be at greater risk for migraine attacks."
"Migraine: What to Ask Your Doctor",https://www.healthline.com/health/migraine/what-to-ask-doctor-migraine,"Healthline Editorial Team",2023-11-02,health,"A short list of questions to bring to your next appointment."
Because the date comes from the <time> element's datetime attribute, it lands in ISO format, so you can sort by recency or filter a date range without parsing free text. The category column, derived from each URL path, lets you group counts by section to see where a publisher is concentrating coverage.
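To make the sorting and grouping concrete, here is a small sketch over made-up rows in the same column layout. It uses only the standard library so it runs anywhere; with pandas the equivalent steps would be df.sort_values("date") and df["category"].value_counts(). The sample titles, authors, and URLs are invented for illustration.

```python
import csv
import io
from collections import Counter

# Three made-up rows in the same column layout as healthline_articles.csv.
sample_csv = """title,url,author,date,category,summary
"Example A",https://www.healthline.com/health-news/example-a,"Author One",2024-01-09,health-news,"Summary A"
"Example B",https://www.healthline.com/health/example-b,"Author Two",2023-11-02,health,"Summary B"
"Example C",https://www.healthline.com/health-news/example-c,"Author One",2023-12-15,health-news,"Summary C"
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# ISO dates sort correctly as plain strings, newest first.
newest_first = sorted(rows, key=lambda row: row["date"], reverse=True)

# Count how many articles fall under each section.
per_category = Counter(row["category"] for row in rows)

print([row["title"] for row in newest_first])
print(per_category.most_common())
```

The string sort only works because the dates share one fixed-width ISO format; mixed formats would need parsing first.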
Scaling across topics and pages
One listing is a demo; a real job runs across many topics and result pages. Healthline's search accepts a query parameter, so you can loop over a list of topics and reuse the same fetch-and-parse pair for each. Because every search result shares the same card structure, the parser you already wrote works across all of them without changes. Append the rows from each topic into one list and write a single CSV at the end.
# Assumes crawl, parse_listing, save_to_csv, and `import time` from the full script.
def scrape_topics(topics):
    all_articles = []
    for topic in topics:
        url = f"https://www.healthline.com/search?q1={topic}"
        html = crawl(url)
        if html:
            all_articles.extend(parse_listing(html))
        time.sleep(2)
    return all_articles


topics = ["migraine", "nutrition", "sleep"]
save_to_csv(scrape_topics(topics), "healthline_topics.csv")
The time.sleep(2) between topics is deliberate. Hammering search in a tight loop is the fastest way to get throttled, even with rendering and rotation handled for you. Spread requests out, and stop early once a topic returns no new articles.
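Two small helpers can make that loop more robust: one that encodes multi-word topics safely into the q1 parameter, and one that filters out articles already collected so you can tell when a topic has stopped yielding anything new. Both search_url and keep_new are hypothetical names for this sketch, and the sample articles are made up.

```python
from urllib.parse import urlencode

SEARCH_BASE = "https://www.healthline.com/search"


def search_url(topic):
    """Build a search URL with the topic safely encoded in the q1 parameter."""
    return f"{SEARCH_BASE}?{urlencode({'q1': topic})}"


def keep_new(articles, seen_urls):
    """Return only articles whose URL is not in seen_urls, and record them as seen."""
    fresh = []
    for article in articles:
        url = article.get("url", "")
        if url and url not in seen_urls:
            seen_urls.add(url)
            fresh.append(article)
    return fresh


seen = set()
first_batch = keep_new([{"url": "/health/a"}, {"url": "/health/b"}], seen)
second_batch = keep_new([{"url": "/health/b"}], seen)

print(search_url("vitamin d & sleep"))
print(len(first_batch), len(second_batch))
```

An empty list from keep_new is the signal to stop early for that topic, and sharing one seen set across topics also keeps duplicate rows out of the final CSV.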
Staying unblocked
Even with rendering handled, Healthline watches for scraper-shaped traffic. A few habits keep a run healthy, and they apply to any large publisher.
- Pace your requests. Spread requests out and vary your topics instead of crawling one search path at full speed.
- Lean on rotation. A pool of residential IPs spreads requests across many real-user addresses so no single one trips a rate limit. The Crawling API handles this for you; if you roll your own stack, this is the part to get right.
- Read the status codes. A run that starts returning challenges or errors is telling you the current rate or IP tier is no longer enough. Treat that as signal to back off, not noise to ignore.
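Backing off can be as simple as waiting longer after each failed attempt. The sketch below shows one common pattern, exponential backoff with a cap. The function names and default numbers are hypothetical, and the fetcher and sleep function are injected so the logic can be exercised without network calls or real waiting.

```python
import time


def backoff_delays(base=2.0, factor=2.0, retries=4, cap=60.0):
    """Return the wait in seconds before each retry: base, base*factor, ... capped at cap."""
    return [min(cap, base * factor ** attempt) for attempt in range(retries)]


def fetch_with_backoff(fetch, url, delays, sleep=time.sleep):
    """Call fetch(url) until it returns HTML, sleeping through delays between failed attempts."""
    html = fetch(url)
    for delay in delays:
        if html:
            break
        sleep(delay)
        html = fetch(url)
    return html


print(backoff_delays())
```

Any function that returns None on failure, such as the crawl helper from Step 1, fits the fetch argument. If every attempt fails, the function returns None and the caller decides whether to skip the URL or stop the run.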
For the broader playbook, see how to scrape websites without getting blocked and the deeper dive on how to bypass captchas while web scraping. If you would rather route your own traffic through a rotating pool instead of using the managed API, the Smart AI Proxy (also called the AI Proxy) gives you the same residential IP rotation as a drop-in proxy endpoint.
Is it legal to scrape Healthline?
Whether scraping Healthline is allowed depends on Healthline's terms of service, your jurisdiction, and what you do with the data. Healthline's terms restrict automated access and its content is copyrighted editorial work, so scraping can run against those terms regardless of how careful your tooling is. None of the code here changes that; it just makes the technical part work. Read the Healthline Terms of Use and its robots.txt, and treat both as the boundary for what you collect.
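Python's standard library can evaluate robots.txt rules for you. The sketch below parses a made-up rule set to show the mechanics; the rules and example.com URLs are invented for illustration and are not Healthline's actual robots.txt, and build_parser is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration only; they are NOT Healthline's actual robots.txt.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_text):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rules = build_parser(SAMPLE_ROBOTS)
print(rules.can_fetch("*", "https://example.com/articles/one"))
print(rules.can_fetch("*", "https://example.com/private/page"))
```

Against a live site, RobotFileParser can also download the file itself through its set_url and read methods. A passing can_fetch check covers robots.txt only; it says nothing about the terms of use, which you still need to read.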
A few lines worth holding to. Collect only public metadata: the article title, URL, author byline, publish or updated date, category, and the short summary that anyone can see on a public page without signing in. Do not scrape and republish full article bodies. Healthline's articles are copyrighted medical and editorial content, and reproducing them is a copyright problem, not just a terms problem. Respect Healthline's stated rate expectations and keep your request volume low enough that you are not straining its servers.
One more point specific to a health publisher: this guide is for cataloging and research on public metadata, not for sourcing medical guidance. Health information changes, and an article's accuracy depends on context a scraped row cannot capture. Do not rely on scraped health content for medical decisions, and consult a qualified medical or legal professional before acting on or redistributing any of it. Healthline does not offer a public API for bulk article access, so if your project needs the full content or large-scale redistribution, the correct path is to request permission or a licensing agreement, not a cleverer scraper.
Key takeaways
- Healthline is client-side rendered. A plain fetch returns an empty shell, so you must render the page before you parse it.
- Rendering and a trusted IP go together. The Crawling API with a JS token does both in one call, using ajax_wait and page_wait so the article cards finish loading before capture.
- BeautifulSoup plus pandas does the work. Map title, URL, author, date, category, and summary to the page's hooks, then export the rows straight to CSV.
- Scale by looping topics. Walk a list of search queries with the same parser, and pace the loop so you are not throttled.
- Stay on public metadata. Respect Healthline's ToS and robots.txt, never republish copyrighted article bodies, and consult a professional before relying on health content.
Frequently Asked Questions (FAQs)
Can I scrape Healthline with just requests and BeautifulSoup?
Not reliably. Healthline renders its result cards in the browser with JavaScript, so a raw requests call returns status 200 with the listings blank. You need something that renders the page first, which is what the Crawling API's JS token plus the ajax_wait and page_wait options handle before BeautifulSoup ever sees the HTML.
Do I need the normal token or the JS token for Healthline?
The JS token. The normal token fetches static HTML, which on Healthline is the same empty shell a plain fetch returns. The JS token renders the page in a real browser before handing back the HTML, so the article cards are present when BeautifulSoup parses them.
What data should I collect from Healthline?
Stick to public metadata: the article title, URL, author, publish or updated date, category, and the short summary shown on public pages. Do not scrape and republish full article bodies. Healthline's articles are copyrighted editorial content, so the safe and defensible scope is structure and summaries for research, not the prose itself.
How do I export the scraped data to CSV?
Build a list of dictionaries, one per article, then pass it to pandas with pd.DataFrame(data).to_csv("healthline_articles.csv", index=False). The save_to_csv helper in this guide wraps exactly that. Because each row holds only flat metadata fields, the CSV opens cleanly in any spreadsheet or loads back into pandas for analysis.
My selectors return empty strings. What changed?
Almost certainly Healthline's markup. The hashed class names the parser relies on can change on any deploy, and a redesign can rename the byline or date hooks. Re-inspect a live page in your browser's dev tools and update the selectors. Periodic selector maintenance is normal for any production scraper.
Is the scraped health information safe to rely on?
No. Treat scraped rows as metadata for cataloging and research, not as medical guidance. Health information changes and depends on context a single field cannot capture, so consult a qualified medical or legal professional before acting on or redistributing any of it. For full content or large-scale redistribution, request permission or a license from Healthline rather than scraping the bodies.
