TechCrunch publishes dozens of stories a day on startups, funding rounds, product launches, and the people moving the technology industry. Each article carries a tidy set of public metadata that a trend tracker, a market-research dashboard, or a newsroom monitor actually wants: the headline, who wrote it, when it went live, which category and tags it sits under, the article URL, and a short excerpt. The catch is that TechCrunch runs on a hardened WordPress stack that flags automated traffic quickly, so a naive scraper gets challenged or blocked long before it has collected anything useful.
This guide shows you how to scrape TechCrunch with Python the reliable way. You build a small, runnable scraper that fetches a listing page through the Crawling API, parses each article card with BeautifulSoup, and prints clean structured output. The whole walkthrough stays scoped to public article metadata, never full article bodies, and the legality section near the end is not boilerplate. Read it before you point this at any real volume.
What you will build
A Python script that takes a public TechCrunch listing URL, retrieves the HTML through the Crawling API, and extracts a structured record for every article on the page. We will use the TechCrunch homepage feed as the running example and pull these fields from each card:
- Headline: the article title as it appears in the listing.
- Article URL: the link to the individual story.
- Author: the byline credited on the card.
- Publish date: the machine-readable timestamp from the datetime attribute.
- Category and tags: the section or topic the article is filed under.
- Excerpt: the short summary shown beneath the headline.
Why a plain request fails on TechCrunch
You can point Python's requests at a TechCrunch URL and sometimes get HTML back, but a real scraping run rarely stays that easy. TechCrunch sits behind an edge layer that watches for scraper-shaped traffic, and two things work against you. First, datacenter IPs and request patterns that do not look like a real browser get rate limited or served a challenge after the first handful of requests, and repeated hits from one address trip that threshold fast. Second, some listing and feed views fill in content with JavaScript, so the raw HTML you fetch can be missing the cards you came for.
So a dependable TechCrunch scraper needs two things in one request: an IP the platform reads as a real visitor, and, where the view is client-rendered, a browser that actually runs the page's scripts. You can assemble that yourself with a pool of rotating IPs plus a headless browser, but stitching those together and keeping them healthy is most of the work. The Crawling API folds both into a single call: you send it the URL, it fetches the page behind a trusted, rotating IP, optionally renders JavaScript, and returns finished HTML for you to parse.
Crawlbase offers two token types. The normal token fetches static HTML; the JavaScript (JS) token renders the page in a real browser first. TechCrunch listings are largely server-rendered WordPress markup, so the normal token is usually enough here. If a particular feed comes back with empty cards, switch to the JS token to render it. You can start with 1,000 free requests, no credit card needed.
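The "normal token first, JS token only if the cards are missing" rule can be sketched as a small wrapper. This is a minimal sketch under stated assumptions: `fetch_normal` and `fetch_js` are hypothetical callables that you would wire to two `CrawlingAPI` clients (one per token), and the `wp-block-post-title` marker is only a guess at what a populated listing contains, so check it against the live page.

```python
from typing import Callable, Optional

# Assumption: a populated listing contains this class name somewhere in its HTML.
# Confirm the marker against the live page before relying on it.
CARD_MARKER = "wp-block-post-title"


def has_cards(html: Optional[str], marker: str = CARD_MARKER) -> bool:
    """Return True when the HTML appears to contain at least one article card."""
    return bool(html) and marker in html


def fetch_with_fallback(
    url: str,
    fetch_normal: Callable[[str], Optional[str]],
    fetch_js: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Try the static fetch first; fall back to the rendering fetch if cards are missing."""
    html = fetch_normal(url)
    if has_cards(html):
        return html
    # The static HTML was empty or lacked cards, so retry with the JS-rendering fetcher.
    return fetch_js(url)
```

Passing the fetchers in as arguments keeps the decision logic separate from the API client, so you can exercise it with stub functions before spending any requests.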
Prerequisites
You need a few things in place before writing any code. None of them take long.
Basic Python. You should be comfortable writing and running a Python script and installing packages with pip. If BeautifulSoup is new to you, our guide to using BeautifulSoup in Python covers the parsing basics this tutorial assumes.
Python 3.8 or later. Confirm your version with python --version. If you do not have it, install it from python.org or through a distribution like Anaconda.
A Crawlbase account and token. Sign up, open your dashboard, and copy your normal token from the account docs page. Treat the token like a password: it authenticates your requests, so keep it out of version control.
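One common way to keep the token out of version control is to read it from an environment variable at startup. The sketch below assumes a variable named `CRAWLBASE_TOKEN`; that name and the `load_token` helper are illustrative choices, not something Crawlbase prescribes.

```python
import os


def load_token(env_var: str = "CRAWLBASE_TOKEN") -> str:
    """Read the API token from the environment instead of hard-coding it."""
    # The variable name is an assumption; use whatever your deployment defines.
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable before running the scraper")
    return token
```

You would then initialize the client with `CrawlingAPI({"token": load_token()})` in place of the literal placeholder string used in the examples.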
Set up the project
Create a virtual environment so project dependencies stay isolated, then install the libraries the scraper needs.
    python --version
    python -m venv techcrunch_env
    source techcrunch_env/bin/activate
    pip install crawlbase beautifulsoup4 pandas
On Windows, activate the environment with techcrunch_env\Scripts\activate instead of the source line. Three dependencies do the work: crawlbase is the official client for the Crawling API, beautifulsoup4 parses the returned HTML so you can pull out individual fields by CSS selector, and pandas makes it easy to write the records out to CSV at the end.
Step 1: Fetch the listing page
Start by getting the page. Import the CrawlingAPI class, initialize it with your token, and request the listing URL. Checking the status before you parse keeps failures loud instead of silent.
    from crawlbase import CrawlingAPI

    api = CrawlingAPI({"token": "YOUR_CRAWLBASE_TOKEN"})


    def crawl(page_url):
        options = {"country": "US"}
        response = api.get(page_url, options)
        if response["status_code"] == 200:
            return response["body"].decode("utf-8")
        print(f"Request failed: {response['status_code']}")
        return None


    if __name__ == "__main__":
        page_url = "https://techcrunch.com"
        html = crawl(page_url)
        print(html[:500] if html else "No HTML returned")
The country option pins the request to a US exit IP, which matters because TechCrunch can serve different content by region. Run the script with python scraper.py and you should see real article markup in the first 500 characters, not a block page or an empty shell. That confirms the fetch works behind a trusted IP before you write a single selector. If the cards come back empty, re-run with a JS token to render the page, as the callout above describes.
TechCrunch challenges datacenter IPs fast, and that status 200 you just checked is only reliable when the request comes from an address the platform trusts. The Crawling API rotates through residential IPs server-side, optionally renders JavaScript, and hands you finished HTML, so you skip running a headless browser fleet and a proxy pool yourself. Point it at a public listing page on the free tier first.
Step 2: Parse the article cards with BeautifulSoup
With HTML in hand, load it into BeautifulSoup and pull each article by its selector. TechCrunch lays its listings out as a repeated block, so you select all the article containers once and then read the same fields from each one. On the WordPress markup, each article sits inside a container with the class wp-block-tc23-post-picker, which is the anchor you loop over. Inspect the live page in your browser's dev tools to confirm the current class names, since that markup drifts over time.
    from bs4 import BeautifulSoup


    def text_of(node, selector):
        # Return the stripped text of the first match, or "" when nothing matches.
        found = node.select_one(selector)
        return found.get_text(strip=True) if found else ""


    def parse_listings(html):
        soup = BeautifulSoup(html, "html.parser")
        cards = soup.select("div.wp-block-tc23-post-picker")
        articles = []
        for card in cards:
            title_el = card.select_one("h2.wp-block-post-title")
            link_el = title_el.select_one("a") if title_el else None
            time_el = card.select_one("time")
            articles.append({
                "headline": title_el.get_text(strip=True) if title_el else "",
                # .get() avoids a KeyError when the tag exists but the attribute does not.
                "url": link_el.get("href", "") if link_el else "",
                "author": text_of(card, "div.wp-block-tc23-author-card-name"),
                "publish_date": time_el.get("datetime", "") if time_el else "",
                "category": text_of(card, "div.wp-block-tc23-post-picker__category a"),
                "excerpt": text_of(card, "p.wp-block-post-excerpt__excerpt"),
            })
        return articles
Two patterns keep this resilient. The text_of helper returns an empty string instead of raising when a selector misses, so one malformed card never crashes the run. And reading the publish date from the datetime attribute of the <time> tag gives you a clean ISO timestamp rather than the human-friendly text the card displays, which is far easier to sort and filter downstream. The category selector targets the small topic link above each headline; a card with no category yields an empty string.
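To make the "easier to sort and filter" point concrete, here is a small sketch that orders parsed records by their ISO timestamp using only the standard library. The helper names are illustrative, and the records are assumed to have the `publish_date` key produced by `parse_listings`.

```python
from datetime import datetime, timezone


def parse_timestamp(value):
    """Parse an ISO 8601 string such as '2024-08-11T09:00:00-07:00'; return None if unusable."""
    try:
        # Note: a trailing "Z" is only accepted by fromisoformat on Python 3.11+.
        parsed = datetime.fromisoformat(value)
    except (TypeError, ValueError):
        return None
    # Treat naive timestamps as UTC so every value can be compared safely.
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def newest_first(articles):
    """Drop records without a usable date and sort the rest newest first."""
    dated = [(parse_timestamp(a.get("publish_date", "")), a) for a in articles]
    dated = [(ts, a) for ts, a in dated if ts is not None]
    return [a for ts, a in sorted(dated, key=lambda pair: pair[0], reverse=True)]
```

Comparing parsed, timezone-aware values rather than raw strings keeps the ordering correct even if cards carry different UTC offsets.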
Step 3: Assemble the full script
Now combine the fetch and the parser into one runnable file, point it at the homepage feed, and write the records to disk. The main function ties the pieces together and saves a CSV with pandas so the output drops straight into a spreadsheet or a notebook.
    import json

    import pandas as pd
    from bs4 import BeautifulSoup
    from crawlbase import CrawlingAPI

    api = CrawlingAPI({"token": "YOUR_CRAWLBASE_TOKEN"})


    def crawl(page_url):
        # Fetch the page through the Crawling API; return None on any non-200 status.
        response = api.get(page_url, {"country": "US"})
        if response["status_code"] == 200:
            return response["body"].decode("utf-8")
        print(f"Request failed: {response['status_code']}")
        return None


    def text_of(node, selector):
        # Return the stripped text of the first match, or "" when nothing matches.
        found = node.select_one(selector)
        return found.get_text(strip=True) if found else ""


    def parse_listings(html):
        soup = BeautifulSoup(html, "html.parser")
        cards = soup.select("div.wp-block-tc23-post-picker")
        articles = []
        for card in cards:
            title_el = card.select_one("h2.wp-block-post-title")
            link_el = title_el.select_one("a") if title_el else None
            time_el = card.select_one("time")
            articles.append({
                "headline": title_el.get_text(strip=True) if title_el else "",
                # .get() avoids a KeyError when the tag exists but the attribute does not.
                "url": link_el.get("href", "") if link_el else "",
                "author": text_of(card, "div.wp-block-tc23-author-card-name"),
                "publish_date": time_el.get("datetime", "") if time_el else "",
                "category": text_of(card, "div.wp-block-tc23-post-picker__category a"),
                "excerpt": text_of(card, "p.wp-block-post-excerpt__excerpt"),
            })
        return articles


    def main():
        page_url = "https://techcrunch.com"
        html = crawl(page_url)
        if not html:
            return
        articles = parse_listings(html)
        print(json.dumps(articles[:3], indent=2))
        pd.DataFrame(articles).to_csv("techcrunch_listing.csv", index=False)
        print(f"Saved {len(articles)} articles")


    if __name__ == "__main__":
        main()
This is the whole scraper. It fetches the homepage feed, parses every card into a record with the six public fields, prints the first three as JSON, and writes the full set to techcrunch_listing.csv. Swap the page_url for any public listing, such as a category or tag feed, and the same parser handles it.
What the output looks like
Run the full script with python scraper.py and you get a clean structured record for each article, ready to write to JSON, CSV, or a database.
    [
      {
        "headline": "Open source tools to boost your productivity",
        "url": "https://techcrunch.com/2024/08/11/a-not-quite-definitive-guide-to-open-source-alternative-software/",
        "author": "Paul Sawers",
        "publish_date": "2024-08-11T09:00:00-07:00",
        "category": "Apps",
        "excerpt": "TechCrunch has pulled together some open-source alternatives to popular productivity apps."
      },
      {
        "headline": "Oyo valuation crashes over 75% in new funding",
        "url": "https://techcrunch.com/2024/08/11/oyo-valuation-crashes-over-75-in-new-funding/",
        "author": "Manish Singh",
        "publish_date": "2024-08-11T06:07:12-07:00",
        "category": "Fintech",
        "excerpt": "The valuation of Oyo, once India's second-most valuable startup at $10 billion, has dipped to $2.4 billion."
      }
    ]
Notice that the excerpt is a short summary, not the full article body. That is deliberate. The listing card exposes a teaser, and the metadata fields around it are exactly the public signals you want for trend tracking, without copying the editorial text itself.
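If a database suits you better than CSV, the same records can go into SQLite with the standard library. This is a sketch rather than part of the original script: the `articles` table name, the `save_articles` helper, and the choice to key rows on the article URL are all assumptions made for illustration.

```python
import sqlite3

# Column order used for both the table definition and the inserted tuples.
FIELDS = ("url", "headline", "author", "publish_date", "category", "excerpt")


def save_articles(conn: sqlite3.Connection, articles) -> int:
    """Upsert records keyed on URL and return the number of rows in the table."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS articles (
               url TEXT PRIMARY KEY,
               headline TEXT, author TEXT, publish_date TEXT,
               category TEXT, excerpt TEXT)"""
    )
    # Skip records with no URL, since the URL is the primary key.
    rows = [tuple(a.get(f, "") for f in FIELDS) for a in articles if a.get("url")]
    # INSERT OR REPLACE keeps one row per URL, so re-running does not duplicate rows.
    conn.executemany("INSERT OR REPLACE INTO articles VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
```

A call such as `save_articles(sqlite3.connect("techcrunch.db"), articles)` would persist one run; keying on the URL means a later run updates existing rows instead of appending copies.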
Looping pages and pacing requests
One listing is a demo; a real job runs across many pages. TechCrunch paginates its feeds with a simple URL pattern, so the homepage is https://techcrunch.com and the next pages are https://techcrunch.com/page/2/, https://techcrunch.com/page/3/, and so on. The shape stays the same: build each page URL, fetch it through the Crawling API, parse it with the same function, and collect the rows. Pacing between requests keeps a long run healthy.
    import time


    def scrape_pages(num_pages=5):
        results = []
        for page in range(1, num_pages + 1):
            url = "https://techcrunch.com" if page == 1 else f"https://techcrunch.com/page/{page}/"
            print(f"Scraping page {page}")
            html = crawl(url)
            if html:
                results.extend(parse_listings(html))
            time.sleep(3)
        return results
The time.sleep call spreads requests out so you are not hammering TechCrunch in a tight loop. Because every page shares the same card structure, the parser you already wrote works across all of them without changes, and you feed the combined list into the same pandas to_csv call from the full script.
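A fixed three-second pause is perfectly regular, and some people prefer to add a little randomness so requests do not land on an exact beat. The helper below is one hedged way to do that; `polite_pause` and its default values are illustrative, not something the Crawling API requires.

```python
import random
import time


def polite_pause(base: float = 3.0, jitter: float = 1.5, sleep=time.sleep) -> float:
    """Sleep for `base` seconds plus a random extra of up to `jitter` seconds."""
    delay = base + random.uniform(0, jitter)
    # `sleep` is injectable so the function can be tested without actually waiting.
    sleep(delay)
    return delay
```

Swapping `time.sleep(3)` for `polite_pause()` in the page loop keeps the average pace similar while varying the gap between requests.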
Staying unblocked
Even with a trusted IP handling the fetch, TechCrunch watches for scraper-shaped traffic. A few habits keep a run healthy, and they apply to any hardened target.
- Pace your requests. Hammering listing pages in a tight loop is the fastest way to get throttled. Spread requests out and vary your targets instead of crawling one feed at full speed.
- Lean on rotation. A pool of residential IPs spreads requests across many real-user addresses so no single one trips a rate limit. The Crawling API handles this for you; if you roll your own stack, this is the part to get right.
- Read the status codes. A run that starts returning challenges or errors is telling you the current rate or IP tier is no longer enough. Treat that as a signal to back off, not noise to ignore.
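The "back off when the status codes turn bad" habit can be sketched as a retry wrapper with exponential delays. It assumes a `fetch` callable that returns HTML on success and `None` on failure, which is how the `crawl` function in this guide behaves; the attempt count and base delay are arbitrary starting points.

```python
import time


def fetch_with_backoff(fetch, url, max_attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url) until it returns HTML, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        html = fetch(url)
        if html:
            return html
        if attempt < max_attempts - 1:
            # Waits grow as base_delay * 1, 2, 4, ... so a struggling run slows itself down.
            sleep(base_delay * (2 ** attempt))
    # Every attempt failed; let the caller decide whether to skip or stop.
    return None
```

Calling `fetch_with_backoff(crawl, url)` in place of `crawl(url)` gives a failing page a few spaced-out retries before the run moves on.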
For the broader playbook, see how to scrape websites without getting blocked and the deeper dive on how to bypass captchas while web scraping. If a particular feed is client-rendered, our guide on scraping JavaScript pages with Python explains why rendering matters. And if you would rather route your own traffic through a rotating pool instead of using the managed API, the Smart AI Proxy (also called the AI Proxy) gives you the same residential IP rotation as a drop-in proxy endpoint.
Is it legal to scrape TechCrunch?
Whether scraping TechCrunch is allowed depends on TechCrunch's terms of service, your jurisdiction, and what you do with the data. TechCrunch's terms place limits on automated access, and its content is copyrighted editorial work, so the legal picture is narrower here than on a public listings site. None of the code in this guide changes that; it just makes the technical part work. Read the TechCrunch Terms of Service and its robots.txt, and treat both as the boundary for what you collect.
The line that keeps this defensible is the difference between metadata and the articles themselves. Collecting public metadata, the headline, author, publish date, category and tags, article URL, and the short excerpt, for research or trend analysis is a far lighter use than copying full article bodies. Do not republish or redistribute the editorial text TechCrunch produces; that is copyrighted media, and reposting it runs straight into both the terms and copyright law. If you need the underlying stories at scale, the correct path is a content license or an official agreement, not a cleverer scraper.
It is also worth knowing that TechCrunch runs on WordPress, which means there is a lighter official route for much of this data. TechCrunch publishes RSS feeds and exposes a WordPress REST API at /wp-json/wp/v2/posts that returns recent posts as structured JSON, including titles, links, dates, and excerpts, without scraping the rendered page at all. Prefer those endpoints when they cover what you need, and respect any rate limits they advertise. This guide stays scoped to public listing pages and metadata; it does not cover anything behind a login, personal data, or full-text redistribution.
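As a rough illustration of that REST route, the sketch below requests recent posts and maps them onto the same metadata fields. `per_page`, `page`, and `_fields` are standard WordPress REST API query parameters, and `title.rendered`, `link`, `date_gmt`, and `excerpt.rendered` are standard post fields. How TechCrunch's endpoint responds to them, and what limits it applies, is not confirmed here, and the User-Agent string is a placeholder.

```python
import html
import json
import re
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://techcrunch.com/wp-json/wp/v2/posts"


def strip_tags(fragment: str) -> str:
    """Turn a rendered HTML fragment into plain text."""
    text = re.sub(r"<[^>]+>", "", fragment or "")
    return html.unescape(text).strip()


def to_record(post: dict) -> dict:
    """Map one WordPress post object to the metadata fields used in this guide."""
    return {
        "headline": strip_tags(post.get("title", {}).get("rendered", "")),
        "url": post.get("link", ""),
        "publish_date": post.get("date_gmt", ""),  # UTC, without an offset suffix
        "excerpt": strip_tags(post.get("excerpt", {}).get("rendered", "")),
    }


def fetch_recent_posts(per_page: int = 10, page: int = 1) -> list:
    """Request one page of recent posts; this makes a network call when invoked."""
    query = urlencode({"per_page": per_page, "page": page,
                       "_fields": "title,link,date_gmt,excerpt"})
    # Placeholder User-Agent: identify your own script honestly.
    request = Request(f"{API_URL}?{query}", headers={"User-Agent": "metadata-research-script"})
    with urlopen(request, timeout=30) as response:
        return [to_record(post) for post in json.load(response)]
```

Author names and categories come back as numeric IDs in the default post object, so resolving them would take extra requests that this sketch leaves out.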
Key takeaways
- TechCrunch blocks scraper-shaped traffic. A plain request gets rate limited or challenged fast, so you fetch behind a trusted, rotating IP instead.
- The Crawling API handles the hard part. One call fetches the page behind a residential IP, renders JavaScript when a feed needs it, and returns finished HTML to parse.
- BeautifulSoup does the extraction. Select every wp-block-tc23-post-picker card, then read the headline, URL, author, publish date, category, and excerpt from each, and expect the selectors to drift.
- Read the date from the attribute. The datetime attribute on the <time> tag gives a clean ISO timestamp that sorts and filters far better than the display text.
- Stay on public metadata. Respect the ToS and robots.txt, prefer TechCrunch's RSS feeds and WordPress REST API, and never republish full article bodies.
Frequently Asked Questions (FAQs)
Why does a plain request get blocked on TechCrunch?
TechCrunch sits behind an edge layer that flags automated traffic. Datacenter IPs and request patterns that do not look like a real browser get rate limited or served a challenge after a few requests, so a raw requests loop stops working quickly. Fetching through the Crawling API routes the request through a residential IP the platform reads as a real visitor, which is what keeps the run going.
Do I need the normal token or the JS token for TechCrunch?
Usually the normal token. TechCrunch listings are largely server-rendered WordPress markup, so the static HTML the normal token returns already contains the article cards. If a particular feed comes back with empty cards, switch to the JS token, which renders the page in a real browser before handing back the HTML.
What fields can I extract from a TechCrunch listing?
The public metadata on each card: the headline, the article URL, the author byline, the publish date from the <time> tag's datetime attribute, the category or tag the article is filed under, and the short excerpt shown beneath the headline. This guide stays scoped to that metadata and does not extract full article bodies, which are copyrighted.
Is there an official API instead of scraping?
Yes. TechCrunch runs on WordPress, so it publishes RSS feeds and exposes a WordPress REST API at /wp-json/wp/v2/posts that returns recent posts as structured JSON with titles, links, dates, and excerpts. Prefer those endpoints when they cover what you need, since they are the lighter, official route and require no rendering.
My selectors return empty values for every card. What changed?
Almost certainly TechCrunch's markup. WordPress block class names like wp-block-tc23-post-picker change without notice, so selectors that worked last month can break. Re-inspect a live article in your browser's dev tools and update the selectors. Periodic selector maintenance is normal for any production scraper.
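A cheap way to notice drift early is to measure how many records actually have a value for each field after a run. The check below is a sketch: the helper names are hypothetical and the 0.5 threshold is an arbitrary starting point rather than a recommended value.

```python
DEFAULT_FIELDS = ("headline", "url", "author", "publish_date", "category", "excerpt")


def field_fill_rates(articles, fields=DEFAULT_FIELDS):
    """Return the share of records with a non-empty value for each field."""
    total = len(articles)
    if total == 0:
        # No cards at all: report every field as empty so the run gets flagged.
        return {field: 0.0 for field in fields}
    return {
        field: sum(1 for a in articles if a.get(field)) / total
        for field in fields
    }


def drifted_fields(articles, threshold=0.5, fields=DEFAULT_FIELDS):
    """List fields whose fill rate fell below the threshold, a hint that a selector broke."""
    rates = field_fill_rates(articles, fields)
    return [field for field, rate in rates.items() if rate < threshold]
```

Logging the result of `drifted_fields(articles)` after each run turns a silent selector break into a visible warning you can act on.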
How do I avoid getting blocked while scraping TechCrunch?
Keep your per-IP request rate low, vary your targets instead of looping one feed, and route through rotating residential IPs so no single address trips a rate limit. The Crawling API manages rotation and a trusted IP pool for you; if you build your own stack, that is the part to invest in. Watch the status codes and back off when you start seeing challenges.
