How to Scrape YouTube Data for Content and SEO Research

YouTube is the second most visited site on the web, and the data it shows publicly is a research goldmine for anyone working on content and SEO. The titles that rank for a search query, the channels that own a topic, the view counts that signal demand, the publish dates that show how fresh the winning results are: all of it is visible on a public results page, and all of it tells you what an audience actually clicks. This guide shows you how to scrape public YouTube data with Python and turn it into keyword and content research you can act on.

To be clear up front: everything here stays scoped to public data from public search and video pages. You will collect video titles, channel names, view counts, publish dates, and video links. You will not touch anything behind a login, individual user comments, private playlists, or the personal data of viewers. If your project needs sanctioned, structured access at scale, the official YouTube Data API is the correct tool, and the legality section near the end explains why. This article complements our deeper walkthrough on the YouTube channel scraper; here the focus is the search-and-video layer for optimization research.

What you will build

A small Python script that takes a public YouTube search query or video URL, fetches the fully rendered page through the Crawlbase Crawling API with a JavaScript token, and parses a handful of public, non-personal fields for each result:

Title: the video title, which tells you how a topic is framed to rank and earn clicks.
Channel: the channel name that published the video, useful for spotting who owns a topic.
Views: the public view count, a rough demand signal for a keyword or theme.
Publish date: the relative or absolute upload date, which shows how fresh the ranking results are.
Link: the canonical watch URL for each video, so you can revisit or enrich it later.

Notice what is deliberately absent: no commenter handles, no subscriber-only content, no personal data of viewers. You are gathering aggregate signals about content, not profiles of people.

Why a plain request fails on YouTube

Request a YouTube search URL with a bare HTTP client and you will get a response that is technically a success and practically empty. YouTube renders its results client-side: the initial HTML is a thin shell, and the real list of videos only appears after the page's JavaScript runs in a browser and hydrates the results. On top of that, YouTube watches for automated traffic. Datacenter IP ranges, missing browser behavior, and repetitive request patterns get challenged or rate-limited well before the interesting content loads.

So a working YouTube data scraper needs two things in the same request: a real browser that renders the page, and an IP the platform reads as an ordinary visitor. You can assemble that yourself with a headless browser and a pool of rotating residential proxies, but keeping that stack healthy is most of the work. The Crawling API folds both into one call: you send a URL with a JavaScript token, it renders the page behind a trusted residential IP, and it returns finished HTML you can parse. For the background on why rendering matters, see our guide on how to crawl JavaScript websites.

Why the JS token

Crawlbase offers two token types. The normal token fetches static HTML; the JavaScript (JS) token renders the page in a real browser first. YouTube search and video pages are client-side rendered, so you need the JS token here. The normal token returns the same shell a plain fetch would, with nothing useful to parse.

Prerequisites

A few things to have in place first. None take long.

Basic Python. You should be comfortable running a script and installing packages with pip. If you are new to parsing HTML, our primer on how to use BeautifulSoup in Python covers the extraction side.

Python 3.8 or later. Confirm with python --version. If you do not have it, install it from python.org.

A Crawlbase account and JS token. Sign up, open your dashboard, and copy your JavaScript (JS) token. The free tier includes up to 20,000 requests and you pay only for successful ones, which is plenty for the research runs in this guide. Treat the token like a password and keep it out of version control.
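
One way to keep the token out of version control is to read it from an environment variable when the script starts. The sketch below assumes a variable named CRAWLBASE_JS_TOKEN; that name is chosen for this example and is not something the Crawlbase client requires.

```python
import os

# CRAWLBASE_JS_TOKEN is a hypothetical variable name picked for this example.
TOKEN_VAR = "CRAWLBASE_JS_TOKEN"


def load_token(env=os.environ):
    """Return the JS token from the environment, or fail with a clear message."""
    token = env.get(TOKEN_VAR, "").strip()
    if not token:
        raise RuntimeError(f"Set {TOKEN_VAR} before running the scraper")
    return token


# Usage, once the variable is exported in your shell:
#   token = load_token()
```

Failing early with a named variable in the message is a small design choice: a missing token surfaces before any request is sent, instead of as a confusing authentication error later.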

Set up the project

Create an isolated virtual environment, then install the libraries the scraper needs.

bash

python --version

python -m venv youtube_env
source youtube_env/bin/activate

pip install crawlbase beautifulsoup4

On Windows, activate with youtube_env\Scripts\activate instead of the source line. Two dependencies do the work: crawlbase is the official client for the Crawling API, and beautifulsoup4 parses the returned HTML so you can pull out individual fields by selector. The standard-library json and csv modules handle export, so there is nothing extra to install for that.

Step 1: Fetch a rendered search page

Start by getting the finished page. Import CrawlingAPI, initialize it with your JS token, and request a public search results URL. Build the query into the standard results?search_query= path, and check the status code before parsing so failures stay loud instead of silent.

python

from urllib.parse import quote_plus
from crawlbase import CrawlingAPI

api = CrawlingAPI({"token": "YOUR_CRAWLBASE_TOKEN"})

def crawl(page_url):
    options = {"ajax_wait": "true", "page_wait": 5000}
    response = api.get(page_url, options)
    if response["status_code"] == 200:
        return response["body"].decode("utf-8")
    print(f"Request failed: {response['status_code']}")
    return None

def search_url(query):
    return f"https://www.youtube.com/results?search_query={quote_plus(query)}"

if __name__ == "__main__":
    query = "data scraping tutorial"
    html = crawl(search_url(query))
    print(html[:500] if html else "No HTML returned")

The two wait options matter for a client-rendered target. ajax_wait tells the API to wait for asynchronous content to finish loading, and page_wait holds for a fixed number of milliseconds after load so the late-rendering results list appears before the page is captured. Five seconds is a reasonable starting point; raise it if results come back empty. Run the script and you should see real YouTube markup, which confirms rendering works before you write a single selector.
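
If you would rather not tune page_wait by hand, one option is to retry with progressively longer waits until the page looks rendered. This is a minimal sketch under stated assumptions: fetch is a hypothetical wrapper you write around your client that accepts the options dict, is_rendered is your own readiness check, and the wait values are arbitrary starting points. The demo uses a fake fetcher so it runs without a token.

```python
def fetch_until_rendered(fetch, url, is_rendered, waits=(5000, 8000, 12000)):
    """Retry with longer page_wait values until the HTML looks rendered.

    fetch(url, options) -> HTML string or None (hypothetical client wrapper)
    is_rendered(html)   -> bool (your own readiness check)
    Returns (html, wait_ms) on success, or (None, None) if every wait fails.
    """
    for wait_ms in waits:
        options = {"ajax_wait": "true", "page_wait": wait_ms}
        html = fetch(url, options)
        if html and is_rendered(html):
            return html, wait_ms
    return None, None


# Fake fetcher standing in for the real API call: it only "renders"
# results once page_wait reaches 8000 ms.
def fake_fetch(url, options):
    if options["page_wait"] < 8000:
        return "<html><body></body></html>"  # empty shell
    return '<html><a href="/watch?v=abc123">video</a></html>'


html, used_wait = fetch_until_rendered(
    fake_fetch,
    "https://www.youtube.com/results?search_query=demo",
    is_rendered=lambda page: "watch?v=" in page,
)
print(used_wait)  # prints 8000
```

Checking for a watch link is only a heuristic for "results are present"; swap in whatever marker you trust for the pages you fetch. Each retry is another billable request, so keep the list of waits short.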


Step 2: Parse the public fields

With rendered HTML in hand, the most stable signal on a YouTube page is the embedded ytInitialData JSON object that the page ships with its scripts. It carries the same fields YouTube uses to render the results list: titles, channel names, view-count text, published-time text, and video IDs. Parsing that object is far more durable than chasing deeply nested, frequently renamed CSS classes. Load the HTML into BeautifulSoup, pull the script that defines ytInitialData, and walk it for video renderers.

python

import json
import re
from bs4 import BeautifulSoup

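def load_initial_data(html):
    soup = BeautifulSoup(html, "html.parser")
    decoder = json.JSONDecoder()
    for script in soup.find_all("script"):
        text = script.string or ""
        # Locate the assignment, stopping right before the opening brace.
        match = re.search(r"ytInitialData\s*=\s*(?=\{)", text)
        if not match:
            continue
        try:
            # raw_decode reads exactly one JSON value and ignores what follows.
            data, _ = decoder.raw_decode(text, match.end())
        except json.JSONDecodeError:
            continue
        return data
    return {}

def text_of(node):
    if not node:
        return None
    if "simpleText" in node:
        return node["simpleText"]
    runs = node.get("runs", [])
    return "".join(r["text"] for r in runs) if runs else None

The load_initial_data helper finds the ytInitialData assignment with a regex, then hands the text to json.JSONDecoder.raw_decode, which reads exactly one JSON object and stops at its closing brace. That is safer than a non-greedy pattern ending in };, which would cut the object short if any string inside the JSON happened to contain that sequence. The text_of helper handles YouTube's two text shapes: some fields are a plain simpleText string, others are a list of runs that you join. With those two helpers in place, extracting each video becomes a simple walk over the search renderers.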

Step 3: Extract one record per video

YouTube nests search results under a long path of section and item renderers. Each playable result is a videoRenderer that carries the title, the channel name (the ownerText or longBylineText), the viewCountText, the publishedTimeText, and the videoId. Walk the structure, collect every videoRenderer, and map each one to a flat record.

python

def find_video_renderers(node, found):
    if isinstance(node, dict):
        if "videoRenderer" in node:
            found.append(node["videoRenderer"])
        for value in node.values():
            find_video_renderers(value, found)
    elif isinstance(node, list):
        for item in node:
            find_video_renderers(item, found)
    return found

def parse_search(html):
    data = load_initial_data(html)
    renderers = find_video_renderers(data, [])
    results = []
    for v in renderers:
        video_id = v.get("videoId")
        if not video_id:
            continue
        channel = text_of(v.get("ownerText")) or text_of(v.get("longBylineText"))
        results.append({
            "title": text_of(v.get("title")),
            "channel": channel,
            "views": text_of(v.get("viewCountText")),
            "published": text_of(v.get("publishedTimeText")),
            "link": f"https://www.youtube.com/watch?v={video_id}",
        })
    return results

The recursive find_video_renderers walk avoids hardcoding the exact nesting path, which YouTube reorders from time to time; it simply collects every videoRenderer wherever it appears. Each record carries exactly the five public fields you set out to gather: title, channel, views, published date, and link. These are content and demand signals, not personal data about any viewer.
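
To see the walk in isolation, here is a small self-contained run against a hand-written stand-in for ytInitialData. The two helpers are repeated from the steps above so the block runs on its own. The wrapper keys and all values in sample are illustrative and simplified, not a copy of YouTube's real nesting.

```python
def text_of(node):
    """Read either a simpleText string or a joined list of runs."""
    if not node:
        return None
    if "simpleText" in node:
        return node["simpleText"]
    runs = node.get("runs", [])
    return "".join(r["text"] for r in runs) if runs else None


def find_video_renderers(node, found):
    """Collect every videoRenderer dict, wherever it sits in the tree."""
    if isinstance(node, dict):
        if "videoRenderer" in node:
            found.append(node["videoRenderer"])
        for value in node.values():
            find_video_renderers(value, found)
    elif isinstance(node, list):
        for item in node:
            find_video_renderers(item, found)
    return found


# Illustrative structure only: real pages nest deeper and use other keys.
sample = {
    "contents": {
        "sectionListRenderer": {
            "contents": [
                {"itemSectionRenderer": {"contents": [
                    {"videoRenderer": {
                        "videoId": "abc123",
                        "title": {"runs": [{"text": "Web Scraping "},
                                           {"text": "Tutorial"}]},
                        "ownerText": {"runs": [{"text": "Example Channel"}]},
                        "viewCountText": {"simpleText": "1.2M views"},
                    }},
                    {"otherRenderer": {"note": "skipped by the walk"}},
                ]}},
            ]
        }
    }
}

for v in find_video_renderers(sample, []):
    print(text_of(v["title"]), "|", text_of(v["ownerText"]), "|",
          text_of(v["viewCountText"]))
# prints: Web Scraping Tutorial | Example Channel | 1.2M views
```

The sample mixes both text shapes on purpose: the title arrives as runs while the view count arrives as simpleText, and text_of returns a plain string for each.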

Selectors drift

YouTube changes its markup and internal field names without notice, which is why this code leans on the ytInitialData object and renderer names rather than brittle nested CSS classes. When a field comes back as None, re-inspect the live page in your browser's dev tools and update the key. Periodic maintenance is normal for any production scraper, not a sign something is broken.
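
A cheap way to notice drift early is to count empty fields after every run. The helper below is a sketch, not part of the scraper above; missing_field_counts is a name invented for this example, and the sample records are made up.

```python
from collections import Counter

FIELDS = ("title", "channel", "views", "published", "link")


def missing_field_counts(records, fields=FIELDS):
    """Count how many records have an empty or missing value per field."""
    counts = Counter()
    for record in records:
        for field in fields:
            if not record.get(field):
                counts[field] += 1
    return dict(counts)


# Made-up records: every "published" value is missing, one channel is missing.
sample = [
    {"title": "A", "channel": "X", "views": "10 views", "published": None,
     "link": "https://www.youtube.com/watch?v=a"},
    {"title": "B", "channel": None, "views": "5 views", "published": None,
     "link": "https://www.youtube.com/watch?v=b"},
]
print(missing_field_counts(sample))
```

A field that is empty on every record usually points at a renamed key, while an occasional gap may simply mean that result does not carry the field.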

Step 4: Put it together and export JSON and CSV

Now wire fetch, parse, and export into one runnable script. It runs a list of research queries, collects the public fields for each, and writes both a JSON file and a CSV file so the data drops straight into a spreadsheet or a notebook.

python

import csv
import json
import time
from urllib.parse import quote_plus
from crawlbase import CrawlingAPI

api = CrawlingAPI({"token": "YOUR_CRAWLBASE_TOKEN"})

def main():
    queries = ["data scraping tutorial", "python web scraping"]
    rows = []
    for query in queries:
        html = crawl(search_url(query))
        if not html:
            continue
        for record in parse_search(html)[:10]:
            record["query"] = query
            rows.append(record)
        time.sleep(3)

    with open("youtube_research.json", "w", encoding="utf-8") as f:
        json.dump(rows, f, indent=2, ensure_ascii=False)

    fields = ["query", "title", "channel", "views", "published", "link"]
    with open("youtube_research.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)

    print(f"Saved {len(rows)} videos across {len(queries)} queries")

if __name__ == "__main__":
    main()

The time.sleep(3) between queries is not decoration. Pacing is the single biggest factor in whether a run stays healthy, and we will come back to it. The slice [:10] keeps only the top 10 results per query, which keeps the demo focused. Combine this with the earlier crawl, search_url, and parse_search functions in one file and it runs end to end.

What the output looks like

Run the full script and you get a clean record per video, ready to sort by views, group by channel, or scan for the title patterns that win a query.

json

[
  {
    "title": "Web Scraping Tutorial | Data Scraping from Websites to Excel",
    "channel": "Data Analytics",
    "views": "1.2M views",
    "published": "2 years ago",
    "link": "https://www.youtube.com/watch?v=aClnnoQK9G0",
    "query": "data scraping tutorial"
  },
  {
    "title": "Beginners Guide To Web Scraping with Python",
    "channel": "Coding Channel",
    "views": "480K views",
    "published": "1 year ago",
    "link": "https://www.youtube.com/watch?v=QhD015WUMxE",
    "query": "data scraping tutorial"
  }
]

The view counts and published-time strings come straight from YouTube as display text ("1.2M views", "2 years ago"). For analysis, normalize them in a later pass: strip "views" and expand the M and K suffixes into integers, and convert relative dates into approximate absolute ones. Keeping the raw strings in the export means you never lose the original signal.
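
The normalization pass might look like the sketch below. It assumes English display text in the shapes shown above ("1.2M views", "2 years ago"); the function names are invented for this example, and months and years are approximated as 30 and 365 days, so treat the resulting dates as rough.

```python
import re
from datetime import date, timedelta
from decimal import Decimal

SUFFIXES = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}
# Rough day counts per unit; sub-day units collapse to "today".
UNIT_DAYS = {"second": 0, "minute": 0, "hour": 0,
             "day": 1, "week": 7, "month": 30, "year": 365}


def parse_views(text):
    """'1.2M views' -> 1200000, '1,234 views' -> 1234, no number -> None."""
    if not text:
        return None
    match = re.search(r"(\d[\d,]*(?:\.\d+)?)\s*([KMB])?\b", text)
    if not match:
        return None
    # Decimal avoids float rounding surprises on values like 1.2.
    number = Decimal(match.group(1).replace(",", ""))
    return int(number * SUFFIXES.get(match.group(2), 1))


def approx_date(text, today=None):
    """'2 years ago' -> a date roughly that far before today, else None."""
    if not text:
        return None
    match = re.search(
        r"(\d+)\s+(second|minute|hour|day|week|month|year)s?\s+ago", text)
    if not match:
        return None
    today = today or date.today()
    days = int(match.group(1)) * UNIT_DAYS[match.group(2)]
    return today - timedelta(days=days)


print(parse_views("1.2M views"), parse_views("480K views"))
# prints: 1200000 480000
print(approx_date("2 years ago", today=date(2024, 1, 1)))
# prints: 2022-01-01
```

Write the parsed values into new columns rather than overwriting the originals, so the raw display text stays available if the parsing rules need to change.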

Turning the data into content and SEO research

The point of this scrape is not the raw rows but what they tell you about a topic. A few practical reads:

Title patterns that rank. Group the top results per query and look at how the winning titles are phrased: the modifiers, the brackets, the numbers, the promise. That is the language an audience clicks on for that keyword.
Demand by view count. Sort by views to see which sub-topics pull the most attention. High view counts on older videos with no recent challenger often signal an opening for fresh content.
Freshness gaps. The publish date column shows how old the ranking results are. A query dominated by videos from years ago is a candidate for an up-to-date take.
Topic ownership. Counting how often each channel appears across your queries shows who already owns a theme, which informs both competitive analysis and partnership ideas.
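
Two of these reads, topic ownership and demand by view count, take only a few lines of Python if you prefer a script to a spreadsheet. The sketch below assumes each row already has a numeric views_int column from a normalization pass; that column name and all sample values are illustrative.

```python
from collections import Counter

# Illustrative rows in the export shape, plus an assumed views_int column.
rows = [
    {"query": "data scraping tutorial", "channel": "Data Analytics",
     "title": "Web Scraping Tutorial", "views_int": 1_200_000},
    {"query": "data scraping tutorial", "channel": "Coding Channel",
     "title": "Beginners Guide To Web Scraping with Python", "views_int": 480_000},
    {"query": "python web scraping", "channel": "Coding Channel",
     "title": "Python Web Scraping Basics", "views_int": 95_000},
]


def channel_share(rows):
    """Topic ownership: how often each channel appears across all queries."""
    return Counter(r["channel"] for r in rows).most_common()


def top_by_views(rows, n=5):
    """Demand: the n rows with the highest view counts."""
    return sorted(rows, key=lambda r: r["views_int"], reverse=True)[:n]


print(channel_share(rows))
# prints: [('Coding Channel', 2), ('Data Analytics', 1)]
print([r["title"] for r in top_by_views(rows, n=2)])
# prints: ['Web Scraping Tutorial', 'Beginners Guide To Web Scraping with Python']
```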

This pairs naturally with your wider keyword work. If you are building a research pipeline, our guides on using data to improve SEO and how to extract and analyze Google SEO data cover how to fold a source like this into a full picture.

Scaling and pagination

One search page returns the first batch of results, which is usually enough for research. If you need more depth, run a broader set of queries rather than trying to paginate a single one: a list of related keywords gives you wider coverage than scrolling one result set, and it maps better onto how you actually plan content. To enrich a specific video, fetch its watch URL with the same crawl function and parse its page for the description and exact metadata, again using the embedded JSON rather than fragile selectors.
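
When you widen the query list, the same video can show up under more than one query. A sketch of one way to handle that is to merge rows by link and keep the list of queries that surfaced each video. The merge_by_link name and the sample rows are invented for this example.

```python
def merge_by_link(rows):
    """Collapse rows sharing a link, keeping every query that surfaced it."""
    merged = {}
    for row in rows:
        # First sighting of a link seeds the entry; later ones only add queries.
        entry = merged.setdefault(row["link"], {**row, "queries": []})
        if row["query"] not in entry["queries"]:
            entry["queries"].append(row["query"])
    # Drop the single-query field now that "queries" replaces it.
    return [{k: v for k, v in e.items() if k != "query"}
            for e in merged.values()]


sample = [
    {"query": "data scraping tutorial", "title": "T",
     "link": "https://www.youtube.com/watch?v=a"},
    {"query": "python web scraping", "title": "T",
     "link": "https://www.youtube.com/watch?v=a"},
    {"query": "data scraping tutorial", "title": "U",
     "link": "https://www.youtube.com/watch?v=b"},
]
for video in merge_by_link(sample):
    print(video["title"], video["queries"])
```

A video that appears under several related queries is itself a signal: it is ranking across the topic, not for one phrase.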

Keep volume proportional to the research question. You rarely need every result for a query; the top results carry most of the signal, and a small, well-chosen set of queries beats an exhaustive crawl for both quality and good citizenship.

Staying unblocked

Even with rendering handled by the Crawling API, YouTube watches for scraper-shaped traffic. A few habits keep a run healthy, and they apply to any heavily defended target.

Pace your requests. Hammering pages in a tight loop is the fastest way to get throttled. Add real delays, as in the time.sleep above, and resist the urge to parallelize aggressively.
Lean on rotation. A pool of residential IPs spreads requests across many real-user addresses so no single one trips a rate limit. The Crawling API handles this for you; if you build your own stack, this is the part to get right.
Read the status codes. A run that starts returning challenges or errors is telling you the current rate or IP tier is no longer enough. Back off rather than pushing harder.
Keep volume low and queries varied. Content research does not require crawling YouTube exhaustively. Sample the queries that matter and stop.
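
The pacing and back-off habits above can be sketched as a small loop. This is one possible shape, not a prescribed one: fetch is a hypothetical callable returning (status_code, body), and the delay values are starting points to tune. The sleep function is injectable so the logic can be exercised without waiting.

```python
import random
import time


def backoff_delay(attempt, base=3.0, cap=60.0):
    """Exponential backoff: base, 2*base, 4*base, ... capped at cap seconds."""
    return min(cap, base * (2 ** attempt))


def polite_sleep(min_s=3.0, max_s=6.0, sleep=time.sleep):
    """Sleep a random interval so requests do not land on a fixed beat."""
    delay = random.uniform(min_s, max_s)
    sleep(delay)
    return delay


def run_paced(urls, fetch, max_retries=3, sleep=time.sleep):
    """fetch(url) -> (status_code, body); hypothetical wrapper around a client.

    Failed requests are retried with growing delays; every URL is followed
    by a randomized pause before the next one.
    """
    results = {}
    for url in urls:
        for attempt in range(max_retries + 1):
            status, body = fetch(url)
            if status == 200:
                results[url] = body
                break
            if attempt < max_retries:
                sleep(backoff_delay(attempt))  # back off instead of pushing
        polite_sleep(sleep=sleep)
    return results


print([backoff_delay(n) for n in range(6)])
# prints: [3.0, 6.0, 12.0, 24.0, 48.0, 60.0]
```

The randomized pause addresses the repetitive request pattern mentioned earlier, and the cap keeps a long failure streak from stalling a run indefinitely.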

For the broader playbook, see our guide on how to scrape websites without getting blocked and the deep dive on how to scrape JavaScript pages with Python.

Is it legal to scrape YouTube?

This is the section to read before you write production code. YouTube is owned by Google, and its Terms of Service restrict automated access and place clear limits on collecting data from the platform. Automated scraping can run against those terms regardless of how careful your tooling is, and none of the code above changes that. It only makes the technical part work. Read YouTube's Terms of Service and its robots.txt, respect any rate limits, and treat both as the boundary for what you collect.

The honest, restrictive rules to hold to. Collect only public data that anyone can see without logging in: video titles, channel names, view counts, publish dates, and video links, exactly the aggregate, content-level signals this guide gathers. Do not scrape anything behind a login, private or unlisted videos, members-only content, or individual user comments and the handles attached to them. Treat comments, usernames, and any viewer detail as personal data; where personal data is involved, privacy regimes such as the GDPR and CCPA apply, which means you need a lawful basis and you must honor deletion requests. Never bypass authentication or download copyrighted media to redistribute it. Those are bright lines, and this guide stays on the public, non-personal side of all of them by design.

For any real, ongoing, or commercial use, the right tool is the official YouTube Data API. It is the sanctioned route Google provides, gives you guaranteed structure for titles, view counts, channel data, and search, and keeps you inside the platform's terms with a clear quota. This article is a technical walkthrough scoped narrowly to public, non-personal data for content and SEO research. It is not an endorsement of mass data collection, and it does not cover anything behind a login. If your project needs more than a small sample of public fields, the Data API or a formal agreement is the correct path, not a cleverer scraper.
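
For comparison, here is roughly what the sanctioned route looks like for the same search. The sketch only builds the request URL for the Data API v3 search endpoint; YOUR_API_KEY is a placeholder, data_api_search_url is a name invented for this example, and you should confirm parameters and quota costs against the official reference before relying on them.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://www.googleapis.com/youtube/v3/search"


def data_api_search_url(query, api_key, max_results=10):
    """Build a search.list request URL for public video results."""
    params = {
        "part": "snippet",          # titles, channel titles, publish dates
        "q": query,
        "type": "video",
        "maxResults": max_results,
        "key": api_key,
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


print(data_api_search_url("data scraping tutorial", "YOUR_API_KEY"))
```

Search results from the API carry snippet fields rather than view counts, so getting views typically means a follow-up videos.list call with part=statistics for the returned video IDs.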

Recap

Key takeaways

YouTube is client-side rendered and bot-defended. A plain fetch returns an empty shell, so you must render the page before you parse it.
Rendering and a trusted IP belong in one call. The Crawling API with a JS token does both; ajax_wait and page_wait control how long it waits for content.
Parse the embedded JSON. The ytInitialData object and renderer names are far more durable than brittle nested CSS classes.
Five public fields drive the research. Title, channel, views, publish date, and link are content and demand signals, not personal data about viewers.
Pace, rotate, and prefer the Data API. Keep volume low, lean on residential rotation, and use the official YouTube Data API for anything real or commercial.

Frequently Asked Questions (FAQs)

Why does a plain fetch return no data from YouTube?

Because YouTube renders its search and video content client-side with JavaScript. The initial HTML is a shell that only fills in after the page's scripts run in a browser, so a raw HTTP request returns a near-empty body. To get the real public results you have to render the page first, which is what the Crawling API's JS token handles for you.

Do I need the normal token or the JS token for YouTube?

The JS token. The normal token fetches static HTML, which on YouTube is the same empty shell a plain fetch returns. The JS token renders the page in a real browser before handing back the HTML, so the embedded ytInitialData object and the results it describes are present when you parse them.

What YouTube data is safe to scrape for SEO research?

Public, non-personal, content-level fields: video titles, channel names, public view counts, publish dates, and video links from public search and video pages. Those are the signals that tell you how a topic ranks and what an audience clicks. Individual user comments, the handles attached to them, private or members-only content, and anything behind a login are off limits, because they are personal data or restricted by the platform's terms.

How do I turn the scraped data into keyword research?

Group the results by query and study the patterns. Title phrasing across the top results shows the language that ranks for a keyword; view counts rank sub-topics by demand; publish dates expose freshness gaps you can fill; and channel frequency shows who already owns a theme. Export to CSV and the analysis is a few spreadsheet sorts away.

Should I use the official YouTube Data API or scrape the site?

For any real, ongoing, or commercial use, use the official YouTube Data API. It is the sanctioned route, gives guaranteed structure, includes a clear quota, and keeps you inside Google's terms. Scraping a small sample of public fields with the approach here fits lightweight, one-off content research where no API access is in place, as long as you respect the terms, robots.txt, and rate limits.

How do I avoid getting blocked while scraping YouTube?

Keep your per-IP request rate low, add real delays between requests, vary your queries instead of crawling one exhaustively, and route through rotating residential IPs so no single address trips a rate limit. The Crawling API manages rotation and a trusted IP pool for you. Watch the status codes and back off the moment you start seeing challenges.

Hassan Rehan

Software Engineer · Crawlbase

Software engineer at Crawlbase writing hands-on guides on rotating proxies, scraping, and the practical details of wiring proxies into real code.
