Scraping a single page from a modern web app is one problem. Crawling the whole site is a different one, and it is harder than most tutorials admit. When you set out to crawl JavaScript websites built with React, Vue, Angular, or any framework that fills the page in the browser, you hit two obstacles that compound each other. Each page only shows its real content after JavaScript runs, and the navigation you would normally follow to discover more pages is itself drawn by JavaScript, so a plain HTTP fetch hands you a near-empty document with no links to walk.
This guide shows you how to build a working crawler that walks a JavaScript-rendered site end to end. You will render each page so its links and content appear, parse those links with BeautifulSoup, keep a frontier queue and a visited set so the walk terminates, and throttle politely so you stay welcome. The fetching is done through the Crawlbase Crawling API with a JavaScript token, which renders each page behind a trusted IP and returns finished HTML. For large jobs we also cover the asynchronous Crawler so you do not block on every render.
Why crawling a JS site is two problems, not one
A traditional crawler is a tight loop: fetch a URL, extract the anchors, push the new ones onto a queue, repeat. That loop assumes the HTML you fetch already contains both the content and the links. On a server-rendered site it does. On a client-rendered site it does not.
The first problem is rendering. When you request a React or Vue route with a bare HTTP client, the server returns a shell: a root <div>, a bundle of script tags, and almost nothing else. The article text, the product grid, the table you wanted, all of it is injected after the browser downloads and executes the JavaScript. No browser, no content.
The second problem is link discovery, and it is the one that quietly breaks naive crawlers. The site's navigation, pagination, and "related" links are often rendered client-side too. So even if you only wanted the links and not the content, a plain fetch still gives you nothing to follow. The crawl dies on the first page because the frontier never grows past it. To crawl a JavaScript site you have to render every page, not because you always need the body, but because rendering is what makes the links exist at all.
The single rule that makes JS crawling work: render each page before you look for links. The content and the navigation appear in the same render pass, so once you have finished HTML you can extract both the data you want and the URLs to follow next, with the same parser.
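To make the two problems concrete, here is a minimal sketch that counts anchors in two hand-written HTML strings: one standing in for the shell a bare fetch returns, one for the same route after rendering. Both strings and the `find_hrefs` helper are illustrative assumptions, and the standard library parser is used only to keep the sketch dependency-free.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect href values from <a> tags in an HTML string."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def find_hrefs(html):
    parser = AnchorCollector()
    parser.feed(html)
    return parser.hrefs


# Hypothetical response to a bare HTTP fetch of a client-rendered route.
SHELL_HTML = """
<html><head><script src="/static/bundle.js"></script></head>
<body><div id="root"></div></body></html>
"""

# Hypothetical HTML for the same route after the JavaScript has run.
RENDERED_HTML = """
<html><body><div id="root">
  <nav><a href="/products">Products</a> <a href="/about">About</a></nav>
  <h1>Welcome</h1>
</div></body></html>
"""

shell_links = find_hrefs(SHELL_HTML)
rendered_links = find_hrefs(RENDERED_HTML)
print(len(shell_links), len(rendered_links))  # 0 2
```

With zero anchors in the shell, a frontier seeded from it never grows, which is the failure described above.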
What you will build
A breadth-first crawler in Python that starts from a seed URL on a JavaScript-rendered site and walks outward, staying inside one domain. Concretely it will:
- Render each page through the Crawling API with a JS token, so content and links are both present.
- Extract links from the rendered HTML with BeautifulSoup and normalize them to absolute, same-domain URLs.
- Manage a frontier queue of URLs to visit and a visited set so nothing is fetched twice and the walk terminates.
- Throttle politely with a delay between requests and a cap on how many pages it visits.
Prerequisites
You need a few things in place before writing any code. None take long.
Basic Python. You should be comfortable running a script and installing packages with pip. If queues and sets are familiar, you are ready.
Python 3.8 or later. Confirm with python --version. Install from python.org if you do not have it.
A Crawlbase account and JS token. Sign up, open your dashboard, and copy your JavaScript (JS) token from the account docs page. The JS token is the one that renders pages in a real browser; the normal token only fetches static HTML and would hand you the same empty shell a plain fetch returns. Keep the token out of version control.
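One common way to keep the token out of version control is to read it from an environment variable instead of hardcoding it. The sketch below assumes a variable named `CRAWLBASE_JS_TOKEN`; that name and the `load_js_token` helper are hypothetical choices, not something Crawlbase requires.

```python
import os


def load_js_token(env_var="CRAWLBASE_JS_TOKEN"):
    """Return the JS token from the environment, or fail with a clear message."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable to your JS token")
    return token
```

You could then pass `load_js_token()` wherever the examples in this guide use the placeholder string `"YOUR_CRAWLBASE_JS_TOKEN"`.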
Set up the project
Create a virtual environment so dependencies stay isolated, then install the two libraries the crawler needs.
    python --version
    python -m venv crawler_env
    source crawler_env/bin/activate
    pip install crawlbase beautifulsoup4
On Windows, activate with crawler_env\Scripts\activate instead of the source line. The crawlbase package is the official client for the Crawling API, and beautifulsoup4 parses the returned HTML so you can extract both anchors and content.
Step 1: Render a single page and confirm the links appear
Before building the loop, prove the hard part works: that rendering a client-side page surfaces links a plain fetch would miss. Initialize the client with your JS token and request one URL, asking the API to wait for asynchronous content.
    from crawlbase import CrawlingAPI

    api = CrawlingAPI({"token": "YOUR_CRAWLBASE_JS_TOKEN"})


    def render(page_url):
        # Wait for async content, then hold 5000 ms so late elements can appear.
        options = {"ajax_wait": "true", "page_wait": 5000}
        response = api.get(page_url, options)
        if response["status_code"] == 200:
            return response["body"].decode("utf-8")
        print(f"Request failed: {response['status_code']}")
        return None


    if __name__ == "__main__":
        html = render("https://example.com/")
        print(len(html) if html else "No HTML returned")
The two wait options matter for client-rendered targets. ajax_wait tells the API to wait for asynchronous content to finish loading, and page_wait holds for a fixed number of milliseconds after load so late-rendering elements appear before capture. Five seconds is a reasonable start; raise it if a page's links come back empty. Compare the length of this rendered body against a plain requests.get on the same URL and you will usually see the rendered version is far larger, because the navigation and content are now present.
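The advice to raise the wait when links come back empty can be automated. The sketch below retries with longer `page_wait` values until the HTML contains at least one anchor. The `fetch(url, options)` callable is a hypothetical stand-in for a wrapper around `api.get`, the wait values are arbitrary, and the `"<a "` substring test is a deliberately crude check.

```python
def render_with_escalating_wait(fetch, url, waits=(5000, 10000, 15000)):
    """Try successively longer page_wait values until the HTML contains a link.

    `fetch(url, options)` is a hypothetical callable returning HTML or None.
    Returns (html, wait_ms_used); html may still lack links if every try failed.
    """
    html = None
    for wait_ms in waits:
        options = {"ajax_wait": "true", "page_wait": wait_ms}
        html = fetch(url, options)
        # Crude signal that navigation rendered: at least one anchor tag.
        if html and "<a " in html.lower():
            return html, wait_ms
    return html, waits[-1]
```

Each retry is another rendered request, so a helper like this probably belongs on the pages that actually come back empty rather than on every fetch.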
Step 2: Extract and normalize links
With rendered HTML in hand, pull the anchors out and turn them into clean, absolute URLs you can compare and queue. Two details keep the crawl sane: resolve relative hrefs against the page they came from, and strip URL fragments so /page and /page#section are not treated as two pages.
    from urllib.parse import urljoin, urldefrag, urlparse
    from bs4 import BeautifulSoup


    def extract_links(html, base_url, domain):
        soup = BeautifulSoup(html, "html.parser")
        links = set()
        for a in soup.select("a[href]"):
            href = urljoin(base_url, a["href"])  # resolve relative hrefs
            href, _ = urldefrag(href)            # drop any #fragment
            parsed = urlparse(href)
            # Keep only web links that stay on the seed's domain.
            if parsed.scheme in ("http", "https") and parsed.netloc == domain:
                links.add(href)
        return links
The same-domain check (parsed.netloc == domain) keeps the crawler from wandering off onto external sites, which is the difference between crawling one site and accidentally trying to crawl the whole web. Returning a set deduplicates links found multiple times on a single page. Because you extracted these from rendered HTML, they include the links that JavaScript drew, which is exactly what a plain-fetch crawler would have missed.
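To see what the normalization does to individual hrefs, here is a small sketch that applies the same three steps (join, defragment, filter) to a handful of made-up values. The `normalize` helper and the sample hrefs are illustrative assumptions.

```python
from urllib.parse import urljoin, urldefrag, urlparse


def normalize(href, base_url, domain):
    """Return an absolute, fragment-free, same-domain URL, or None to skip it."""
    absolute, _fragment = urldefrag(urljoin(base_url, href))
    parsed = urlparse(absolute)
    if parsed.scheme in ("http", "https") and parsed.netloc == domain:
        return absolute
    return None


base = "https://example.com/blog/post"
samples = ["../about", "page#top", "/page", "https://other.com/x",
           "mailto:hi@example.com", "javascript:void(0)"]
for href in samples:
    print(f"{href!r:28} -> {normalize(href, base, 'example.com')}")
```

The first three resolve to `https://example.com/about`, `https://example.com/blog/page`, and `https://example.com/page`; the last three return `None`. Because the check is an exact string comparison on `netloc`, `www.example.com` and `example.com` count as different domains, so seed the crawl with the host form the site actually links to.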
Step 3: Manage the frontier and visited set
Now the core of any crawler: a frontier of URLs waiting to be visited and a visited set of URLs already seen. Without the visited set a real site full of mutual links would loop forever; without a page cap a large site would run until you run out of budget. Both guardrails belong in every crawler you write.
    import time
    from collections import deque


    def crawl_site(seed_url, max_pages=50, delay=2.0):
        domain = urlparse(seed_url).netloc
        frontier = deque([seed_url])   # URLs waiting to be visited
        visited = set()                # URLs already popped, success or not
        pages = []
        while frontier and len(visited) < max_pages:
            url = frontier.popleft()
            if url in visited:
                continue
            visited.add(url)
            html = render(url)
            if not html:
                continue
            pages.append({"url": url, "html": html})
            print(f"[{len(visited)}] crawled {url}")
            for link in extract_links(html, url, domain):
                if link not in visited:
                    frontier.append(link)
            time.sleep(delay)          # pause before the next request
        return pages
A deque used with popleft gives you breadth-first traversal, so the crawler fans out across the site rather than diving deep down one branch. Marking a URL visited the moment you pop it (not after the fetch) means a page that fails to render still counts as seen, so a flaky URL cannot trap the loop. It also means failed pages count toward max_pages, so the returned list can be shorter than the cap. The max_pages cap bounds your budget and the delay between requests sets your pace; tune both to the site and your own limits.
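You can check the termination argument without touching the network by running the same loop over a tiny in-memory link graph. The `SITE` dictionary and the `walk` function below are hypothetical stand-ins for a real site and for `crawl_site`; the graph is deliberately full of cycles.

```python
from collections import deque

# Hypothetical site: pages link back to each other, so there are cycles.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/", "/b", "/c"],
    "/b": ["/", "/a"],
    "/c": ["/a"],
}


def walk(seed, get_links, max_pages=50):
    """Breadth-first walk; get_links(url) returns the links found on that page."""
    frontier = deque([seed])
    visited = set()
    order = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)  # mark on pop, before any fetch would happen
        order.append(url)
        for link in get_links(url):
            if link not in visited:
                frontier.append(link)
    return order


print(walk("/", lambda url: SITE.get(url, [])))               # ['/', '/a', '/b', '/c']
print(walk("/", lambda url: SITE.get(url, []), max_pages=2))  # ['/', '/a']
```

Every page is visited once despite the mutual links, and the second call shows the page cap cutting the walk short.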
Before crawling at any volume, read the target's robots.txt and honor its disallow rules and crawl-delay. Python's standard library urllib.robotparser can check a URL against the rules in a few lines. Polite pacing and staying out of disallowed paths is what keeps a crawler welcome rather than blocked.
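As a rough illustration of the robots.txt check, the sketch below feeds a made-up robots.txt to `urllib.robotparser` and asks whether two URLs are allowed. The rules, the `my-crawler` user-agent name, and the `allowed` helper are assumptions for the example; against a live site you would call `set_url()` and `read()` to download the real file instead of `parse()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed locally so the sketch needs no network.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

USER_AGENT = "my-crawler"  # hypothetical user-agent name

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())


def allowed(url):
    """True if the parsed rules permit USER_AGENT to fetch this URL."""
    return robots.can_fetch(USER_AGENT, url)


print(allowed("https://example.com/products"))        # True
print(allowed("https://example.com/private/report"))  # False
print(robots.crawl_delay(USER_AGENT))                 # 5
```

In the crawl loop, one option is to skip any URL for which `allowed(url)` is false before rendering it, and to use the larger of your own delay and the site's crawl-delay.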
Step 4: Put it together
Wire the renderer, the link extractor, and the frontier loop into one runnable script. This version also pulls the page title from each rendered page so you can see real content coming back, proof that rendering is doing its job across the whole walk.
    import json
    import time
    from collections import deque
    from urllib.parse import urljoin, urldefrag, urlparse

    from crawlbase import CrawlingAPI
    from bs4 import BeautifulSoup

    api = CrawlingAPI({"token": "YOUR_CRAWLBASE_JS_TOKEN"})


    def render(page_url):
        # Wait for async content, then hold 5000 ms so late elements can appear.
        options = {"ajax_wait": "true", "page_wait": 5000}
        response = api.get(page_url, options)
        if response["status_code"] == 200:
            return response["body"].decode("utf-8")
        print(f"Request failed: {response['status_code']}")
        return None


    def extract_links(html, base_url, domain):
        soup = BeautifulSoup(html, "html.parser")
        links = set()
        for a in soup.select("a[href]"):
            # Resolve relative hrefs, then drop any #fragment.
            href, _ = urldefrag(urljoin(base_url, a["href"]))
            parsed = urlparse(href)
            if parsed.scheme in ("http", "https") and parsed.netloc == domain:
                links.add(href)
        return links


    def title_of(html):
        soup = BeautifulSoup(html, "html.parser")
        return soup.title.get_text(strip=True) if soup.title else None


    def crawl_site(seed_url, max_pages=50, delay=2.0):
        domain = urlparse(seed_url).netloc
        frontier = deque([seed_url])   # URLs waiting to be visited
        visited = set()                # URLs already popped, success or not
        results = []
        while frontier and len(visited) < max_pages:
            url = frontier.popleft()
            if url in visited:
                continue
            visited.add(url)
            html = render(url)
            if not html:
                continue
            results.append({"url": url, "title": title_of(html)})
            print(f"[{len(visited)}] {url}")
            for link in extract_links(html, url, domain):
                if link not in visited:
                    frontier.append(link)
            time.sleep(delay)          # pause before the next request
        return results


    def main():
        pages = crawl_site("https://example.com/", max_pages=25)
        with open("crawl.json", "w") as f:
            json.dump(pages, f, indent=2)
        print(f"Crawled {len(pages)} pages")


    if __name__ == "__main__":
        main()
Run it with python crawler.py and you will see one numbered line per page as the crawl fans out from the seed, stopping when the frontier empties or the page cap is reached. The output is a JSON file listing every page that rendered successfully, with its URL and title. Swap title_of for a real extraction function and you have a full content crawler. If you want a deeper walkthrough of parsing the body of a single rendered page, see how to scrape JavaScript pages with Python.
Scaling up with the asynchronous Crawler
The synchronous loop above is perfect for tens or low hundreds of pages, but it has a structural ceiling: it blocks on every render. Each page waits for the API to finish a full browser render before the next request even starts, so a five-second render across a thousand pages is well over an hour of wall-clock time spent waiting, most of it idle.
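The arithmetic behind that estimate is simple enough to sketch. The function below assumes every page takes the full render time and that requests run strictly one after another; `estimated_minutes` is a hypothetical helper, and real render times will vary.

```python
def estimated_minutes(pages, render_seconds=5.0, delay_seconds=2.0):
    """Rough wall-clock time for a sequential crawl, in minutes."""
    return pages * (render_seconds + delay_seconds) / 60


print(round(estimated_minutes(1000, delay_seconds=0)))  # 83
print(round(estimated_minutes(1000)))                   # 117
```

A thousand five-second renders come to about 83 minutes on their own, and closer to two hours once the 2-second delay used in the loop above is included.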
For larger jobs, switch to the asynchronous Crawler. Instead of fetching one page and waiting, you push URLs into the Crawler, and Crawlbase renders them on its own infrastructure and delivers the finished HTML to a webhook callback you control. Your code stops being a render-and-wait loop and becomes two decoupled halves: a submitter that feeds URLs as fast as you discover them, and a receiver that ingests rendered pages, extracts links, and submits the new ones back. You crawl at the throughput of the Crawler's fleet, not the latency of a single render.
The crawling logic you already wrote carries straight over. The frontier, the visited set, the same-domain check, and the link extraction are identical; only the transport changes from a blocking api.get call to a submit-and-callback flow. For a complete pattern, see extract data using the Crawlbase Crawler. If your stack is on the JVM rather than Python, the same frontier-and-visited design maps cleanly onto building a web crawler in Java.
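The split into a submitter and a receiver can be sketched without committing to any particular transport. In the class below, `submit` is a hypothetical callable that hands a URL to the asynchronous service, and `on_page` is what a webhook handler would call for each delivered page; neither is a Crawlbase API, and the actual request and callback formats should come from the Crawler documentation.

```python
from urllib.parse import urlparse


class AsyncCrawlState:
    """Shared frontier bookkeeping for a submit-and-callback crawl."""

    def __init__(self, seed_url, submit, extract_links, max_pages=1000):
        self.domain = urlparse(seed_url).netloc
        self.submit = submit                # hypothetical: hands a URL to the async service
        self.extract_links = extract_links  # callable(html, base_url, domain) -> iterable of URLs
        self.max_pages = max_pages
        self.seen = set()                   # every URL ever submitted
        self.results = {}                   # url -> rendered HTML
        self.enqueue(seed_url)

    def enqueue(self, url):
        """Submitter half: send each URL at most once, up to the page cap."""
        if url in self.seen or len(self.seen) >= self.max_pages:
            return False
        self.seen.add(url)
        self.submit(url)
        return True

    def on_page(self, url, html):
        """Receiver half: store the page, then submit any newly discovered links."""
        self.results[url] = html
        for link in self.extract_links(html, url, self.domain):
            self.enqueue(link)
```

Here the seen set plays the role of the visited set, applied at submit time rather than at pop time. A real webhook receiver may handle several deliveries at once, so this in-memory state would likely need a lock or a shared store such as a database.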
Common pitfalls when crawling JS sites
A few failure modes show up again and again. Knowing them up front saves a lot of debugging.
- Empty link sets. If extract_links returns nothing on a page you know has navigation, the page probably had not finished rendering. Raise page_wait, and keep ajax_wait on, so late-injected anchors are present when you parse.
- Infinite frontiers. Calendars, faceted filters, and session-id query strings generate endless unique URLs. Normalize away tracking parameters and consider skipping URLs past a depth limit so the crawl actually finishes.
- Crawling off-site. Without the same-domain guard, one external link turns your site crawl into a runaway. Always filter on netloc.
- Hammering the server. No delay means a burst of requests that looks like an attack and earns a block. Keep a sane delay and respect any crawl-delay in robots.txt.
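For the infinite-frontier pitfall, one possible approach is to canonicalize every URL before it reaches the visited set. The sketch below drops a hypothetical list of tracking parameters, sorts what remains, and lowercases the host; the parameter names in `TRACKING_PARAMS` are assumptions and should be adjusted to the site you are crawling.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical list of parameters assumed not to change page content.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid", "ref"}


def canonicalize(url):
    """Drop tracking parameters and the fragment, and sort the remaining query."""
    parts = urlsplit(url)
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    )
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(query), "")
    )


print(canonicalize("https://Example.com/shoes?utm_source=mail&color=red&sessionid=abc#top"))
print(canonicalize("https://example.com/shoes?color=red"))
```

Both calls print `https://example.com/shoes?color=red`, so the two variants collapse into one visited entry. A depth limit can be layered on by storing `(url, depth)` pairs in the frontier and skipping links whose depth exceeds a chosen maximum.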
If you would rather route your own headless-browser traffic through a rotating residential pool instead of using the managed API, the Smart Proxy gives you the same IP rotation as a drop-in proxy endpoint, and you handle rendering yourself.
Key takeaways
- Crawling a JS site is two problems. Each page needs rendering to show content, and the links you follow are JS-built too, so you must render every page to discover the next ones.
- Render before you parse. The Crawling API with a JS token plus ajax_wait and page_wait returns finished HTML, so content and links arrive together.
- A frontier and a visited set are mandatory. A breadth-first queue, a seen-URL set, a same-domain filter, and a page cap are what make the walk terminate.
- Be polite. Delay between requests, honor robots.txt, and normalize URLs so the crawler does not loop on tracking parameters.
- Scale with the async Crawler. For large jobs, submit URLs and receive rendered pages via callback so you crawl at fleet throughput instead of blocking on each render.
Frequently Asked Questions (FAQs)
Why does a plain crawler stop after the first page on a JavaScript site?
Because the navigation links are rendered client-side. A bare HTTP fetch returns a shell with the scripts but none of the anchors the framework draws after it runs, so your link extractor finds nothing to queue and the frontier never grows. Rendering each page first is what makes those links exist, which is why crawling a JS site requires rendering even when you only care about discovering URLs.
Do I need the normal token or the JS token to crawl a JavaScript site?
The JS token. The normal token fetches static HTML, which on a client-rendered site is the empty shell with no content and no rendered links. The JS token runs the page in a real browser before returning the HTML, so both the data and the navigation are present for your parser and your frontier.
How do I stop the crawler from looping forever?
Keep a visited set and check it before every fetch, and mark a URL visited the moment you pop it off the frontier rather than after it succeeds. Add a max_pages cap and a same-domain filter. Together these guarantee the walk terminates even on a site where every page links to every other page.
How is crawling different from scraping a single JS page?
Scraping a single page is one render plus one parse for the fields you want. Crawling is that same render-and-parse repeated across many pages, plus the extra machinery of discovering links, queuing them, deduplicating, and pacing the walk. The rendering technique is shared; crawling adds the frontier, the visited set, and politeness controls on top.
When should I use the asynchronous Crawler instead of a synchronous loop?
Switch to the async Crawler when blocking on each render becomes the bottleneck, typically once you are crawling more than a few hundred pages. Instead of waiting for every render in sequence, you submit URLs and receive finished pages via webhook callbacks, so you crawl at the throughput of Crawlbase's fleet rather than the latency of one render at a time.
How do I crawl politely without getting blocked?
Add a delay between requests, cap how many pages you visit per run, and read the site's robots.txt to honor its disallow rules and crawl-delay. Route requests through rotating residential IPs, which the Crawling API handles for you, so no single address trips a rate limit. Watch the status codes and back off when challenges start appearing.
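Backing off on bad status codes might look like the sketch below. The `fetch(url)` callable returning `(status_code, html)` is a hypothetical wrapper around your request code, and treating 429 and 503 as the throttling signals is an assumption; use whatever statuses the target actually returns when it pushes back.

```python
import time


def fetch_with_backoff(fetch, url, retries=3, base_delay=2.0, sleep=time.sleep):
    """Retry a fetch with exponentially growing pauses on throttling-style statuses.

    `fetch(url)` is a hypothetical callable returning (status_code, html).
    """
    retry_statuses = {429, 503}  # assumed signals of rate limiting or overload
    status, html = fetch(url)
    for attempt in range(retries):
        if status not in retry_statuses:
            break
        sleep(base_delay * (2 ** attempt))  # 2s, 4s, 8s with the defaults
        status, html = fetch(url)
    return status, html
```

The `sleep` parameter exists only so the pauses can be replaced in tests; in normal use the default `time.sleep` applies.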
