Web crawling is the engine behind search engines, price monitors, and almost every large web dataset, but the word hides a lot of engineering. A crawler is not just a loop that downloads pages. It is a system that decides which links to follow, in what order, how fast, and how to avoid downloading the same page twice while still finishing in a reasonable time. Get those decisions right and a crawl scales cleanly to millions of URLs. Get them wrong and it stalls, loops, or gets your IP blocked on the first thousand requests.

This guide walks through the core web crawling techniques that separate a robust crawler from a naive one, then surveys the frameworks teams reach for when they do not want to build that machinery from scratch. By the end you should understand how a crawl actually traverses the web, the trade-offs behind each technique, and which tool fits the job you have in front of you.

What is web crawling?

A web crawler, also called a spider or web robot, automatically discovers and downloads pages by following links. It starts from a set of seed URLs, fetches each page, extracts the links it finds, and adds the new ones to a queue of pages still to visit. Repeat that loop and the crawler walks outward across a site or the wider web, building a record of what it found. Search engines were the original use case: their bots index page content so it can surface in results.
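
The loop described above (seed, fetch, extract links, enqueue) can be sketched in a few lines. This is a minimal illustration rather than a production crawler: the `SITE` dictionary is a hypothetical three-page site held in memory, and `fetch` is an assumed callable that returns HTML for a URL or `None` on failure, so the example runs without touching the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects absolute, fragment-free URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL, then drop "#fragment".
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def extract_links(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def crawl(seeds, fetch, max_pages=100):
    """Seed -> fetch -> extract -> enqueue. `fetch(url)` returns HTML or None."""
    queue = deque(seeds)
    seen = set(seeds)          # URLs already queued, so nothing is fetched twice
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical in-memory site standing in for real HTTP responses.
SITE = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b#top">B</a>',
    "https://example.com/a": '<a href="/">home</a> <a href="/b">B</a>',
    "https://example.com/b": "<p>no links</p>",
}

print(crawl(["https://example.com/"], SITE.get))
```

In a real crawler `fetch` would wrap an HTTP client, and the `seen` set and queue would be replaced by the frontier and deduplication machinery covered later in this guide.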

Along the way a crawler collects more than raw HTML. It records each page's URL, title and meta information, body content, and the outbound links and where they point. It keeps a note of URLs already downloaded so it does not fetch the same page twice, and it can flag broken links or compare versions of a page over time. The same machinery powers practical jobs like archiving sites, building product catalogs, monitoring competitor prices, and tracking mentions across news and social sources.

Crawling and scraping are related but distinct. Crawling is the discovery and traversal step, finding and fetching the pages. Scraping is the extraction step, pulling specific fields out of the markup once you have it. Most real projects do both, but the techniques below are about the crawl: how to traverse the web efficiently, politely, and without getting stuck.

Core web crawling techniques

The techniques in this section are the decisions every serious crawler has to make, whether you write them yourself or inherit them from a framework. They cover the order pages are visited, how the queue of work is managed, how the crawler stays a good citizen of the sites it touches, and how it handles modern pages that build themselves with JavaScript.

Breadth-first vs depth-first traversal

The web is a graph, and the order you walk it changes what you collect first. Breadth-first crawling visits all the pages one link away from the seeds, then everything two links away, and so on, expanding in widening rings. It is the usual default for general crawls because it reaches a broad, shallow sample of a site quickly and tends to find high-value pages (which are often linked from many places) early. Depth-first crawling instead follows one path as far as it goes before backtracking, diving deep into a single branch before exploring siblings.

In practice breadth-first, implemented with a FIFO queue, dominates large crawls because it gives even coverage and is easy to bound with a maximum depth. Depth-first, backed by a stack, suits cases where you want to fully exhaust one section before moving on, such as crawling a single deeply nested catalog. Many crawlers use a hybrid, prioritizing the queue by a score (link popularity, page depth, freshness) rather than strict breadth or depth, so the most useful pages are fetched first.
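
To make the difference concrete, here is a small sketch that walks the same link graph both ways. The `LINKS` graph is a made-up example, and the only thing that changes between the two modes is which end of the frontier the next page is taken from.

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
    "home": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1"],
    "a1": [], "a2": [], "b1": [],
}


def traverse(seed, links, breadth_first=True):
    frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier:
        # FIFO (popleft) gives breadth-first; LIFO (pop) gives depth-first.
        page = frontier.popleft() if breadth_first else frontier.pop()
        order.append(page)
        children = links.get(page, [])
        # For the stack, push in reverse so the first link on the page is explored first.
        for child in (children if breadth_first else reversed(children)):
            if child not in seen:
                seen.add(child)
                frontier.append(child)
    return order


print(traverse("home", LINKS, breadth_first=True))
print(traverse("home", LINKS, breadth_first=False))
```

Breadth-first visits `home`, then `a` and `b`, then the three leaf pages; depth-first finishes everything under `a` before it touches `b`. Marking pages as seen when they are enqueued is a simplification that works for this tree-shaped example.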

The URL frontier and deduplication

The queue of URLs waiting to be crawled is called the frontier, and managing it well is most of what makes a crawler scale. The frontier decides which URL comes next, enforces ordering and priority, and feeds the fetchers. At any real scale it has to live outside memory (in a database or a distributed queue) because the list of discovered URLs grows far faster than the list you have already visited.

The companion problem is deduplication. The same page is often reachable through many URLs that differ only in tracking parameters, fragments, or redirect chains, so without a dedupe step a crawler downloads the same content over and over and can loop forever. The standard fix is to normalize each URL (lowercase the host, strip default ports, drop fragments and known tracking parameters) and check it against a set of URLs already seen. For very large crawls that set is often a memory-efficient structure such as a Bloom filter, which answers "have I seen this URL?" using a fraction of the memory a full list would need, at the cost of a small false-positive rate: it occasionally reports a new URL as already seen, but never the reverse.
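
The normalize-then-check step might look like the sketch below. The `TRACKING_PARAMS` list is an assumption to adapt per crawl, and the `BloomFilter` class is a deliberately small illustration (the bit-array size and hash count are arbitrary defaults), not a tuned implementation. The normalizer also simplifies: it drops any user info in the URL and does not handle IPv6 host literals.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed list of tracking parameters; extend it for the sites you crawl.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "utm_term",
                   "utm_content", "gclid", "fbclid"}
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url):
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the default for the scheme.
    if parts.port and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k.lower() not in TRACKING_PARAMS]
    path = parts.path or "/"
    # The empty last element drops the fragment.
    return urlunsplit((scheme, host, path, urlencode(kept), ""))


class BloomFilter:
    """Probabilistic seen-set: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive k bit positions from two halves of one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


seen = BloomFilter()
for raw in ["http://Example.com:80/a?utm_source=mail", "http://example.com/a#top"]:
    url = normalize_url(raw)
    if url in seen:
        print("skip duplicate:", url)
    else:
        seen.add(url)
        print("enqueue:", url)
```

Both raw URLs normalize to the same string, so the second one is skipped. A plain Python `set` is the simpler choice until the seen-set no longer fits comfortably in memory.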

Politeness and rate limiting

A crawler that fires requests as fast as it can will overload small servers and get itself throttled or banned. Politeness is the discipline of pacing requests so the crawl does not harm the sites it visits. The core rule is a per-host delay: limit how many requests you send to any single domain per second, and add a short wait between hits on the same host, even while you crawl many other hosts in parallel.

Good politeness combines a few habits. Cap concurrency per domain rather than globally, so no single site gets flooded. Respect any Crawl-delay a site advertises, and back off when you see errors or slow responses, since a struggling server should be hit less, not more. Beyond being courteous, this is practical: gentle, well-identified traffic is far less likely to trip rate limits than an aggressive crawl, so politeness and reliability point the same direction.
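
A per-host delay with backoff can be sketched as a small bookkeeping class. The names here (`HostThrottle`, `wait_time`, `record_request`) are illustrative, and the doubling on errors and the 60-second cap are assumed policy choices rather than standard values. The clock is injectable so the logic can be exercised without real sleeping.

```python
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Tracks the earliest time each host may be contacted again."""

    def __init__(self, default_delay=1.0, clock=time.monotonic):
        self.default_delay = default_delay
        self.clock = clock
        self.next_allowed = {}   # host -> timestamp of the next permitted request
        self.delay = {}          # host -> current delay in seconds

    def wait_time(self, url):
        """Seconds to wait before this URL's host may be hit again."""
        host = urlsplit(url).netloc.lower()
        return max(0.0, self.next_allowed.get(host, 0.0) - self.clock())

    def record_request(self, url, ok=True):
        """Call after each request; failures double the delay for that host."""
        host = urlsplit(url).netloc.lower()
        current = self.delay.get(host, self.default_delay)
        if ok:
            # Ease back toward the default after a successful response.
            current = max(self.default_delay, current / 2)
        else:
            # Back off on errors or timeouts, capped at 60 seconds (assumed cap).
            current = min(current * 2, 60.0)
        self.delay[host] = current
        self.next_allowed[host] = self.clock() + current
```

A fetcher would call `wait_time(url)` before each request, sleep or requeue the URL if the value is positive, and call `record_request` afterward. Because the state is keyed by host, a delay on one site never slows requests to another.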

Respecting robots.txt

Most sites publish a robots.txt file at their root that states which paths crawlers may and may not visit, and which user agents the rules apply to. A well-behaved crawler fetches and parses this file before crawling a host, then skips any disallowed paths. The file can also advertise a Crawl-delay (a widely used but non-standard directive that not every crawler honors) and point to the site's sitemap, which is a ready-made list of URLs the site wants crawled.

Honoring robots.txt is the baseline expectation for automated traffic and the clearest signal of a responsible crawler. Cache the parsed rules per host so you are not re-fetching the file constantly, and refresh them periodically since rules change. Sitemaps are worth using directly: they often expose pages that link-following alone misses, and they hint at how fresh each URL is, which feeds the prioritization decisions above.
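
Python's standard library ships a robots.txt parser, `urllib.robotparser`, which is enough to sketch the fetch, parse, and cache pattern. In the example below `RobotsCache`, the one-hour `ttl`, and `fetch_from_memory` are all assumptions for illustration; a real crawler would download `/robots.txt` over HTTP in place of the in-memory stub. Note that `site_maps()` requires Python 3.8 or later.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class RobotsCache:
    """Caches parsed robots.txt rules per host and refreshes them after `ttl` seconds."""

    def __init__(self, fetch_robots, user_agent, ttl=3600.0, clock=time.monotonic):
        self.fetch_robots = fetch_robots   # callable: host -> robots.txt text
        self.user_agent = user_agent
        self.ttl = ttl
        self.clock = clock
        self._cache = {}                   # host -> (expiry time, parser)

    def _rules(self, url):
        host = urlsplit(url).netloc.lower()
        cached = self._cache.get(host)
        if cached is None or cached[0] <= self.clock():
            parser = RobotFileParser()
            parser.parse(self.fetch_robots(host).splitlines())
            cached = (self.clock() + self.ttl, parser)
            self._cache[host] = cached
        return cached[1]

    def allowed(self, url):
        return self._rules(url).can_fetch(self.user_agent, url)

    def crawl_delay(self, url):
        return self._rules(url).crawl_delay(self.user_agent)

    def sitemaps(self, url):
        return self._rules(url).site_maps() or []


# Hypothetical robots.txt served from memory instead of the network.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
Sitemap: https://example.com/sitemap.xml
"""


def fetch_from_memory(host):
    return ROBOTS_TXT


robots = RobotsCache(fetch_from_memory, user_agent="my-crawler")
print(robots.allowed("https://example.com/products/1"))
print(robots.allowed("https://example.com/private/report"))
print(robots.crawl_delay("https://example.com/"))
print(robots.sitemaps("https://example.com/"))
```

The value returned by `crawl_delay` can feed directly into whatever per-host delay the crawler already enforces, and the sitemap URLs can be pushed onto the frontier as extra seeds.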

Handling JavaScript rendering

A growing share of the web builds its content in the browser. The HTML a plain HTTP fetch returns is nearly empty until client-side JavaScript runs and injects the real content. A crawler that only reads the initial response sees almost nothing on these pages. To crawl them you need to render the page the way a browser would, which means running a headless browser, driven by a tool such as Puppeteer, Playwright, or Selenium, that executes the scripts and hands back the fully built DOM.

Rendering is powerful but expensive: a real browser uses far more CPU and memory than an HTTP request, so you do not want to render every page. The usual approach is to detect which targets actually need it and render only those, keeping the cheap fetch path for static pages. For a deeper look at this split, see how to crawl JavaScript websites, which covers when rendering is required and how to keep it from dominating your crawl budget.
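
One way to decide which pages need a browser is a cheap check on the initial HTML: if the response carries scripts but almost no visible text, it is probably a client-rendered shell. The sketch below is a heuristic under that assumption, not a reliable detector; the `needs_rendering` name and the 200-character threshold are made up for illustration and would need tuning against real targets.

```python
from html.parser import HTMLParser


class TextProbe(HTMLParser):
    """Counts visible text characters and <script> tags in a document."""

    def __init__(self):
        super().__init__()
        self.text_chars = 0
        self.script_tags = 0
        self._skip_depth = 0   # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
            if tag == "script":
                self.script_tags += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore script and style bodies; count everything else as visible text.
        if not self._skip_depth:
            self.text_chars += len(data.strip())


def needs_rendering(html, min_text_chars=200):
    """Heuristic: scripts present but very little text in the raw response."""
    probe = TextProbe()
    probe.feed(html)
    return probe.script_tags > 0 and probe.text_chars < min_text_chars


SPA_SHELL = ('<html><head><title>App</title><script src="/bundle.js"></script>'
             '</head><body><div id="root"></div></body></html>')
STATIC_PAGE = ("<html><body><h1>Catalog</h1><p>"
               + "Product description. " * 20 + "</p></body></html>")

print(needs_rendering(SPA_SHELL))
print(needs_rendering(STATIC_PAGE))
```

Pages flagged this way would be routed to the headless-browser path, while everything else stays on the plain HTTP fetch.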

Distributed crawling

One machine can only fetch so many pages per second. Past a certain scale the crawl has to spread across many workers, and that is distributed crawling. The frontier becomes a shared queue, multiple fetchers pull URLs from it in parallel, and the dedupe set is shared so two workers do not crawl the same page. Done right, throughput scales close to linearly with the number of workers.

The hard parts are coordination and politeness. Work has to be partitioned so that all requests to a given host route through the same worker or rate budget, otherwise ten workers each "politely" hitting one site combine into an impolite flood. State (the frontier, the seen-set, the results) has to be shared and consistent across machines. This coordination overhead is exactly why many teams hand large crawls to a managed service rather than operate a distributed cluster themselves.
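
A common way to route every request for a host through the same worker is to hash the host name. The sketch below assumes a fixed number of workers; `worker_for` and `partition` are illustrative names. It uses `hashlib` rather than Python's built-in `hash()`, because the built-in is randomized per process for strings and would send the same host to different workers on different machines.

```python
import hashlib
from urllib.parse import urlsplit


def worker_for(url, num_workers):
    """Maps a URL to a worker index based only on its host."""
    host = urlsplit(url).netloc.lower()
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


def partition(urls, num_workers):
    """Groups URLs into one bucket per worker, keeping each host together."""
    buckets = [[] for _ in range(num_workers)]
    for url in urls:
        buckets[worker_for(url, num_workers)].append(url)
    return buckets


urls = [
    "https://shop.example/p/1",
    "https://shop.example/p/2",
    "https://news.example/today",
    "https://SHOP.example/p/3",
]
for index, bucket in enumerate(partition(urls, num_workers=4)):
    print(index, bucket)
```

Simple modulo hashing reshuffles most hosts whenever the worker count changes; consistent hashing is the usual refinement when workers are added or removed often.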

Crawlbase Crawling API

Distributed crawling, IP rotation, rendering, and retries are the parts of a crawler that are hard to build and harder to keep running. The Crawlbase Crawling API takes a URL and handles rotating IPs, JavaScript rendering, and automatic retries on blocks, returning clean HTML so you keep your own traversal and parsing logic. For large jobs, the asynchronous Crawler lets you push URLs and receive results via callback, so you crawl at scale without managing a worker fleet or proxy pool yourself.

Incremental and focused crawling

Crawling once is rarely the whole job. The web changes, so a crawler that has already indexed a site needs to revisit it without re-downloading everything. Incremental crawling tracks what changed and re-fetches selectively, using signals like a page's last-modified date, its sitemap entry, or how often it has changed before, so frequently updated pages are revisited often and static ones are left alone. This keeps a large index fresh without paying the full cost of a complete recrawl each time.
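
Selective re-fetching can be sketched with three small helpers. `conditional_headers` builds the standard HTTP validators (`If-None-Match` and `If-Modified-Since`), to which a server that supports them replies `304 Not Modified` when nothing changed. `content_changed` falls back to comparing a hash of the body, and `next_interval` adapts how soon a page is revisited. The halve-or-double policy and the one-hour and 30-day bounds are assumptions chosen for illustration.

```python
import hashlib


def conditional_headers(etag=None, last_modified=None):
    """Request headers that let the server answer 304 if the page is unchanged."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def content_changed(previous_hash, body):
    """Compares a SHA-256 of the body with the hash stored from the last visit."""
    new_hash = hashlib.sha256(body).hexdigest()
    return new_hash != previous_hash, new_hash


def next_interval(current, changed, min_interval=3600.0, max_interval=30 * 86400.0):
    """Halve the revisit interval when the page changed, double it when it did not."""
    proposed = current / 2 if changed else current * 2
    return min(max_interval, max(min_interval, proposed))


# A page currently revisited daily (86400 seconds).
interval = 86400.0
changed, stored_hash = content_changed(None, b"<html>v1</html>")
interval = next_interval(interval, changed)
print(changed, interval)

changed, stored_hash = content_changed(stored_hash, b"<html>v1</html>")
interval = next_interval(interval, changed)
print(changed, interval)
```

Pages that keep changing drift toward the minimum interval and are revisited often, while static pages drift toward the maximum and are mostly left alone.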

Focused crawling narrows the other axis: instead of trying to cover everything, it pursues only pages relevant to a topic or pattern. The crawler scores each discovered link for how likely it is to lead toward the target content and prioritizes the promising ones, pruning branches that drift off-topic. A vertical price monitor, for example, follows product and category links and ignores the rest. Both techniques are about spending a finite crawl budget where it matters rather than crawling indiscriminately.
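
The score-and-prune idea can be sketched with a priority queue. The keyword list and the `score_link` function below are stand-ins for whatever relevance signal a real focused crawler uses (a trained classifier, URL patterns, and so on); they are assumptions that fit the price-monitor example, not a recommended scoring scheme.

```python
import heapq
import itertools

# Assumed relevance keywords for a hypothetical price monitor.
KEYWORDS = ("product", "category", "price")


def score_link(url, anchor_text):
    """Counts how many keywords appear in the URL or its anchor text."""
    haystack = f"{url} {anchor_text}".lower()
    return sum(1 for word in KEYWORDS if word in haystack)


class FocusedFrontier:
    """Priority frontier that drops off-topic links and pops the best one first."""

    def __init__(self, min_score=1):
        self.min_score = min_score
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps insertion order stable
        self._seen = set()

    def push(self, url, anchor_text=""):
        score = score_link(url, anchor_text)
        if score < self.min_score or url in self._seen:
            return False                   # pruned: off-topic or already queued
        self._seen.add(url)
        # heapq is a min-heap, so negate the score to pop the highest score first.
        heapq.heappush(self._heap, (-score, next(self._counter), url))
        return True

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


frontier = FocusedFrontier()
frontier.push("https://shop.example/about", "About us")
frontier.push("https://shop.example/category/shoes", "Shoes")
frontier.push("https://shop.example/product/42", "Product price details")
while frontier:
    print(frontier.pop())
```

The "About us" link scores zero and is never queued, and the product page is fetched before the category page because it matches more keywords.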

Figure: Breadth-first vs depth-first traversal of the same link tree. Breadth-first works level by level (wide, shallow coverage), while depth-first follows one branch to the bottom before backtracking (deep, focused crawls). In the diagram, the number on each node shows the visit order.

Web crawling frameworks

Few teams implement the frontier, dedupe, politeness, and rendering machinery from scratch. Frameworks package those techniques into reusable tooling so you configure a crawl rather than build the plumbing. The picks below are the established, widely used options, ordered roughly from the lightweight scripting end toward the heavy, search-scale systems, plus the managed approach for teams that would rather not run crawl infrastructure at all.

Scrapy

Scrapy is the most popular crawling framework in the Python ecosystem and the usual starting point for custom crawlers. It gives you the whole pipeline: an asynchronous engine that fetches many pages concurrently, a request scheduler that manages the frontier, automatic link following, retries, and built-in export of structured data to JSON, CSV, or XML. You write spiders that define where to start and how to parse each page, and Scrapy handles the concurrency and queueing underneath. It is the right choice for recurring crawls of thousands to millions of pages where you want structure and control. Vanilla Scrapy does not execute JavaScript, though it integrates with browser tools when a target needs rendering.

Apache Nutch

Apache Nutch is a mature, open-source crawler built for web-scale crawling and tight integration with the search world. It runs on top of Apache Hadoop, so its crawl is distributed across a cluster by design, and it plugs into indexing back ends like Apache Solr or Elasticsearch. Nutch is built around the classic search-engine crawl loop (generate a fetch list, fetch, parse, update the crawl database) and is extensible through a plugin system for protocols, parsers, and filters. It is heavier to operate than Scrapy and aimed at teams crawling very large portions of the web who need a battle-tested, Hadoop-backed pipeline.

Heritrix

Heritrix is the web crawler built by the Internet Archive and used to capture pages for the Wayback Machine. It is designed for thorough, archival-quality crawls and writes its output in the standard WARC format, which preserves full request and response data for long-term archiving. Heritrix is highly configurable around scope rules, politeness, and what to capture, and it respects robots.txt rigorously by default. Reach for it when faithful, complete preservation of pages is the goal, such as building a web archive, rather than extracting a few fields for analysis.

StormCrawler

StormCrawler is a collection of resources for building low-latency, scalable web crawlers on Apache Storm. Because Storm is a stream-processing system, StormCrawler crawls continuously rather than in batches, which suits use cases that need fresh data on an ongoing basis, such as news and monitoring crawls. It is modular and Java-based, letting you assemble a crawl topology from components for fetching, parsing, and indexing. It sits in similar territory to Nutch but favors continuous, real-time crawling over Nutch's batch-oriented model.

Managed crawling with Crawlbase

The frameworks above give you the crawl logic but leave the network problems to you: rotating IPs, rendering JavaScript, solving or avoiding CAPTCHAs, and retrying blocked requests. A managed crawling service absorbs that layer. With Crawlbase you send a URL and get back rendered HTML, with proxy rotation and anti-block handling done server-side, and the asynchronous Crawler queues large batches and delivers results by callback. It does not replace your crawl strategy: seed selection, traversal, and parsing remain yours. What it removes is the infrastructure that is hardest to keep running at scale.

Frameworks at a glance

The table maps each framework to what it is best at and the kind of project it fits, so you can read your own job onto it rather than defaulting to the one you used last.

Framework    | Best for                                          | Type
-------------|---------------------------------------------------|----------------------------
Scrapy       | Custom crawls, thousands to millions of pages     | Python framework
Apache Nutch | Web-scale, Hadoop-backed search crawls            | Distributed crawler
Heritrix     | Archival, full-fidelity page capture (WARC)       | Archival crawler
StormCrawler | Continuous, low-latency monitoring crawls         | Streaming crawler
Crawlbase    | Managed crawling without anti-block infrastructure | Crawling API / async Crawler

No single row is the answer to every crawl. Scrapy covers most custom work, Nutch and StormCrawler handle web-scale and continuous crawls, Heritrix specializes in archiving, and a managed API takes over the rotation and rendering that none of the open-source frameworks solve out of the box.

Crawling responsibly

Whatever technique or framework you use, crawl with restraint. Respect each site's terms of service and its robots.txt, focus on publicly available data rather than anything behind a login you are not entitled to, and keep request rates reasonable so you do not strain the servers you depend on. Identify your crawler honestly through its user agent and provide a way to contact you. Responsible pacing is also self-interested: gentle, well-behaved traffic gets blocked far less often than an aggressive crawl, so good manners and reliable crawling tend to point the same direction.

Recap

Key takeaways

  • Traversal order matters. Breadth-first gives even, bounded coverage and is the usual default; depth-first dives into one branch, and many crawlers prioritize the frontier by a score instead.
  • The frontier and dedupe are the core. A well-managed URL queue plus URL normalization and a seen-set (often a Bloom filter at scale) keep a crawl from looping or re-downloading pages.
  • Politeness keeps you unblocked. Per-host rate limits, capped concurrency, and respecting robots.txt protect the sites you crawl and the reliability of your own crawl.
  • JavaScript and scale add cost. Render only the pages that need a browser, and distribute across workers while routing each host through one rate budget to stay polite.
  • Frameworks package the machinery. Scrapy fits most custom crawls, Nutch and StormCrawler handle web-scale and continuous jobs, Heritrix archives, and a managed API absorbs rotation and rendering.

Frequently Asked Questions (FAQs)

What is the difference between web crawling and web scraping?

Crawling is the discovery and traversal step: starting from seed URLs, following links, and fetching pages to find more pages. Scraping is the extraction step: pulling specific fields out of the markup once you have a page. Most projects do both, crawling to reach the pages and scraping to extract data from them, but they are distinct stages with different concerns.

Should a crawler use breadth-first or depth-first traversal?

Breadth-first is the common default for general crawls because it gives broad, even coverage quickly and is easy to bound with a FIFO queue. Depth-first suits cases where you want to fully exhaust one deep section before moving on. Many production crawlers use neither strictly, instead prioritizing the frontier by a score such as link popularity, depth, or freshness so the most useful pages are fetched first.

What is a URL frontier?

The frontier is the queue of URLs a crawler has discovered but not yet visited. It decides which URL comes next, enforces ordering and priority, and feeds the fetchers. At scale it usually lives in a database or distributed queue rather than memory, because the list of discovered URLs grows quickly. Paired with deduplication, it is what keeps a crawl orderly and prevents endless loops.

How do crawlers avoid downloading the same page twice?

They normalize each URL (lowercasing the host, stripping default ports, dropping fragments and tracking parameters) and check it against a set of URLs already seen before queuing it. For very large crawls that set is often a memory-efficient structure such as a Bloom filter, which can answer whether a URL has been seen using a small fraction of the memory a full list would need.

Do web crawlers have to respect robots.txt?

Honoring robots.txt is the baseline expectation for well-behaved automated traffic and the clearest mark of a responsible crawler. A good crawler fetches and parses the file before crawling a host, skips disallowed paths, respects any advertised crawl delay, and uses the sitemap it points to. Combined with reasonable rate limits and honest identification, that is the core of crawling responsibly.

Which web crawling framework should I use?

It depends on the job. Scrapy fits most custom crawls of thousands to millions of pages. Apache Nutch and StormCrawler target web-scale and continuous crawling. Heritrix is built for archival, full-fidelity capture. If the hard part is staying unblocked rather than the crawl logic, a managed crawling API handles rotation, rendering, and retries so you can focus on traversal and parsing.
