A scraper pulls fields off one page you already have the URL for. A crawler is the thing that finds those URLs in the first place: it starts from a seed page, follows the links it discovers, and keeps going until it has covered the part of a site you care about. If you want to map a documentation tree, collect every product page under a category, or feed a search index, you need the second kind of program.
This guide shows you how to build a web crawler in Java from the ground up. You will use the modern HttpClient (Java 11+) to fetch pages, Jsoup to parse them and pull out links, and a frontier queue with a visited set to drive a breadth-first crawl. We add the controls that separate a toy from something you can actually run: a depth limit, a same-domain restriction, and politeness delays. Then we cover the two problems every real crawl runs into, JavaScript rendering and IP blocks, and how to handle them without rewriting the crawler.
Crawler vs. scraper: what is actually different
The two words get used interchangeably, but the programs have different shapes. A scraper takes a known URL, fetches it once, and extracts data. A crawler takes a seed URL, fetches it, extracts the links, and adds the new ones to a queue of pages still to visit. It repeats that loop, so the set of pages it touches grows as it runs. Extraction is still part of the job, but the defining feature is link discovery and traversal.
That difference drives every design decision below. Because a crawler discovers its own work, it needs a queue (the frontier) to hold pending URLs, a set to remember what it has already seen so it does not loop forever, and bounds so it does not wander off into the entire internet. Get those three right and the rest is fetching and parsing.
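To make those three pieces concrete before any networking is involved, here is a minimal sketch that runs the same loop over a hypothetical in-memory "site". The LINKS map and the crawlOrder name are assumptions made up for this illustration; the real crawler replaces the map lookup with a fetch and a parse.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

class FrontierSketch {
    // Hypothetical site: each page maps to the pages it links to.
    static final Map<String, List<String>> LINKS = Map.of(
            "/", List.of("/a", "/b"),
            "/a", List.of("/", "/c"),
            "/b", List.of("/a"),
            "/c", List.of("/d"),
            "/d", List.of());

    // Returns pages in the order a breadth-first crawl visits them.
    static List<String> crawlOrder(String seed, int maxDepth) {
        Queue<Map.Entry<String, Integer>> frontier = new ArrayDeque<>(); // pending (url, depth) pairs
        Set<String> visited = new HashSet<>();                           // everything already queued
        List<String> order = new ArrayList<>();

        frontier.add(Map.entry(seed, 0));
        visited.add(seed);

        while (!frontier.isEmpty()) {
            Map.Entry<String, Integer> next = frontier.poll();
            order.add(next.getKey());
            if (next.getValue() >= maxDepth) {
                continue; // the bound: do not follow links any deeper
            }
            for (String link : LINKS.getOrDefault(next.getKey(), List.of())) {
                if (visited.add(link)) { // false for pages seen before, which breaks cycles
                    frontier.add(Map.entry(link, next.getValue() + 1));
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        System.out.println(crawlOrder("/", 2)); // prints [/, /a, /b, /c]
    }
}
```

The link from /a back to / is ignored because / is already in the visited set, and /d is never reached because it sits three hops from the seed.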
What you will build
A single runnable Java class that takes a seed URL and crawls outward within one domain. It does breadth-first traversal, stops at a configurable depth, skips pages it has already visited, and pauses politely between requests. For each page it prints the URL and title, which you can swap for any extraction you need.
- Frontier queue: a FIFO queue of URLs waiting to be fetched, paired with the depth at which each was found.
- Visited set: a Set of URLs already processed, so the crawler never fetches the same page twice.
- Depth limit: a hard cap on how many link-hops from the seed the crawler will follow.
- Same-domain guard: a check that keeps the crawl on the host you started from.
- Politeness delay: a pause between requests so you do not hammer the target.
Prerequisites
You need a few things in place before writing any code, and none of them take long.
Java 11 or later. The crawler uses the java.net.http.HttpClient API introduced in Java 11, so confirm your version with java -version. Any JDK 11+ distribution works for Steps 1 to 3; the complete class in Step 4 also uses a record, which needs Java 16 or later.
Jsoup. Jsoup is the standard Java library for parsing HTML and selecting elements with CSS selectors. With Maven, add the dependency below; with Gradle or a plain classpath, pull the equivalent jar.
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.17.2</version>
    </dependency>
Basic Java. You should be comfortable compiling and running a class and reading a stack trace. Nothing here uses frameworks beyond the JDK and Jsoup.
Step 1: Fetch a page with HttpClient
Start with the smallest useful piece: a method that takes a URL and returns the HTML as a string. The HttpClient API uses a builder pattern, supports HTTP/2 by default, and exposes a synchronous send call that blocks until the response arrives. A request timeout keeps a slow or dead host from hanging the whole crawl.
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.time.Duration;

    // One shared client for the whole crawl.
    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(10))
            .followRedirects(HttpClient.Redirect.NORMAL)
            .build();

    // Returns the page body, or null when the server does not answer 200.
    static String fetch(String url) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(url))
                .timeout(Duration.ofSeconds(20))
                .header("User-Agent", "MyJavaCrawler/1.0")
                .GET()
                .build();

        HttpResponse<String> response =
                CLIENT.send(request, HttpResponse.BodyHandlers.ofString());

        if (response.statusCode() == 200) {
            return response.body();
        }
        System.out.println("Skipping " + url + " -> " + response.statusCode());
        return null;
    }
The single static client is intentional: HttpClient is immutable and thread-safe, so you create one and reuse it for every request rather than building one per fetch. Setting a User-Agent is good manners and often required, and checking the status code before returning the body keeps failures visible instead of feeding empty HTML into the parser.
Step 2: Parse links with Jsoup
Fetching gives you a string of HTML. To crawl, you need the links inside it. Jsoup parses the HTML into a document and lets you select elements with CSS selectors. The a[href] selector grabs every anchor with an href, and Jsoup's absUrl resolves relative links like /about into absolute URLs against the page they were found on, which is exactly what the frontier needs.
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    import java.util.ArrayList;
    import java.util.List;

    static List<String> extractLinks(String html, String baseUrl) {
        List<String> links = new ArrayList<>();
        // baseUrl lets Jsoup resolve relative hrefs against the page they came from.
        Document doc = Jsoup.parse(html, baseUrl);
        System.out.println("Title: " + doc.title());

        Elements anchors = doc.select("a[href]");
        for (Element a : anchors) {
            String absolute = a.absUrl("href");
            // absUrl returns an empty string when the href cannot be resolved.
            if (!absolute.isBlank()) {
                links.add(absolute);
            }
        }
        return links;
    }
Passing baseUrl to Jsoup.parse is what makes absUrl work; without it, relative hrefs resolve to nothing. The same document is where you would do your real extraction, with selectors like doc.select("h1") or doc.select(".price"). Here we print the title so you can watch the crawl move from page to page. If you want a deeper look at selecting and extracting fields in Java, the guide on web scraping with Java covers the parsing side in detail.
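If you want to see what "resolving against the page" means without Jsoup on the classpath, the JDK's own URI.resolve applies the same kind of rule. This is only an illustration of relative-link resolution, not a description of how Jsoup implements absUrl; the ResolveSketch class and the example.com URLs are made up for the demo.

```java
import java.net.URI;

class ResolveSketch {
    // Resolves an href against the URL of the page it was found on.
    // Note: URI.create throws IllegalArgumentException on malformed input (e.g. spaces).
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "https://example.com/docs/intro.html";
        System.out.println(resolve(page, "/about"));                  // https://example.com/about
        System.out.println(resolve(page, "guide.html"));              // https://example.com/docs/guide.html
        System.out.println(resolve(page, "https://other.example/x")); // https://other.example/x
    }
}
```

A root-relative href replaces the whole path, a bare filename replaces only the last path segment, and an already absolute URL passes through unchanged.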
Step 3: Drive the crawl with a frontier and a visited set
Now the core. Breadth-first crawling means you visit pages in waves: the seed first, then everything one hop away, then two hops, and so on. A FIFO queue gives you that ordering for free. Each queue entry pairs a URL with the depth at which it was discovered, so you know when to stop following its children. A HashSet of visited URLs prevents cycles, and a same-domain check keeps the crawl bounded to one host.
    import java.net.URI;
    import java.util.ArrayDeque;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Queue;
    import java.util.Set;

    // A URL paired with the depth at which it was discovered.
    static class Task {
        final String url;
        final int depth;

        Task(String url, int depth) {
            this.url = url;
            this.depth = depth;
        }
    }

    static boolean sameHost(String url, String host) {
        try {
            return host.equalsIgnoreCase(URI.create(url).getHost());
        } catch (Exception e) {
            return false; // malformed URL: treat it as off-site
        }
    }

    static void crawl(String seed, int maxDepth, long delayMs) throws Exception {
        String host = URI.create(seed).getHost();
        Queue<Task> frontier = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();
        frontier.add(new Task(seed, 0));
        visited.add(seed);

        while (!frontier.isEmpty()) {
            Task task = frontier.poll();
            System.out.println("[" + task.depth + "] " + task.url);

            String html = fetch(task.url);
            if (html != null) {
                // Parse every fetched page (this also prints its title)...
                List<String> links = extractLinks(html, task.url);
                // ...but only enqueue children while below the depth limit.
                if (task.depth < maxDepth) {
                    for (String link : links) {
                        if (sameHost(link, host) && visited.add(link)) {
                            frontier.add(new Task(link, task.depth + 1));
                        }
                    }
                }
            }
            // Pause after every request, including failed fetches and pages at maxDepth.
            Thread.sleep(delayMs);
        }
    }
A few details carry their weight here. visited.add(link) returns false if the URL was already present, so a single call both checks membership and records the URL, which is why the guard reads sameHost(link, host) && visited.add(link). The depth check happens after the fetch but before enqueuing children, so pages at maxDepth still get fetched and parsed; they just do not contribute new links. And Thread.sleep(delayMs) is the politeness pause: it is the difference between a crawler a site tolerates and one it blocks.
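The check-and-record idiom is easy to try in isolation. The sketch below is a stand-alone illustration; the VisitedSketch class and the newLinks helper are hypothetical names, not part of the crawler.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class VisitedSketch {
    // Returns only the links not seen before, recording each one as it goes.
    static List<String> newLinks(Set<String> visited, List<String> discovered) {
        List<String> fresh = new ArrayList<>();
        for (String link : discovered) {
            // add() returns true only on first insertion, so this one call is both
            // the "have we seen it?" check and the "remember it" step.
            if (visited.add(link)) {
                fresh.add(link);
            }
        }
        return fresh;
    }

    public static void main(String[] args) {
        Set<String> visited = new HashSet<>(List.of("/"));
        System.out.println(newLinks(visited, List.of("/a", "/", "/b", "/a"))); // prints [/a, /b]
    }
}
```

Because the check and the insert are a single call, there is no window in which a link passes the test but has not yet been recorded.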
Step 4: A complete, runnable crawler
Wire the pieces into one class with a main method. Compile it with Jsoup on the classpath and run it against a seed URL, a depth, and a delay.
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.time.Duration;
    import java.util.*;

    public class WebCrawler {

        private static final HttpClient CLIENT = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(10))
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();

        // A URL paired with the depth at which it was discovered.
        record Task(String url, int depth) {}

        // Returns the page body, or null when the server does not answer 200.
        static String fetch(String url) throws Exception {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(url))
                    .timeout(Duration.ofSeconds(20))
                    .header("User-Agent", "MyJavaCrawler/1.0")
                    .GET()
                    .build();
            HttpResponse<String> response =
                    CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
            return response.statusCode() == 200 ? response.body() : null;
        }

        static List<String> extractLinks(String html, String baseUrl) {
            List<String> links = new ArrayList<>();
            Document doc = Jsoup.parse(html, baseUrl);
            System.out.println(" Title: " + doc.title());
            Elements anchors = doc.select("a[href]");
            for (Element a : anchors) {
                String absolute = a.absUrl("href");
                if (!absolute.isBlank()) links.add(absolute);
            }
            return links;
        }

        static boolean sameHost(String url, String host) {
            try {
                return host.equalsIgnoreCase(URI.create(url).getHost());
            } catch (Exception e) {
                return false; // malformed URL: treat it as off-site
            }
        }

        static void crawl(String seed, int maxDepth, long delayMs) throws Exception {
            String host = URI.create(seed).getHost();
            Queue<Task> frontier = new ArrayDeque<>();
            Set<String> visited = new HashSet<>();
            frontier.add(new Task(seed, 0));
            visited.add(seed);

            while (!frontier.isEmpty()) {
                Task task = frontier.poll();
                System.out.println("[" + task.depth() + "] " + task.url());

                String html = fetch(task.url());
                if (html != null) {
                    // Parse every fetched page; enqueue children only below the depth limit.
                    List<String> links = extractLinks(html, task.url());
                    if (task.depth() < maxDepth) {
                        for (String link : links) {
                            if (sameHost(link, host) && visited.add(link)) {
                                frontier.add(new Task(link, task.depth() + 1));
                            }
                        }
                    }
                }
                // Pause after every request, including failed fetches and pages at maxDepth.
                Thread.sleep(delayMs);
            }
        }

        public static void main(String[] args) throws Exception {
            crawl("https://books.toscrape.com/", 2, 1000);
        }
    }
That is a working breadth-first crawler in well under 100 lines. It uses a Java record for the task pair (on a JDK older than 16, substitute the Task class from Step 3 and read its fields directly), fetches with a reused client, parses with Jsoup, stays on one host, dedupes with a set, stops at depth 2, and waits one second between requests. Point it at a sandbox like books.toscrape.com first, then swap in your own seed and extraction.
Before crawling a site you do not own, read its robots.txt and terms of service, and treat them as the boundary. Honor crawl-delay directives, skip disallowed paths, and keep your request rate low enough that you are not straining the server. A polite crawler that backs off is one that keeps working; an aggressive one gets blocked and can cause real harm.
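As a rough idea of what "skip disallowed paths" can look like in code, here is a deliberately simplified sketch. It only reads Disallow lines in groups addressed to every agent (User-agent: *) and treats each value as a plain path prefix; Allow rules, wildcards, and crawl-delay are ignored, so treat it as a starting point rather than a complete robots.txt parser. The RobotsSketch class and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class RobotsSketch {
    // Collects Disallow path prefixes from groups that apply to all agents.
    static List<String> disallowedForAll(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        boolean inStarGroup = false;  // are we inside a "User-agent: *" group?
        boolean lastWasAgent = false; // consecutive User-agent lines share one group

        for (String raw : robotsTxt.split("\\R")) {
            String line = raw.replaceFirst("#.*$", "").trim(); // drop comments
            int colon = line.indexOf(':');
            if (colon < 0) continue;

            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();

            if (field.equals("user-agent")) {
                if (!lastWasAgent) inStarGroup = false; // a new group starts here
                if (value.equals("*")) inStarGroup = true;
                lastWasAgent = true;
            } else {
                lastWasAgent = false;
                // An empty Disallow value means "nothing is disallowed", so skip it.
                if (field.equals("disallow") && inStarGroup && !value.isEmpty()) {
                    prefixes.add(value);
                }
            }
        }
        return prefixes;
    }

    // True when the path does not start with any disallowed prefix.
    static boolean isAllowed(String path, List<String> disallowed) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nDisallow: /private/\nDisallow: /tmp # scratch\n\n"
                + "User-agent: OtherBot\nDisallow: /\n";
        List<String> rules = disallowedForAll(robots);
        System.out.println(rules);                                       // prints [/private/, /tmp]
        System.out.println(isAllowed("/catalogue/index.html", rules));   // prints true
        System.out.println(isAllowed("/private/report.html", rules));    // prints false
    }
}
```

In the crawler, you would fetch /robots.txt once per host before the loop starts and call isAllowed on each link's path before adding it to the frontier.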
Where a hand-rolled crawler hits a wall
The crawler above is correct, and on a static, well-behaved site it will run fine. Two things break it on the modern web, and both are common enough that you will meet them quickly.
JavaScript rendering. Your fetch returns the raw HTML the server sends. Many sites build their content in the browser with React, Angular, or Vue, so that raw HTML is a near-empty shell. Jsoup parses exactly what it is given and cannot run JavaScript, so on those pages a[href] finds few or no links and the crawl dies at the seed. You would need a headless browser to render the page first. Crawling sites that depend on client-side rendering is its own topic, covered in the guide on how to crawl JavaScript-heavy websites.
IP blocks. A crawler makes many requests from one address in a short window, which is exactly the pattern anti-bot systems look for. Datacenter IPs get challenged, rate-limited, or banned, and once your address is flagged the crawl stalls regardless of how polite your delay is. Spreading requests across a pool of residential IPs avoids the single-address signature, but building and maintaining that pool is most of the work. The broader playbook lives in how to scrape websites without getting blocked.
The scaling turn: render and rotate server-side
You can solve both problems without complicating the crawler. Instead of fetching the target URL directly, route the fetch through an API endpoint that renders the page in a real browser and serves it from a rotating residential IP, then hands you finished HTML. Your crawler logic (the frontier, the visited set, the depth limit, the same-domain guard) stays exactly the same. Only the fetch method changes.
The Crawling API is a plain HTTP GET: you pass your token and the target URL as query parameters, add a flag to request JavaScript rendering, and read the response body. Because it is just a GET, it drops straight into the HttpClient code you already wrote.
    import java.net.URLEncoder;
    import java.nio.charset.StandardCharsets;

    private static final String TOKEN = "YOUR_CRAWLBASE_JS_TOKEN";

    static String fetch(String url) throws Exception {
        // Percent-encode the target so its own query string cannot leak into the API's parameters.
        String target = URLEncoder.encode(url, StandardCharsets.UTF_8);
        String endpoint = "https://api.crawlbase.com/?token=" + TOKEN
                + "&page_wait=3000&url=" + target;

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(endpoint))
                .timeout(Duration.ofSeconds(90)) // rendering takes longer than a raw fetch
                .GET()
                .build();

        HttpResponse<String> response =
                CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
        return response.statusCode() == 200 ? response.body() : null;
    }
Two changes carry the load. The target URL is encoded with URLEncoder before going into the query string, because Crawlbase requires the inner URL to be encoded so its own parameters parse cleanly. And page_wait=3000 tells the API to wait three seconds after load for late-rendering content before capturing the HTML, which is what turns a JavaScript shell into the fully built page your parser needs. The JS token enables rendering; the rotation happens server-side, so the address the target sees is a real residential IP, not your crawler's. A longer client timeout accounts for the render step taking longer than a raw fetch.
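To see why the encoding step matters, it helps to look at what URLEncoder actually produces. The sketch below uses a placeholder API base (api.example.test) and a dummy token, both made up for the illustration; only the encoding behavior is the point.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class EndpointSketch {
    // Builds "<apiBase>?token=...&url=<encoded target>".
    static String buildEndpoint(String apiBase, String token, String targetUrl) {
        String encoded = URLEncoder.encode(targetUrl, StandardCharsets.UTF_8);
        return apiBase + "?token=" + token + "&url=" + encoded;
    }

    public static void main(String[] args) {
        // The target has its own query string with an '&' in it.
        String target = "https://example.com/a?b=1&c=2";
        System.out.println(buildEndpoint("https://api.example.test/", "T", target));
        // prints https://api.example.test/?token=T&url=https%3A%2F%2Fexample.com%2Fa%3Fb%3D1%26c%3D2
    }
}
```

Without the encoding, the trailing &c=2 would be read as a parameter of the outer request, and the target URL would arrive truncated at ?b=1.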
Keep your Java crawler simple and let rendering and IP rotation happen server-side. The Crawling API takes a token and a target URL over a plain GET, runs the page in a real browser, rotates through residential IPs, and returns finished HTML, so your frontier, visited set, and parser never change. Wire it into the fetch method and crawl JavaScript-heavy sites without running a headless fleet or a proxy pool yourself.
Hardening the crawler for real runs
A handful of additions take the demo toward production. None of them change the core loop; they make it survive contact with messy sites.
- Normalize URLs before deduping. Strip fragments (#section) and trailing slashes so /page and /page#top count as one URL. Without this your visited set bloats and you re-fetch the same content.
- Filter non-HTML links. Skip hrefs ending in .pdf, .jpg, .zip, and similar before enqueuing, so the crawler does not try to parse binaries as HTML.
- Cap the total page count. Depth bounds breadth-first reach, but a wide site can still queue thousands of pages. A hard ceiling on visited size is a simple safety valve.
- Handle errors per page. Wrap each fetch-and-parse in a try/catch so one bad URL logs and moves on instead of killing the whole crawl.
- Persist the frontier. For long crawls, back the queue and visited set with a file or database so a crash does not lose progress and you can resume.
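The first two items can be sketched with plain string handling. This is one possible approach under simple assumptions, not the only way to do it: the extension list is a hypothetical sample, and stripping trailing slashes assumes the site serves the same content for /page and /page/.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Set;

class UrlHygieneSketch {
    // Hypothetical sample of extensions to skip; extend it for your target site.
    static final Set<String> SKIP_EXTENSIONS =
            Set.of(".pdf", ".jpg", ".jpeg", ".png", ".gif", ".zip");

    // Drops the fragment, then a trailing slash unless it is the root path or a query follows.
    static String normalize(String url) {
        int hash = url.indexOf('#');
        String s = hash >= 0 ? url.substring(0, hash) : url;

        boolean hasQuery = s.indexOf('?') >= 0;
        int schemeEnd = s.indexOf("://");
        int firstPathSlash = s.indexOf('/', schemeEnd < 0 ? 0 : schemeEnd + 3);
        boolean isRoot = firstPathSlash == s.length() - 1;

        if (!hasQuery && s.endsWith("/") && firstPathSlash >= 0 && !isRoot) {
            s = s.substring(0, s.length() - 1);
        }
        return s;
    }

    // False for URLs whose path ends in a known binary extension.
    static boolean looksLikeHtml(String url) {
        String path;
        try {
            path = URI.create(url).getPath();
        } catch (IllegalArgumentException e) {
            return false; // malformed URL
        }
        if (path == null) return false; // opaque URI such as mailto:
        String lower = path.toLowerCase(Locale.ROOT);
        for (String ext : SKIP_EXTENSIONS) {
            if (lower.endsWith(ext)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(normalize("https://example.com/page#top"));              // https://example.com/page
        System.out.println(normalize("https://example.com/page/"));                 // https://example.com/page
        System.out.println(normalize("https://example.com/"));                      // https://example.com/
        System.out.println(looksLikeHtml("https://example.com/files/report.PDF"));  // false
    }
}
```

In the crawl loop, you would call normalize on each link before visited.add and check looksLikeHtml before enqueuing. The extension check only looks at the path, so a download served from a URL like /download?file=a.pdf still gets through; checking the response's Content-Type header is the more reliable filter.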
If you want true parallelism, the same structure extends to a thread pool: swap the HashSet for a ConcurrentHashMap.newKeySet() and the ArrayDeque for a ConcurrentLinkedQueue, then run several worker threads pulling from the frontier. Keep the politeness delay per host so concurrency does not turn into a flood. For workloads where you submit many URLs and collect results as they finish rather than blocking on each, the asynchronous Crawler handles the queueing and callbacks for you.
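Here is one way that could look, sketched against a hypothetical in-memory link map so it runs without a network. It is a level-by-level variant rather than free-running workers: every page at one depth is processed in parallel, then the next depth starts, which keeps depths exact and termination simple. The class name, the LINKS map, and the crawl signature are assumptions for illustration, and the per-host politeness delay is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ParallelCrawlSketch {
    // Hypothetical in-memory site standing in for fetch + extractLinks.
    static final Map<String, List<String>> LINKS = Map.of(
            "/", List.of("/a", "/b"),
            "/a", List.of("/", "/c"),
            "/b", List.of("/a"),
            "/c", List.of("/d"),
            "/d", List.of());

    // Returns every page reached within maxDepth hops of the seed.
    static Set<String> crawl(String seed, int maxDepth, int workers) {
        Set<String> visited = ConcurrentHashMap.newKeySet(); // thread-safe visited set
        visited.add(seed);
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<String> level = List.of(seed);
            for (int depth = 0; depth <= maxDepth && !level.isEmpty(); depth++) {
                final boolean followLinks = depth < maxDepth;
                Queue<String> next = new ConcurrentLinkedQueue<>(); // frontier for the next depth
                List<Callable<Void>> jobs = new ArrayList<>();
                for (String url : level) {
                    jobs.add(() -> {
                        List<String> links = LINKS.getOrDefault(url, List.of()); // the "fetch"
                        if (followLinks) {
                            for (String link : links) {
                                if (visited.add(link)) next.add(link); // atomic check-and-record
                            }
                        }
                        return null;
                    });
                }
                pool.invokeAll(jobs); // blocks until every page on this level is done
                level = new ArrayList<>(next);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("crawl interrupted", e);
        } finally {
            pool.shutdown();
        }
        return new TreeSet<>(visited); // sorted copy for stable output
    }

    public static void main(String[] args) {
        System.out.println(crawl("/", 2, 4)); // prints [/, /a, /b, /c]
    }
}
```

The atomic visited.add is what makes this safe: two workers that discover the same link at the same moment cannot both enqueue it. A real version would replace the map lookup with fetch and extractLinks and add a rate limiter per host.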
Key takeaways
- A crawler discovers its own URLs. The defining parts are a frontier queue, a visited set, and bounds, not the extraction.
- HttpClient plus Jsoup is the core stack. Java 11's HttpClient fetches, Jsoup parses HTML and resolves links with absUrl.
- Breadth-first needs three guards. A FIFO frontier with depth, a HashSet of visited URLs, and a same-domain check keep the crawl bounded and loop-free.
- Be polite. Delay between requests, set a real User-Agent, and respect robots.txt and terms of service.
- JS rendering and IP blocks are where DIY breaks. Routing the fetch through the Crawling API handles both server-side and leaves the crawler logic untouched.
Frequently Asked Questions (FAQs)
What is the difference between a web crawler and a web scraper?
A scraper extracts data from URLs you already have. A crawler starts from a seed URL, follows the links it finds, and discovers new pages as it runs, so the set of pages it visits grows over time. Extraction is still part of crawling, but the defining features are link discovery and traversal, which is why a crawler needs a frontier queue and a visited set that a single scraper does not.
Should I use HttpClient or Jsoup to fetch pages in Java?
Use both for what each does best. Java 11's HttpClient gives you fine control over the request: timeouts, headers, redirect policy, and synchronous or asynchronous sends. Jsoup is built for parsing the returned HTML and selecting elements with CSS selectors. Jsoup can fetch on its own, but pairing a dedicated HTTP client with Jsoup as the parser keeps the two concerns separate and is easier to extend, such as routing fetches through an API later.
How do I keep my crawler from looping forever?
Maintain a Set of URLs you have already enqueued and check it before adding any new link. In the example, visited.add(link) returns false when the URL is already present, so a single call both tests membership and records the URL. Combined with a depth limit and a same-domain guard, that prevents cycles and stops the crawl from wandering across the whole web.
Why does my Java crawler return empty or missing links?
The most likely cause is JavaScript rendering. Your fetch returns the raw HTML the server sends, and many sites build their content in the browser with frameworks like React or Angular, so that raw HTML is a near-empty shell with few or no anchors. Jsoup cannot run JavaScript, so it finds nothing to follow. Render the page first with a headless browser, or route the fetch through the Crawling API with rendering enabled so the HTML is fully built before you parse it.
How do I avoid getting blocked while crawling?
Pace your requests with a delay, set an honest User-Agent, respect robots.txt and crawl-delay directives, and keep per-host volume reasonable. The harder problem is IP reputation: many requests from one datacenter address get flagged fast. Spreading requests across rotating residential IPs avoids that signature. The Crawling API handles rotation for you; if you build your own stack, that is the part to invest in.
Can I make a Java crawler run in parallel?
Yes. The breadth-first structure extends to multiple threads: replace the HashSet with ConcurrentHashMap.newKeySet() and the ArrayDeque with a thread-safe queue like ConcurrentLinkedQueue, then run several workers pulling from the frontier. Keep a politeness delay per host so concurrency does not turn into a flood, and consider the asynchronous Crawler for fire-and-collect workloads where you submit many URLs and gather results as they complete.
