Java is not the first language most people reach for when they think about scraping, but it is one of the most dependable. If your data pipeline, your service layer, or your batch jobs already run on the JVM, doing the extraction in Java keeps everything in one codebase, one build, and one deployment story. The ecosystem is mature: a fast HTML parser, headless browser bindings, and a strong concurrency model are all first-class.
This guide is a practical tour of web scraping with Java. We cover the toolbox you actually need (Jsoup for static pages, HtmlUnit and Selenium for JavaScript-rendered ones), walk through one clean Jsoup example end to end, and then deal with the part that breaks most scrapers at scale: anti-bot defenses and IP blocks. The goal is a tight mental model you can build on, not an exhaustive catalog.
The Java scraping toolbox
Almost every Java scraper is built from one of three tools. Picking the right one is mostly a question of how the target page delivers its content.
- Jsoup is the workhorse. It fetches a URL over HTTP and parses the returned HTML into a queryable document with CSS-selector support that feels like jQuery. It is fast, has no runtime dependencies of its own, and is the right default for any page whose data is present in the initial HTML.
- HtmlUnit is a GUI-less browser for Java. It executes JavaScript and builds a DOM, so it can reach content that Jsoup cannot see. It is lighter than a real browser but its JavaScript engine lags behind modern sites, so complex single-page apps can trip it up.
- Selenium WebDriver drives a real browser (Chrome, Firefox) from Java. It renders anything a user would see, which makes it the most capable option for heavy client-side rendering, at the cost of being the slowest and most resource-hungry.
The decision is simpler than the long list of "scraper types" you sometimes see. Is the data in the raw HTML? Use Jsoup. Does it only appear after scripts run? Reach for HtmlUnit, and fall back to Selenium when HtmlUnit cannot keep up. Everything else is detail.
Open the target URL, view source (not the inspector, the raw source), and search for a value you want to extract. If it is there, the page is server-rendered and Jsoup alone will do. If the source is a near-empty shell and the data only shows up in the live DOM, the page renders client-side and you need a browser engine or a rendering service.
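The view-source check can also be scripted. The sketch below is one possible way to do it with the JDK's HttpClient (Java 11 or later): it downloads the raw HTML without running any scripts and looks for a value you expect to see. The class name, the helper name, and the sample title used as a default are illustrative assumptions, not part of any library.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class RenderCheck {

    // True when the expected value is already present in the HTML the server sent.
    public static boolean appearsInRawHtml(String rawHtml, String expectedValue) {
        return rawHtml != null
                && expectedValue != null
                && !expectedValue.isEmpty()
                && rawHtml.contains(expectedValue);
    }

    public static void main(String[] args) throws Exception {
        // Both defaults are placeholders; pass your own URL and a value you expect on the page.
        String url = args.length > 0 ? args[0] : "https://books.toscrape.com/";
        String expected = args.length > 1 ? args[1] : "A Light in the Attic";

        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(10))
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", "Mozilla/5.0 (compatible; MyScraper/1.0)")
                .timeout(Duration.ofSeconds(10))
                .GET()
                .build();

        // No JavaScript runs here, so this is the same HTML that "view source" shows.
        String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();

        System.out.println(appearsInRawHtml(body, expected)
                ? "Found in raw HTML: a plain Jsoup fetch should be enough"
                : "Not in raw HTML: the page likely renders client-side");
    }
}
```

A plain substring match is deliberately crude. Text that the server HTML-escapes (for example an ampersand) will not match its unescaped form, so pick a simple value such as a product name or an ID.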
Set up a Jsoup project
Jsoup ships as a single Maven dependency. Add it to your pom.xml and you are ready to fetch and parse.
```xml
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.17.2</version>
</dependency>
```
If you use Gradle instead, the same coordinate goes in your build.gradle as implementation 'org.jsoup:jsoup:1.17.2'. Either way you now have a fetcher and a parser in one library.
A clean Jsoup example, start to finish
The pattern for a static page is always the same: connect to the URL, get back a parsed Document, select the elements you want with CSS selectors, and pull out fields. Here is a complete, runnable example that scrapes a list of book entries and prints a structured record for each one. Swap the URL and selectors for your own target.
```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class BookScraper {
    public static void main(String[] args) throws Exception {
        String url = "https://books.toscrape.com/";

        // Fetch the page and parse it into a Document in one call.
        Document doc = Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (compatible; MyScraper/1.0)")
                .timeout(10000) // milliseconds
                .get();

        // Each book on the page is one <article class="product_pod">.
        Elements books = doc.select("article.product_pod");
        for (Element book : books) {
            // The full title lives in the link's title attribute.
            String title = book.selectFirst("h3 a").attr("title");
            String price = book.selectFirst("p.price_color").text();
            String stock = book.selectFirst("p.instock").text().trim();
            System.out.printf("%s | %s | %s%n", title, price, stock);
        }
    }
}
```
A few things in that snippet earn their place. The userAgent call sends an explicit, identifiable User-Agent instead of relying on whatever default your Jsoup version ships with; some servers reject requests whose identifier is missing or looks automated. The timeout (in milliseconds) stops a slow host from hanging your run forever. select returns every matching element, which you iterate; selectFirst returns the first match or null, which is what you want for a field inside a row. Reading the title from the title attribute rather than the link text avoids the truncated "..." that the visible text sometimes carries.
The output is one tidy line per book: the title, the price, and the availability. From here it is a short step to write each record to JSON, CSV, or straight into a database, and to follow the pagination links to walk the whole catalog. If your goal is a full multi-page crawl rather than a single fetch, the dedicated walkthrough on how to build a web crawler in Java takes this pattern further into queueing and link discovery.
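Following pagination usually comes down to finding the "next" link and resolving it to an absolute URL. The sketch below shows one way to do that with Jsoup. The selector li.next a and the sample markup are assumptions about how the target's pager is structured, so check them against the live page. The demo parses an inline HTML string so it runs without a network connection.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class NextPageFinder {

    // Returns the absolute URL of the next page, or null when there is no next link.
    public static String nextPageUrl(Document doc) {
        Element next = doc.selectFirst("li.next a"); // assumed pager markup
        if (next == null) {
            return null;
        }
        // absUrl resolves a relative href against the document's base URI.
        String url = next.absUrl("href");
        return url.isEmpty() ? null : url;
    }

    public static void main(String[] args) {
        // Stand-in markup; a real run would use the Document returned by Jsoup.connect(url).get().
        String html = "<ul class=\"pager\">"
                + "<li class=\"next\"><a href=\"page-2.html\">next</a></li>"
                + "</ul>";
        Document doc = Jsoup.parse(html, "https://books.toscrape.com/catalogue/page-1.html");
        System.out.println(nextPageUrl(doc));
    }
}
```

In a full run you would loop: fetch a page, extract its records, call nextPageUrl, and stop when it returns null. A Document fetched with Jsoup.connect already carries its base URI, so relative links resolve without extra work.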
The Java code above almost never changes. The selectors do. Sites rename classes and restructure markup without warning, so a scraper that worked last month can return empty results today. Keep your selectors in one place, fail loudly when an expected element is missing, and re-inspect the live page when extraction goes quiet. This is routine maintenance, not a sign the approach is wrong.
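One way to apply that advice is sketched below: the selectors live in a single class as constants, and a small generic guard turns a missing element into an exception that names the selector and the URL. The class and method names are hypothetical, and the guard uses only the JDK, so the demo passes a plain string where real code would pass the result of a Jsoup lookup.

```java
public class SelectorGuard {

    // Selectors from the book example, kept together so a markup change is a one-line fix.
    public static final String BOOK_ROW = "article.product_pod";
    public static final String TITLE_LINK = "h3 a";
    public static final String PRICE = "p.price_color";

    // Returns the value unchanged, or throws with the selector and URL when it is missing.
    public static <T> T require(T found, String selector, String url) {
        if (found == null) {
            throw new IllegalStateException(
                    "Selector '" + selector + "' matched nothing on " + url);
        }
        return found;
    }

    public static void main(String[] args) {
        String url = "https://books.toscrape.com/";

        // With Jsoup this would wrap a lookup, for example:
        //   Element link = require(book.selectFirst(TITLE_LINK), TITLE_LINK, url);
        System.out.println(require("stand-in element", TITLE_LINK, url));

        try {
            require(null, PRICE, url); // simulates a selector that no longer matches
        } catch (IllegalStateException e) {
            System.out.println("Failed loudly: " + e.getMessage());
        }
    }
}
```

Throwing is one design choice among several. For a long crawl you may prefer to log the failure and skip the record, as long as the selector and URL end up somewhere you will actually look.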
Handling JavaScript-rendered pages
Jsoup only ever sees the HTML the server sends. When a page builds its content in the browser, that initial HTML is a shell and Jsoup comes back empty. You have two ways forward.
The first is a Java browser engine. HtmlUnit runs the page's JavaScript headlessly and gives you a populated DOM you can query much like Jsoup. It is fast and has no external browser to install, but its scripting engine is not as current as a real browser, so modern frameworks can render incorrectly or throw. Selenium WebDriver sidesteps that by automating an actual Chrome or Firefox: it renders exactly what a user sees, waits for elements, and even handles interactions like clicks and scrolls. The trade-off is weight. A browser per worker eats memory and CPU, and a fleet of them is real infrastructure to run and keep healthy.
The second way is to skip the browser-on-your-side problem entirely and let a rendering service return finished HTML. That keeps your Java code as simple as the Jsoup example above while still getting fully rendered pages, which is exactly what the next section covers.
The real blocker: anti-bot defenses at scale
Rendering is the easy half of the problem. The hard half shows up the moment you run more than a handful of requests. Commercial sites watch for scraper-shaped traffic and push back with rate limits, CAPTCHAs, and outright IP bans. A datacenter IP making dozens of identical requests a minute gets flagged fast, and once an IP is blocked, every request from it fails no matter how clean your parsing code is.
The fixes are well known: rotate through many IP addresses so no single one trips a limit, prefer residential IPs that read as real users, pace your requests, vary your headers, and render pages so your traffic looks like a browser. The trouble is that assembling all of that yourself (a healthy proxy pool plus a headless browser fleet plus retry logic) is most of the engineering, and it has nothing to do with the data you actually want. For the full playbook, see how to scrape websites without getting blocked.
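Of the fixes listed above, pacing and retry logic are the two you can handle in plain Java. The sketch below shows one common approach, exponential backoff with a little random jitter, using only the JDK. The class name, the 25% jitter, and the 30-second cap are illustrative choices rather than recommended values, and nothing here addresses IP rotation or CAPTCHAs.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

public class PoliteRetry {

    // Delay for a zero-based attempt number: base, 2x base, 4x base, ... up to the cap.
    public static long backoffMillis(int attempt, long baseMillis, long capMillis) {
        long delay = baseMillis << Math.min(attempt, 20); // bounded shift avoids overflow
        return Math.min(delay, capMillis);
    }

    // Adds up to 25% random jitter so requests do not arrive on a fixed rhythm.
    public static long withJitter(long delayMillis) {
        return delayMillis + ThreadLocalRandom.current().nextLong(delayMillis / 4 + 1);
    }

    // Runs the task, sleeping with backoff after each failure; rethrows the last failure.
    public static <T> T retry(Callable<T> task, int maxAttempts, long baseMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts - 1) {
                    Thread.sleep(withJitter(backoffMillis(attempt, baseMillis, 30_000)));
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated fetch that fails twice before succeeding.
        String result = retry(() -> {
            if (++calls[0] < 3) {
                throw new java.io.IOException("simulated transient failure");
            }
            return "succeeded on attempt " + calls[0];
        }, 5, 10);
        System.out.println(result);
    }
}
```

In a real scraper the task would be the fetch itself, and you would likely retry only on errors that look transient (timeouts, 429 and 5xx responses) rather than on every exception as this sketch does.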
Route requests through the Crawling API
This is where pushing the hard parts server-side keeps your Java simple. The Crawling API is a single HTTP endpoint: you send it a target URL and your token, and it renders the page in a real browser behind a rotating residential IP, then returns the finished HTML. Your code stays a plain HTTP call followed by a Jsoup parse. No headless fleet, no proxy pool, no CAPTCHA handling on your end.
Because it is just an HTTP request, you can call it with Java's built-in HttpClient (available since Java 11). Pass &javascript=true when the target renders client-side; drop it for static pages to save on rendering.
```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class CrawlingApiScraper {
    private static final String TOKEN = "YOUR_CRAWLBASE_JS_TOKEN";

    public static void main(String[] args) throws Exception {
        String target = "https://www.example.com/products";

        // The target URL travels as a query parameter, so it must be URL-encoded.
        String encoded = URLEncoder.encode(target, StandardCharsets.UTF_8);
        String endpoint = "https://api.crawlbase.com/?token=" + TOKEN
                + "&javascript=true&url=" + encoded;

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(endpoint))
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        // Stop early on anything other than a successful response.
        if (response.statusCode() != 200) {
            System.out.println("Request failed: " + response.statusCode());
            return;
        }

        // From here on, parsing is the same as with a direct Jsoup fetch.
        Document doc = Jsoup.parse(response.body());
        Elements items = doc.select("div.product");
        System.out.println("Parsed " + items.size() + " products");
    }
}
```
Notice that the parsing half is identical to the plain Jsoup example. The only change is where the HTML comes from: instead of Jsoup.connect(url).get() hitting the site directly, you fetch through the API and hand the body to Jsoup.parse(). Everything you already know about selectors carries over unchanged, while rendering, IP rotation, and block avoidance happen on the other side of that one HTTP call.
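That boundary can be made explicit in code. The sketch below is one possible shape, using only the JDK: a small Fetcher interface (a hypothetical name) that turns a URL into HTML, plus a helper that builds the endpoint in the same format as the example above and adds javascript=true only when asked. Extraction code that depends on Fetcher alone does not need to know which implementation it was given.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class FetchBoundary {

    // All that extraction code needs: a URL goes in, HTML comes out.
    @FunctionalInterface
    public interface Fetcher {
        String fetch(String url) throws Exception;
    }

    // Builds the endpoint in the format used above; renderJs toggles javascript=true.
    public static String buildEndpoint(String token, String targetUrl, boolean renderJs) {
        String encoded = URLEncoder.encode(targetUrl, StandardCharsets.UTF_8);
        return "https://api.crawlbase.com/?token=" + token
                + (renderJs ? "&javascript=true" : "")
                + "&url=" + encoded;
    }

    // A Fetcher that routes every request through the API.
    public static Fetcher viaApi(HttpClient client, String token, boolean renderJs) {
        return url -> {
            HttpRequest request = HttpRequest
                    .newBuilder(URI.create(buildEndpoint(token, url, renderJs)))
                    .GET()
                    .build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() != 200) {
                throw new IOException("Request failed: " + response.statusCode());
            }
            return response.body();
        };
    }

    public static void main(String[] args) throws Exception {
        System.out.println(buildEndpoint("YOUR_TOKEN", "https://www.example.com/products", true));

        // A stub stands in for the network so the boundary can be exercised offline.
        Fetcher stub = url -> "<html><body><div class=\"product\"></div></body></html>";
        System.out.println(stub.fetch("https://www.example.com/products").length()
                + " characters fetched");
    }
}
```

The stub in main is also how you would unit-test extraction code: feed it saved HTML through a Fetcher and assert on the parsed records, with no live requests involved.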
Scraping at scale fails on blocks, not on parsing. The Crawling API renders each page in a real browser behind a rotating residential IP and returns finished HTML to a single HTTP call, so your Java code stays a fetch plus a Jsoup parse. Add javascript=true for client-rendered pages and start on the free tier.
When to use a browser instead of a service
Routing through the API is not the only valid choice. If your target needs genuine interaction (logging in, clicking through multi-step flows, filling forms, dragging sliders), then driving a real browser with Selenium gives you control a single HTTP call cannot. The honest trade-off is operational cost: you own the browser fleet, the memory footprint, and the flakiness that comes with automating live UIs. For background on running browsers for extraction and where they pay off, the overview of the right headless browser for web scraping is a good companion read.
A common middle ground is to use the managed API for the bulk of straightforward page fetches, where blocks and rendering are the whole problem, and reserve a Selenium worker for the small slice of targets that genuinely require interaction. That keeps your infrastructure light without giving up the cases that need a full browser.
Good habits for production Java scrapers
Whichever tool you land on, a few practices separate a script that works once from a scraper that runs reliably for months.
- Set a real user agent and timeout. Default or missing identifiers can get rejected, and an unbounded request can stall an entire batch.
- Fail loudly on missing fields. A null where you expected an element is your earliest signal that selectors drifted. Catch it and log the URL.
- Pace and retry with backoff. Space requests out, and on a transient error wait longer before retrying instead of hammering.
- Separate fetching from parsing. A clean boundary means you can switch from a direct Jsoup fetch to the Crawling API without touching your extraction code.
- Persist as you go. Write each record as it is parsed rather than buffering a long run in memory, so a crash near the end does not lose everything.
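As an illustration of the last habit, the sketch below appends each record to a CSV file and flushes after every write, so records already written survive a later crash. It uses only the JDK. The class name, the file name, and the sample rows are made up for the demo, and the quoting follows the usual CSV convention of doubling embedded quotes.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class CsvSink implements AutoCloseable {
    private final BufferedWriter out;

    // Opens the file in append mode, creating it if it does not exist yet.
    public CsvSink(Path file) throws IOException {
        out = Files.newBufferedWriter(file, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // Quotes a field when it contains a comma, a quote, or a line break.
    public static String escape(String field) {
        if (field == null) {
            return "";
        }
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        String escaped = field.replace("\"", "\"\"");
        return needsQuotes ? "\"" + escaped + "\"" : escaped;
    }

    // Writes one record and flushes immediately instead of buffering the whole run.
    public void write(String... fields) throws IOException {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                line.append(',');
            }
            line.append(escape(fields[i]));
        }
        out.write(line.toString());
        out.newLine();
        out.flush();
    }

    @Override
    public void close() throws IOException {
        out.close();
    }

    public static void main(String[] args) throws IOException {
        // Sample rows are illustrative, not scraped data.
        try (CsvSink sink = new CsvSink(Path.of("books.csv"))) {
            sink.write("Sample Title", "51.77", "In stock");
            sink.write("Title, With Comma", "9.99", "In stock");
        }
    }
}
```

Flushing on every record trades some throughput for durability. For very large runs you might flush every few hundred records instead, or write to a database with periodic commits.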
Key takeaways
- Jsoup is the default. For any page whose data lives in the raw HTML, connect, select with CSS selectors, and extract. It is fast and dependency-light.
- JavaScript pages need a browser engine or a renderer. HtmlUnit and Selenium render client-side content in Java; a rendering service returns finished HTML without you running one.
- Blocks, not parsing, break scrapers at scale. Rate limits, CAPTCHAs, and IP bans are the real obstacle, and rotating residential IPs plus pacing are the answer.
- The Crawling API keeps Java simple. One HTTP call handles rendering and IP rotation server-side, so your code stays a fetch plus a Jsoup parse.
- Selectors are the fragile layer. Expect them to drift, fail loudly when they do, and treat re-inspecting the live page as routine maintenance.
Frequently Asked Questions (FAQs)
Can you do web scraping with Java?
Yes. Java has a mature scraping ecosystem: Jsoup fetches and parses static HTML with CSS selectors, HtmlUnit and Selenium handle JavaScript-rendered pages, and the built-in HttpClient lets you call rendering or proxy services. If your stack already runs on the JVM, scraping in Java keeps extraction in the same codebase as the rest of your pipeline.
Which Java library is best for web scraping?
It depends on the page. Jsoup is the best default for static, server-rendered HTML because it is fast and simple. For pages that build their content in the browser, HtmlUnit runs JavaScript headlessly and Selenium WebDriver drives a real browser for the heaviest single-page apps. Most projects use Jsoup for parsing and add a browser engine or a rendering service only where the page demands it.
When does Jsoup stop being enough?
Jsoup only sees the HTML the server returns, so it comes back empty on pages that render their data client-side with JavaScript. The quick test is to view the raw page source and search for a value you want: if it is missing from the source but present in the live DOM, you need a browser engine like HtmlUnit or Selenium, or a rendering service that returns finished HTML for Jsoup to parse.
How do I avoid getting blocked while scraping with Java?
Rotate through many IP addresses so no single one trips a rate limit, prefer residential IPs that read as real users, pace your requests, vary your headers, and render pages so the traffic looks like a browser. Building that yourself is most of the work, which is why many teams route requests through the Crawling API, which handles rotation, rendering, and block avoidance server-side.
Is Java or Python better for web scraping?
Python has the larger scraping community and a slightly gentler learning curve, but Java is fully capable and often the better fit when your data platform, services, or batch jobs already run on the JVM. Staying in one language and one build for both extraction and downstream processing is usually worth more than a marginal difference in library count.
Do I need the JavaScript option on the Crawling API for every page?
No. Pass javascript=true only for pages that render their content client-side. Static, server-rendered pages return their data without it, and skipping rendering on those is faster and cheaper. The simplest rule: turn it on when a plain fetch returns an empty shell, and leave it off otherwise.
