cURL for Web Scraping: headers, cookies, proxies, and pipes

cURL is the fastest way to pull a web page from a terminal. It speaks HTTP, HTTPS, FTP, and more, it ships on almost every machine you will ever touch, and it does exactly one thing well: send a request and hand you back the raw response. That makes cURL for web scraping a natural first tool, whether you are probing an endpoint by hand, prototyping a request before you write real code, or wiring a one-liner into a shell script.

This guide is a practical, command-line walkthrough. You will fetch a page, save it to disk, follow redirects, set headers and a User-Agent, send cookies, POST form and JSON data, route through a proxy, and read the HTTP status code. Then we look at where bare cURL hits a wall (JavaScript-rendered pages and anti-bot blocks) and how to get finished HTML back through the Crawlbase Crawling API and Smart AI Proxy without leaving the terminal.

Why cURL for scraping at all

Most scraping ends up in Python or Node, so why start at the command line? Because cURL is the shortest path from "I wonder what this URL returns" to an answer. There is nothing to install, no script to write, no dependencies to manage. You type one command and see the exact bytes the server sent, headers included. When a scraper misbehaves, reproducing the request in cURL is the quickest way to find out whether the problem is your code or the server.

cURL is also composable. Because it writes raw output to stdout, you can pipe it straight into a parser, a grep, or a file, and loop it over a list of URLs in a few lines of bash. For small jobs that pattern is often all you need. For larger ones, cURL is still the tool you reach for to debug the request before you scale it up.

Fetch a page with a GET request

The simplest possible scrape is a bare GET. Point cURL at a URL and it prints the response body to your terminal.

bash

curl https://example.com

By default cURL makes a GET request, so there is no flag to add. The HTML floods your terminal, which is fine for a quick look but not for anything larger. The next few moves are about controlling that output: saving it, quieting the noise, and reading the metadata around the body. For a deeper dive on the request side, see how to send GET requests with cURL.

Save the response to a file

Two flags handle saving. Use -o to write to a filename you choose, or -O (capital O) to keep the remote filename from the URL.

bash

# Write the body to a named file
curl -o page.html https://example.com

# Keep the remote filename (saves as report.pdf)
curl -O https://example.com/files/report.pdf

Saving to disk is the foundation of any batch scrape: fetch once, parse later, and avoid re-hitting the server while you iterate on your extraction logic. It also keeps a raw copy you can diff when a page's markup changes under you.
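The fetch-once, parse-later habit can be wrapped in a small function. This is a minimal sketch rather than a standard tool: fetch_cached is a hypothetical helper name, and it assumes that any non-empty file on disk counts as already fetched.

```shell
# fetch_cached URL FILE
# Download URL into FILE only when FILE is missing or empty.
# fetch_cached is a hypothetical helper, not a curl feature.
fetch_cached() {
  if [ -s "$2" ]; then
    echo "cached: $2" >&2
    return 0
  fi
  # -f:  fail on most HTTP errors (4xx/5xx) instead of saving the error page
  # -sS: hide the progress meter but still print error messages
  curl -fsS -o "$2" "$1"
}

# Usage (the first call makes a real request, the second reads from disk):
#   fetch_cached https://example.com page.html
#   fetch_cached https://example.com page.html
```

Delete the file when you want a fresh copy; the function has no notion of expiry.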

Follow redirects with -L

By default cURL does not follow redirects. If a URL returns a 301 or 302, you get the redirect response itself, not the page it points to, which is a classic reason a scrape comes back empty. The -L flag tells cURL to follow the Location header to the final destination.

bash

curl -L https://example.com/old-path

Most production sites redirect HTTP to HTTPS, normalize trailing slashes, or move pages around, so -L is a flag you will almost always want on. If you ever need to inspect a redirect chain instead of following it blindly, drop -L, add -i so the response headers are printed with the body, and read the status line and Location header directly.
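To see a whole redirect chain at once, one option is to ask for headers only and filter out the Location lines. The sketch below assumes the server answers HEAD requests the same way it answers GET, which is not always true; redirect_hops is a hypothetical helper name.

```shell
# redirect_hops: read response headers on stdin, print each Location value.
# Hypothetical helper name; the curl flags in the usage lines are standard.
redirect_hops() {
  # Strip carriage returns (header lines end in CRLF), then match the header
  # name case-insensitively, since HTTP/2 responses send it in lowercase.
  tr -d '\r' | awk 'tolower($1) == "location:" { print $2 }'
}

# Usage (makes real requests):
#   curl -sIL https://example.com/old-path | redirect_hops
#   curl -sL -o /dev/null -w '%{num_redirects} %{url_effective}\n' https://example.com/old-path
```

The second usage line skips the helper entirely and lets cURL report how many redirects it followed and where it ended up.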

Set a User-Agent and other headers

cURL announces itself with a User-Agent like curl/8.4.0, which many sites flag instantly as a non-browser. Sending a realistic User-Agent is the single highest-value header you can set. Use -H to add any header, or the dedicated -A flag for the User-Agent.

bash

# Set a browser-like User-Agent with -A
curl -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36" https://example.com

# Add multiple headers with repeated -H flags
curl https://example.com \
  -H "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)" \
  -H "Accept-Language: en-US,en;q=0.9" \
  -H "Referer: https://www.google.com/"

Realistic headers make your request look like a browser instead of a script. A sensible default set is a current User-Agent, an Accept and Accept-Language pair, and sometimes a Referer. For the full picture on which headers matter and why, see how to send HTTP headers with cURL.
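Retyping the same headers on every command gets tedious. cURL can read options from a config file with -K, so one way to reuse a header set is sketched below. The function name, the file name, and the exact header values are illustrative assumptions, not a recommended fingerprint.

```shell
# write_browser_config FILE
# Write a curl config file holding a browser-like header set.
# The function name and header values are illustrative assumptions.
write_browser_config() {
  cat > "$1" <<'EOF'
# Read by curl via -K / --config: long option names, no leading dashes
user-agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
header = "Accept: text/html,application/xhtml+xml;q=0.9,*/*;q=0.8"
header = "Accept-Language: en-US,en;q=0.9"
EOF
}

# Usage (makes a real request):
#   write_browser_config browser.curlrc
#   curl -K browser.curlrc https://example.com
```

Flags given on the command line still apply alongside the file, so a per-request Referer can be added with -H as usual.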

Send and store cookies

Some pages need a session before they return content: a consent cookie, a region selection, or a logged-in state. cURL handles cookies with two flags. Use -b to send cookies and -c to save the cookies a server sets into a file (a cookie jar) so a later request can reuse them.

bash

# Send a cookie inline
curl -b "consent=1; region=us" https://example.com

# First request: save cookies the server sets into a jar
curl -c cookies.txt https://example.com/start

# Second request: replay the saved cookies
curl -b cookies.txt https://example.com/dashboard

The save-then-replay pattern lets a two-step flow work from the command line: hit the entry page to collect cookies, then carry them into the page you actually want. You can combine -b and -c in one request to both load and update the jar as a session evolves.
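Combining -b and -c on the same jar can be sketched as a pair of small helpers. Both function names are hypothetical; the jar layout the second one relies on is the tab-separated Netscape cookie file format that -c writes.

```shell
# session_get JAR URL [extra curl args...]
# Load cookies from JAR and write any updates back to it in one request.
session_get() {
  jar=$1
  shift
  curl -sS -L -b "$jar" -c "$jar" "$@"
}

# jar_cookie NAME JAR
# Print the value of cookie NAME from a Netscape-format cookie jar.
# Fields are tab-separated; the name is field 6 and the value is field 7.
jar_cookie() {
  awk -F '\t' -v name="$1" '$6 == name { print $7 }' "$2"
}

# Usage (makes real requests):
#   session_get cookies.txt https://example.com/start > /dev/null
#   session_get cookies.txt https://example.com/dashboard -o dashboard.html
#   jar_cookie region cookies.txt
```

Reading a value back out of the jar is a quick way to confirm the first request actually set the cookie you expected.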

POST form and JSON data

Scraping is not always GET. Search forms, filters, and APIs often expect a POST. The -d flag sends data and implicitly switches the method to POST. By default it sends form-encoded data; set Content-Type: application/json to send a JSON body.

bash

# Form-encoded POST (application/x-www-form-urlencoded)
curl -d "query=laptops&page=2" https://example.com/search

# JSON POST: -d already implies POST, so only the content type is set explicitly
curl https://example.com/api/search \
  -H "Content-Type: application/json" \
  -d '{"query": "laptops", "page": 2}'

For form data, separate fields with & just as a browser would. For JSON, wrap the body in single quotes so the shell leaves the double quotes intact, and always send the matching Content-Type or the server may reject or misread the payload. Hitting a JSON API directly like this is often cleaner than scraping rendered HTML, when the endpoint is available.
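Two details tend to bite once the payload stops being a toy: form values that contain spaces or ampersands, and JSON bodies too long to quote inline. The sketch below shows one way to handle each. The function names are hypothetical, and the query and page fields simply reuse the example above.

```shell
# post_form URL QUERY PAGE
# Form POST; --data-urlencode percent-encodes each value, so spaces and
# ampersands inside a search term cannot break the field boundaries.
post_form() {
  curl -sS "$1" \
    --data-urlencode "query=$2" \
    --data-urlencode "page=$3"
}

# post_json_file URL FILE
# JSON POST with the body read from FILE ("@" means "read from this file").
# --data-binary sends the file unchanged; plain -d would strip newlines.
post_json_file() {
  curl -sS "$1" \
    -H "Content-Type: application/json" \
    --data-binary "@$2"
}

# Usage (makes real requests; payload.json is a file you create yourself):
#   post_form https://example.com/search "laptops & tablets" 2
#   post_json_file https://example.com/api/search payload.json
```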

Route cURL through a proxy

Scraping from one IP at volume gets that IP rate-limited or blocked. Routing through a proxy puts a different address in front of your request. The -x flag sets the proxy, with optional user:pass@ credentials.

bash

# Plain HTTP proxy
curl -x http://proxy.example.com:8080 https://example.com

# Authenticated proxy
curl -x http://user:pass@proxy.example.com:8080 https://example.com

A single static proxy only moves the problem to a new IP. What keeps a real scrape alive is rotation across many addresses so no single one trips a limit. The full mechanics of pointing cURL at a proxy, including SOCKS and rotating endpoints, are covered in how to use cURL with a proxy.
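If you do have several proxies of your own, the simplest form of rotation is round-robin over a list. The sketch below assumes a hypothetical proxies.txt with one proxy URL per line; pick_proxy is an illustrative helper, not a cURL feature. A SOCKS entry such as socks5h://host:1080 can sit in the same file, since -x accepts that scheme too.

```shell
# pick_proxy N FILE
# Print the proxy for request number N, cycling through the lines of FILE.
pick_proxy() {
  # Count non-empty lines; bail out if the file has none
  count=$(grep -c . "$2") || return 1
  grep . "$2" | sed -n "$(( $1 % count + 1 ))p"
}

# Usage (makes real requests; proxies.txt and urls.txt are hypothetical files):
#   n=0
#   while IFS= read -r url; do
#     curl -sS -x "$(pick_proxy "$n" proxies.txt)" -o "page_$n.html" "$url"
#     n=$((n + 1))
#   done < urls.txt
```

This spreads requests evenly but does nothing about dead or banned proxies, which is the part a managed rotating endpoint takes off your hands.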

Silent, verbose, and reading the status code

For scripting you usually want cURL quiet; when something breaks, you want everything. The -s flag silences the progress meter and error messages (pair it with -S to keep the errors); -v dumps the full request and response, headers and all, which is invaluable for debugging a block. To read just the HTTP status code, use a write-out template.

bash

# Silent: no progress meter; the body goes to page.html
curl -s https://example.com -o page.html

# Verbose: see the full request/response exchange
curl -v https://example.com

# Print only the HTTP status code, discard the body
curl -s -o /dev/null -w "%{http_code}\n" https://example.com

That last command is a workhorse: it throws the body away, prints just the status code, and gives you a clean signal for a script to branch on. A 200 means proceed; a 403 or 429 means you are being challenged or throttled and should back off or change tactics. Reading status codes is how you tell a healthy run from a blocked one.
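Branching on the code can be as simple as a case statement. The sketch below maps codes to labels a script might act on. The labels and the classify_status name are an illustrative convention, not anything cURL prints; the 000 case covers the value cURL writes for %{http_code} when no HTTP response arrived at all.

```shell
# classify_status CODE
# Map an HTTP status code to the action a scraping script might take.
# The labels are an illustrative convention, not curl output.
classify_status() {
  case "$1" in
    2??)                 echo "ok" ;;
    301|302|303|307|308) echo "redirect" ;;
    403)                 echo "blocked" ;;
    429)                 echo "throttled" ;;
    5??)                 echo "server-error" ;;
    000)                 echo "no-response" ;;
    *)                   echo "other" ;;
  esac
}

# Usage (makes a real request):
#   code=$(curl -s -o /dev/null -w '%{http_code}' https://example.com)
#   [ "$(classify_status "$code")" = "ok" ] || echo "back off: got $code" >&2
```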

Pipe cURL into a parser

cURL fetches; it does not parse. The Unix answer is to pipe its output into a tool that does. For a quick field grab, pipe into grep or a small Python one-liner; for real HTML querying, a tool like htmlq (CSS selectors at the command line) is hard to beat.

bash

# Grab the <title> with grep
curl -s https://example.com | grep -o '<title>[^<]*</title>'

# Parse properly with htmlq (CSS selectors)
curl -s https://example.com | htmlq 'h1' --text

# Hand the HTML to Python for structured extraction
curl -s https://example.com | python3 -c \
  "import sys; from bs4 import BeautifulSoup; \
print(BeautifulSoup(sys.stdin.read(), 'html.parser').title.text)"

Regex with grep is fine for a one-off field but fragile on real markup; reach for an HTML-aware parser the moment the structure gets nested. The Python pipe scales naturally into a full BeautifulSoup script once your extraction outgrows a single selector.

Loop over a list of URLs in bash

One URL is a test; a scrape is a list. A short bash loop reads URLs from a file, fetches each one, saves the response, and logs the status. Pacing between requests keeps you from hammering the server.

bash

while IFS= read -r url; do
  slug=$(echo "$url" | md5sum | cut -c1-8)
  code=$(curl -s -L -o "out_$slug.html" -w "%{http_code}" "$url")
  echo "$code  $url"
  sleep 2
done < urls.txt

This pattern covers a surprising amount of ground: a file of URLs, one fetch each, redirects followed, the status logged, output saved under a stable name derived from the URL, and a two-second pause between hits. Because the name is stable, a rerun overwrites files instead of piling up duplicates. Tune the sleep to the target's tolerance, and you have a polite crawler in six lines.
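To make a rerun pick up where it left off, the loop can skip URLs whose output already exists and discard anything that did not come back as a 200. The version below is a sketch under a few assumptions: crawl_list is a hypothetical name, cksum stands in for md5sum because it is available on any POSIX system, and a CRC is treated as unique enough for a small URL list.

```shell
# crawl_list FILE [PAUSE_SECONDS]
# Fetch every URL listed in FILE, skipping ones saved by an earlier run.
crawl_list() {
  pause=${2:-2}
  while IFS= read -r url; do
    [ -n "$url" ] || continue              # ignore blank lines
    # The first field of cksum output is a checksum of the URL
    slug=$(printf '%s' "$url" | cksum | cut -d ' ' -f 1)
    out="out_$slug.html"
    if [ -s "$out" ]; then
      echo "skip  $url"
      continue
    fi
    code=$(curl -s -L -o "$out" -w '%{http_code}' "$url")
    echo "$code  $url"
    # Keep only successful responses so failed URLs are retried next run
    [ "$code" = "200" ] || rm -f "$out"
    sleep "$pause"
  done < "$1"
}

# Usage (makes real requests):
#   crawl_list urls.txt        # default two-second pause
#   crawl_list urls.txt 5      # slower, for a touchier target
```

Interrupt it at any point and run it again: fetched pages are skipped, and blocked or failed ones get another attempt.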

Where bare cURL hits a wall

cURL is an HTTP client, not a browser. That distinction is the whole story of where it stops working for modern scraping, and it shows up in two ways.

It cannot run JavaScript. cURL fetches the raw HTML the server sends and nothing more. On a client-rendered site (most React, Vue, and Angular apps), that initial HTML is an almost-empty shell; the real content is painted in by scripts that run in a browser. cURL never runs them, so you get the skeleton and none of the data. No flag fixes this, because there is no JavaScript engine to turn on.

It gets blocked. Even on server-rendered pages, anti-bot systems inspect the request for signals such as a datacenter IP, a missing or default User-Agent, an incomplete header set, and a TLS fingerprint that does not match a real browser. Bare cURL fails most of those checks and earns a 403, a CAPTCHA, or a silent decoy page. You can paper over some of it with headers and a proxy, but on hard targets it is a losing arms race. The broader playbook is in how to scrape websites without getting blocked.

The two-part problem

A working scrape of a modern site needs two things at once: a browser that actually renders the page, and an IP the site reads as a real visitor. cURL provides neither. You can assemble a headless browser plus a rotating residential proxy pool yourself, but keeping that stack healthy is most of the work. The next section folds both into a single cURL call.

Get finished HTML with the Crawling API

The cleanest way to keep using cURL while solving rendering and blocking is to call the Crawlbase Crawling API from the command line. You pass your token and the target URL; the API fetches the page behind a trusted residential IP and returns the HTML. Add &javascript=true and it renders the page in a real browser first, so client-side content is present in what you get back.

bash

# Static fetch through the Crawling API
curl "https://api.crawlbase.com/?token=YOUR_TOKEN&url=https://example.com"

# JavaScript-rendered fetch for client-side pages
curl "https://api.crawlbase.com/?token=YOUR_JS_TOKEN&javascript=true&url=https://example.com"

The URL you want to scrape goes in the url parameter and should be URL-encoded if it contains its own query string. Use your normal token for static pages and the JavaScript token with javascript=true for rendered ones. Because the response is just HTML on stdout, every piping and parsing trick from earlier still applies: redirect it into a file, pipe it into htmlq, or hand it to a Python parser. You keep the cURL workflow and offload the rendering and IP problems.
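Encoding the target URL by hand is error-prone, so one option is to let cURL do it. With -G, each --data-urlencode pair is percent-encoded and appended to the request URL as a query string instead of being sent as a POST body. The sketch below reuses the endpoint and parameters shown above; crawlbase_fetch is a hypothetical wrapper name, and the third argument is just this sketch's way of switching rendering on.

```shell
# crawlbase_fetch TOKEN TARGET_URL [javascript]
# Call the Crawling API with the target URL percent-encoded by curl itself.
# -G turns the --data-urlencode pairs into a query string on a GET request.
crawlbase_fetch() {
  if [ "${3:-}" = "javascript" ]; then
    curl -sS -G "https://api.crawlbase.com/" \
      --data-urlencode "token=$1" \
      --data-urlencode "javascript=true" \
      --data-urlencode "url=$2"
  else
    curl -sS -G "https://api.crawlbase.com/" \
      --data-urlencode "token=$1" \
      --data-urlencode "url=$2"
  fi
}

# Usage (makes real requests; replace the placeholder tokens with your own):
#   crawlbase_fetch YOUR_TOKEN "https://example.com/search?q=laptops&page=2" > results.html
#   crawlbase_fetch YOUR_JS_TOKEN "https://example.com" javascript | htmlq 'h1' --text
```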

Route cURL through the Smart AI Proxy

If you would rather keep your own request logic and only swap the network path, the Smart AI Proxy (also called the AI Proxy) gives you rotating residential IPs behind a single proxy endpoint. It is a drop-in for the -x flag you already know, so your existing cURL command barely changes.

bash

curl -x http://YOUR_TOKEN:@smartproxy.crawlbase.com:8012 -k \
  https://example.com

Here your token is the proxy username (the password is left empty), and every request through the endpoint is rotated across the residential pool automatically. The -k flag turns off TLS certificate verification, which this setup needs because the proxy intercepts the HTTPS connection. This is the right choice when your scraper is already built around cURL or a library's proxy support and you only need the IP rotation, rather than full server-side rendering. For client-rendered pages, prefer the Crawling API with javascript=true.

Recap

Key takeaways

cURL is the fastest fetch-and-debug tool. A bare curl URL prints the response; -o/-O save it, -L follows redirects, and -s/-v control noise.
Headers and cookies make you look like a browser. Set a real User-Agent with -A or -H, and use -b/-c to send and store cookies across a session.
POST and proxies cover the rest of the basics. -d sends form or JSON data (with the right Content-Type), and -x routes through a proxy.
Pipe and loop to scale. Send cURL output into grep, htmlq, or Python, and wrap a bash loop with status checks and pacing over a URL list.
Bare cURL cannot render JS and gets blocked. Call the Crawling API (add javascript=true for rendered pages) or route through the Smart AI Proxy to get finished HTML while keeping the cURL workflow.

Frequently Asked Questions (FAQs)

Can cURL be used for web scraping?

Yes. cURL makes HTTP requests and returns the raw response, which you can save to a file or pipe into a parser like grep, htmlq, or a Python script. It is excellent for static, server-rendered pages and for debugging requests. Its limits are that it cannot run JavaScript and is easily blocked by anti-bot systems, so for client-rendered or protected sites you pair it with a rendering and proxy service.

Why does cURL return an empty or partial page?

Usually one of two reasons. Either the site is client-side rendered, so the real content is painted in by JavaScript that cURL never runs and you only get the HTML shell, or the site detected a non-browser request and served a block or decoy page. Check the status code with -w "%{http_code}": a 200 with no data points to JavaScript rendering, while a 403 or 429 points to blocking.

How do I set a User-Agent in cURL?

Use the -A flag with a browser-like string, for example curl -A "Mozilla/5.0 (...)" https://example.com, or set it as a header with -H "User-Agent: ...". The default curl/x.y.z identifier is an immediate tell that the request is not a browser, so a realistic User-Agent is the first header to fix when a page blocks you.

How do I send POST data with cURL?

Use -d, which sends the data and switches the method to POST automatically. By default it sends form-encoded data (-d "a=1&b=2"). For JSON, add -H "Content-Type: application/json" and pass the body as a single-quoted JSON string so the shell does not mangle the quotes.

How do I use a proxy with cURL?

Use the -x flag: curl -x http://host:port https://example.com, adding user:pass@ before the host for an authenticated proxy. A single static proxy only changes your IP once, though; to scrape at volume you need rotation across many IPs, which a service like the Smart AI Proxy provides behind one endpoint.

How do I scrape JavaScript-heavy pages with cURL?

You cannot do it with cURL alone, because cURL has no JavaScript engine and never executes the scripts that render the content. The practical fix is to call a rendering service from cURL: send the URL to the Crawling API with javascript=true and it renders the page in a real browser, then returns the finished HTML to stdout, where your usual cURL piping and parsing still work.

Hassan Rehan

Software Engineer · Crawlbase

Software engineer at Crawlbase writing hands-on guides on rotating proxies, scraping, and the practical details of wiring proxies into real code.
