Generic Extractor
A site-agnostic extractor that pulls links, images, headings, and main content from any web page.
API usage
Add &scraper=generic-extractor to a Crawling API request. URL-encode the target URL in the url parameter.
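As a rough illustration of the encoding step, the request URL can be assembled with Python's standard library. This is a minimal sketch, not how the client libraries build requests internally: the helper name build_request_url is hypothetical, and the endpoint and parameter names are the ones used in this section's curl request.

```python
from urllib.parse import urlencode

# Endpoint as it appears in this section's curl request.
API_ENDPOINT = "https://api.crawlbase.com/"


def build_request_url(token: str, target_url: str) -> str:
    """Hypothetical helper: build a Crawling API URL that selects the generic extractor."""
    params = {
        "token": token,
        "url": target_url,  # urlencode percent-encodes this value
        "scraper": "generic-extractor",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


request_url = build_request_url("YOUR_TOKEN", "https://stackoverflow.com/")
print(request_url)
```

The target URL ends up percent-encoded inside the url parameter, which is what curl's --data-urlencode flag does for the same request.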
curl

curl 'https://api.crawlbase.com/?token=YOUR_TOKEN' \
  --data-urlencode 'url=https://stackoverflow.com/' \
  --data-urlencode 'scraper=generic-extractor' -G

Python

import json

from crawlbase import CrawlingAPI

api = CrawlingAPI({'token': 'YOUR_TOKEN'})
res = api.get(
    'https://stackoverflow.com/',
    {'scraper': 'generic-extractor'}
)
# The body is a JSON document; parse it into a dict.
data = json.loads(res['body'])

Node.js

const { CrawlingAPI } = require('crawlbase');

const api = new CrawlingAPI({ token: 'YOUR_TOKEN' });

// await needs an async context when the file is loaded with require().
async function main() {
  const res = await api.get(
    'https://stackoverflow.com/',
    { scraper: 'generic-extractor' }
  );
  const data = JSON.parse(res.body);
  return data;
}

main();

Ruby

require 'json'
require 'crawlbase'

api = Crawlbase::API.new(token: 'YOUR_TOKEN')
res = api.get('https://stackoverflow.com/', scraper: 'generic-extractor')
# The body is a JSON document; parse it into a Hash.
data = JSON.parse(res.body)

Example input URL
The URL passed in the url parameter (URL-decoded for readability):
https://stackoverflow.com/
Response shape
The response body is JSON. A field may be null when the source page omits the value.
Final URL.
Page title tag.
Meta description.
Canonical link.
Detected language.
h1/h2/h3 arrays of heading text.
Outbound links with href, text, rel.
Image URLs with alt text.
Extracted readable body text.
Sample response
{
  "url": "https://stackoverflow.com/",
  "title": "Stack Overflow - Where Developers Learn...",
  "language": "en",
  "headings": {
    "h1": ["Where developers grow together"],
    "h2": ["Hot Network Questions"]
  }
}
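Because a field may be null or missing, it can help to read the parsed body defensively. The sketch below works on a trimmed copy of the sample response with title set to null to mimic a page that omits it. The helper name heading_texts is hypothetical, and only field names that appear in the sample are used.

```python
import json

# Trimmed copy of the sample response, with "title" set to null
# to mimic a page that omits the value.
body = """
{
  "url": "https://stackoverflow.com/",
  "title": null,
  "language": "en",
  "headings": {
    "h1": ["Where developers grow together"],
    "h2": ["Hot Network Questions"]
  }
}
"""


def heading_texts(data: dict, level: str) -> list:
    """Hypothetical helper: return heading text for a level, or [] if absent or null."""
    headings = data.get("headings") or {}
    return headings.get(level) or []


data = json.loads(body)

# Fall back to a placeholder when the field is null or missing.
title = data.get("title") or "(no title)"

print(title)
print(heading_texts(data, "h1"))
print(heading_texts(data, "h3"))
```

Using `or` after `.get()` covers both a missing key and an explicit null, so callers do not need to distinguish the two cases.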
