
Overview

Generic extractors fill the gaps between named scrapers. When the site you need isn't in the catalog yet — niche marketplaces, regional retailers, internal portals — these two scrapers let you describe the page yourself while the platform handles the extraction.

generic-extractor takes a CSS-selector schema (or our auto-detection) and returns the parsed values. email-extractor is purpose-built for one common task: pulling every email address from a page, whether it appears as a mailto link, plain text, or a lightly obfuscated pattern like name [at] domain.com.
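To illustrate the kind of normalisation involved, here is a minimal sketch of de-obfuscating "[at]"/"[dot]"-style addresses. The regex and function are illustrative assumptions, not the service's actual (and broader) pattern set:

```python
import re

# Hypothetical sketch only; email-extractor handles a wider range of
# obfuscation patterns than this single regex.
OBFUSCATED = re.compile(
    r"([\w.+-]+)\s*(?:@|\[at\]|\(at\))\s*"
    r"([\w-]+(?:\s*(?:\.|\[dot\]|\(dot\))\s*[\w-]+)+)",
    re.IGNORECASE,
)

def normalise_emails(text: str) -> list[str]:
    """Return de-obfuscated email addresses found in raw page text."""
    results = []
    for user, domain in OBFUSCATED.findall(text):
        # Collapse "[dot]", "(dot)", or a spaced-out "." into a plain dot.
        domain = re.sub(r"\s*(?:\[dot\]|\(dot\)|\.)\s*", ".", domain)
        results.append(f"{user}@{domain}".lower())
    return results
```

For example, `normalise_emails("jane [at] example [dot] com")` yields `["jane@example.com"]`, and plain `bob@test.org` passes through unchanged.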

Common use cases:

  • Long-tail catalog ingestion: drop a schema for a regional retailer, run nightly imports without us shipping a dedicated scraper for it.
  • Lead generation: walk a list of company websites, run email-extractor, build a contactable prospect list (subject to your jurisdiction's outbound-email rules).
  • Research pipelines: extract structured fields (titles, headings, meta) from any page for downstream NLP — useful when you need normalised input from heterogeneous sources.
  • Site monitoring: define a schema once, monitor a competitor's pricing or copy changes by diffing the parsed JSON over time.
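The diffing step in the monitoring use case can be as simple as comparing two parsed snapshots field by field. The field names below ("title", "price") are illustrative, not a fixed schema:

```python
# Minimal sketch: compare yesterday's and today's parsed JSON and
# report which fields changed and how.
def diff_snapshots(old: dict, new: dict) -> dict:
    """Return {field: (old_value, new_value)} for every changed field."""
    changed = {}
    for key in old.keys() | new.keys():
        if old.get(key) != new.get(key):
            changed[key] = (old.get(key), new.get(key))
    return changed

yesterday = {"title": "Widget Pro", "price": "$19.99"}
today = {"title": "Widget Pro", "price": "$17.99"}
# diff_snapshots(yesterday, today) -> {"price": ("$19.99", "$17.99")}
```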

Both scrapers ride the same anti-bot, residential-routing, and JS-rendering stack as the named scrapers — so the auto-detection works on JS-heavy SPAs without you wiring up a separate browser. If a target needs a dedicated parser eventually, the schema you wrote is a good handoff document for our scraper team.

Generic extractors

Two universal building blocks — one for arbitrary structured extraction, one for the always-needed task of pulling emails. Use these when there's no named scraper for the site you care about.

Example call

Below: a generic-extractor call against Stack Overflow's homepage. With no schema specified, the scraper returns auto-detected metadata — page title, language, and headings grouped by level. Pass a custom selectors object (see the full reference) to extract specific fields.

curl 'https://api.crawlbase.com/?token=YOUR_TOKEN' \
  --data-urlencode 'url=https://stackoverflow.com/' \
  --data-urlencode 'scraper=generic-extractor' -G
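The same request can be built in Python with the standard library. This sketch only constructs the URL the curl command sends; `YOUR_TOKEN` is a placeholder for your API token:

```python
from urllib.parse import urlencode

def build_request_url(token: str, target_url: str) -> str:
    """Build the generic-extractor request URL, mirroring the curl example."""
    params = {
        "token": token,
        "url": target_url,          # url-encoded, like --data-urlencode
        "scraper": "generic-extractor",
    }
    return "https://api.crawlbase.com/?" + urlencode(params)

# To actually fetch:
# urllib.request.urlopen(build_request_url("YOUR_TOKEN", "https://stackoverflow.com/"))
```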

Sample response

{
  "url": "https://stackoverflow.com/",
  "title": "Stack Overflow - Where Developers Learn...",
  "language": "en",
  "headings": {
    "h1": ["Where developers grow together"],
    "h2": ["Hot Network Questions"]
  }
}
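A downstream consumer, such as the NLP pipeline from the use cases above, can pick fields straight out of this JSON. A small sketch that flattens the auto-detected headings into one text blob:

```python
# Flatten the headings from the sample response into one string,
# h1 entries first. The sample data mirrors the response shown above.
sample = {
    "url": "https://stackoverflow.com/",
    "title": "Stack Overflow - Where Developers Learn...",
    "language": "en",
    "headings": {
        "h1": ["Where developers grow together"],
        "h2": ["Hot Network Questions"],
    },
}

def flatten_headings(doc: dict) -> str:
    """Join all heading text, lowest level first, newline-separated."""
    lines = []
    for level in sorted(doc["headings"]):   # "h1" sorts before "h2"
        lines.extend(doc["headings"][level])
    return "\n".join(lines)
```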

Full reference (parameters, all 4 SDK languages, edge cases): Generic Extractor — full reference