A data pipeline architecture is the set of components and the order in which they run that moves data from where it is produced to where it gets used. Get the shape right and analysts query fresh, trustworthy tables without thinking about how the data got there; get it wrong and you spend your week chasing missing rows, malformed fields, and jobs that silently stopped firing. This guide is a conceptual walk through that architecture for engineers: the canonical stages, batch versus streaming, how the whole thing is orchestrated, how you keep it observable, and where web-scraped data fits in.

The framing here is deliberately practical. Most pipeline diagrams look tidy on a whiteboard and fall apart the first time an upstream source changes its schema or a third-party site blocks your collector. So alongside the clean model, this covers the parts that actually break, and treats the ingest layer, the place external data enters your system, as a first-class concern rather than an afterthought.

What a data pipeline architecture actually is

At its core, a data pipeline is a directed flow: data enters, passes through a series of transformations, and lands somewhere it can be read. The architecture is the contract around that flow: which sources feed it, what each stage guarantees, how failures are handled, and how the whole thing is scheduled and watched. It is the difference between a one-off script and a system you can rely on at 3 a.m.

The value of treating it as an architecture rather than glue code is consolidation and uniformity. A real pipeline pulls data from many sources, including databases, APIs, event streams, and scraped web pages, and reshapes all of it into one consistent format in one place. That single funnel is what lets a team query across sources without reconciling five different shapes by hand, and it is what reduces the friction between raw data arriving and an insight coming out the other end.

The stages of a data pipeline

Almost every pipeline, regardless of tooling, follows the same canonical stages. Naming varies between teams, but the sequence rarely does; the main exception is ELT, where the load into storage happens before the transform:

  • Ingest / collect. Data enters the pipeline from its sources: operational databases, third-party APIs, event streams, files, and the web. This is where raw records first land, often in a staging area before anything touches them.
  • Process / transform. Raw data is cleaned, standardized, validated, deduplicated, joined across sources, and reshaped into the schema downstream consumers expect. Units, dates, and categories get normalized here, and corrupt or invalid records are corrected or dropped.
  • Store. The transformed data is written to a durable destination, typically a data warehouse, a data lake, or both. This is the system of record that everything downstream reads from.
  • Serve / analyze. The stored data is exposed to its consumers: BI dashboards, ad-hoc SQL, machine-learning training jobs, reverse-ETL back into operational tools, or an API. This is the stage that justifies all the others.

Two cross-cutting concerns wrap every stage rather than sitting between them. Orchestration decides when each step runs and in what order, and monitoring watches that each step did what it claimed. Neither is a stage you pass through once; both run for the lifetime of the pipeline. We come back to each below.

Figure: The four stages, in order. Data moves left to right through ingest, process, store, and serve, while orchestration and monitoring span all of them for the life of the pipeline.
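The four stages can be sketched as four plain functions chained together. This is only an illustration of the shape: the `Listing` type, the sample records, and the in-memory dict standing in for a warehouse are all assumptions made for the example, not part of any real pipeline.

```python
from dataclasses import dataclass


@dataclass
class Listing:
    sku: str
    price: float


def ingest() -> list:
    # Stand-in for real sources (databases, APIs, scraped pages).
    return [
        {"sku": "A1", "price": " 20.00 "},
        {"sku": "A1", "price": " 20.00 "},       # duplicate record
        {"sku": "B2", "price": "not-a-number"},  # invalid record
        {"sku": "C3", "price": "5"},
    ]


def transform(raw: list) -> list:
    # Clean, validate, and deduplicate into the downstream schema.
    seen, clean = set(), []
    for rec in raw:
        try:
            price = float(rec["price"].strip())
        except ValueError:
            continue  # drop records that fail validation
        if rec["sku"] in seen:
            continue  # drop duplicates
        seen.add(rec["sku"])
        clean.append(Listing(rec["sku"], price))
    return clean


def store(rows: list, warehouse: dict) -> None:
    # Hypothetical warehouse: a dict keyed by SKU.
    for row in rows:
        warehouse[row.sku] = row


def serve(warehouse: dict) -> float:
    # One possible consumer: an average-price metric.
    return sum(r.price for r in warehouse.values()) / len(warehouse)


warehouse = {}
store(transform(ingest()), warehouse)
average_price = serve(warehouse)
```

Each function only sees the output of the one before it, which is what lets you swap a stage's implementation without touching its neighbors.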

ETL vs ELT: where the transform happens

The classic model is ETL: Extract, Transform, Load. You pull data out of sources, reshape it in a dedicated processing layer, and load the finished result into the warehouse. It keeps the warehouse clean but means the transform logic lives outside it.

The modern default has flipped to ELT: Extract, Load, Transform. You land raw data in a cloud warehouse first, then transform it in place with SQL. Storage is cheap enough that keeping raw data around pays for itself, because you can re-derive any table when requirements change instead of re-collecting from source. For scraped data this matters a lot: re-running a transform costs only warehouse compute, but re-crawling a site you no longer have access to is impossible. Keep the raw HTML or JSON you collected, and ELT lets you fix a parsing bug months later without touching the source again.
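A minimal sketch of the ELT order, using an in-memory SQLite database as a stand-in for a cloud warehouse. The table names, columns, and sample rows are assumptions made for illustration. Raw values are loaded untouched as text, and the clean table is derived from them in SQL, so it can be dropped and rebuilt whenever the transform logic changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Load: land records exactly as collected, every field kept as raw text.
conn.execute(
    "CREATE TABLE raw_prices (url TEXT, price_text TEXT, fetched_at TEXT)"
)
conn.executemany(
    "INSERT INTO raw_prices VALUES (?, ?, ?)",
    [
        ("https://example.com/a", "$19.50", "2026-01-01"),
        ("https://example.com/b", " $5.00 ", "2026-01-01"),
        ("https://example.com/c", "call for price", "2026-01-01"),
    ],
)


def rebuild_prices(conn):
    # Transform: derive the clean table in place, from the raw table only.
    conn.execute("DROP TABLE IF EXISTS prices")
    conn.execute(
        """
        CREATE TABLE prices AS
        SELECT url,
               CAST(REPLACE(TRIM(price_text), '$', '') AS REAL) AS price,
               fetched_at
        FROM raw_prices
        WHERE TRIM(price_text) GLOB '$[0-9]*'
        """
    )


rebuild_prices(conn)
rows = conn.execute("SELECT url, price FROM prices ORDER BY url").fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_prices").fetchone()[0]
```

The unparseable row is excluded from `prices` but still sits in `raw_prices`, so a later version of the transform could handle it without another crawl.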

Batch vs streaming

The single biggest architectural fork is how often data moves. Batch pipelines collect data over a window (an hour, a day, a fixed run) and process it as a group. They are simpler to reason about, cheaper to operate, easy to reprocess, and correct for the large majority of analytics work. If a daily sales rollup is the goal, batch is almost always the right answer.

Streaming pipelines process records continuously, event by event, as they arrive, usually through a log like Kafka or a managed equivalent. You reach for streaming when freshness is the product: fraud detection, live pricing, real-time competitor monitoring, anything where a one-hour-old answer is a wrong answer. The cost is real, though, because streaming systems are harder to test, harder to reprocess, and demand thinking about late-arriving and out-of-order events from day one.

Many mature setups run a hybrid: a streaming path for the few metrics that need to be live, and a batch path for everything else, often keeping raw data so new questions can be answered later without re-collection. Pick the simplest model that meets the actual freshness requirement, and resist streaming a number nobody reads until tomorrow.
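The difference between the two models can be sketched with the same hourly rollup computed both ways. The event tuples and the `StreamingRollup` class are hypothetical, and a real streaming system would sit on a log such as Kafka rather than a Python list; the point is only what each model sees and when.

```python
from collections import defaultdict

# (event_time_hour, amount). The last event belongs to hour 9 but
# arrives after the hour-10 events: late and out of order.
events = [(9, 10.0), (10, 5.0), (10, 7.0), (9, 2.5)]


def batch_rollup(window):
    # Batch: the whole window is available, so ordering does not matter.
    totals = defaultdict(float)
    for hour, amount in window:
        totals[hour] += amount
    return dict(totals)


class StreamingRollup:
    # Streaming: state is updated one event at a time, keyed by event
    # time so a late event still lands in the right bucket.
    def __init__(self):
        self.totals = defaultdict(float)

    def on_event(self, hour, amount):
        self.totals[hour] += amount
        return self.totals[hour]  # the value a live consumer would see


batch_result = batch_rollup(events)

stream = StreamingRollup()
emitted = [stream.on_event(hour, amount) for hour, amount in events]
```

Both paths converge on the same totals, but the streaming path emitted a total of 10.0 for hour 9 before the late event corrected it to 12.5. Deciding what consumers should do with that provisional value is the kind of question batch never forces you to answer.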

Latency is a requirement, not a default

Before choosing streaming, write down the freshness the business actually needs in plain numbers. "Within five minutes" and "by tomorrow morning" lead to completely different architectures, operating costs, and on-call burdens. Most teams over-estimate how fresh their data has to be and pay for streaming complexity they never use.

Orchestration and scheduling

Once you have more than one step, something has to decide what runs, in what order, and what happens when a step fails. That is orchestration, and it is the nervous system of the pipeline. A scheduler kicks jobs off on a cadence or in response to an event; an orchestrator models the dependencies between jobs so a transform only runs after its ingest succeeds, and so a failure halts everything downstream instead of feeding garbage forward.

In practice this is a directed acyclic graph (DAG): each node is a task, each edge is a dependency, and the orchestrator walks the graph, retrying transient failures and surfacing permanent ones. Tools like Airflow, Dagster, and Prefect exist for exactly this. The architectural point is independent of the tool: automate scheduling so runs are repeatable, make dependencies explicit so failures are contained, and make the whole graph idempotent so a re-run produces the same result instead of double-counting.
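To make the graph walk concrete, here is a toy orchestrator built on the standard library's `graphlib`. It is a sketch of the idea, not how Airflow, Dagster, or Prefect are implemented: the `run_dag` function, the task names, and the fixed retry count are assumptions, and unlike a real orchestrator it retries every exception rather than distinguishing transient failures from permanent ones.

```python
from graphlib import TopologicalSorter


def run_dag(tasks, deps, max_retries=2):
    """tasks: name -> callable; deps: name -> set of upstream task names."""
    status = {}
    # static_order() yields each task only after all of its upstreams.
    for name in TopologicalSorter(deps).static_order():
        if any(status.get(up) != "success" for up in deps.get(name, ())):
            status[name] = "skipped"  # contain the failure: do not run downstream
            continue
        for _attempt in range(1 + max_retries):
            try:
                tasks[name]()
                status[name] = "success"
                break
            except Exception:
                status[name] = "failed"  # retried until attempts run out
    return status


calls = {"ingest": 0}


def flaky_ingest():
    calls["ingest"] += 1
    if calls["ingest"] == 1:
        raise ConnectionError("transient network error")


def broken_transform():
    raise ValueError("upstream schema changed")


def load():
    pass


deps = {"ingest": set(), "transform": {"ingest"}, "load": {"transform"}}
status = run_dag(
    {"ingest": flaky_ingest, "transform": broken_transform, "load": load},
    deps,
)
```

In this run the ingest task recovers on its second attempt, the transform exhausts its retries, and the load is skipped rather than run against missing input.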

Here is a minimal sketch of a daily DAG that collects, transforms, and loads. It shows the orchestration shape rather than production code:

python
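from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables so the file parses; swap in real implementations.
def collect_pages():
    """Fetch raw pages and land them in staging."""

def clean_and_parse():
    """Parse staged pages into validated rows."""

def write_to_warehouse():
    """Load the validated rows into the warehouse."""

with DAG(
    dag_id='market_prices',
    schedule='@daily',  # the `schedule` argument assumes Airflow 2.4 or newer
    start_date=datetime(2026, 1, 1),
    catchup=False,  # do not backfill runs for dates before deployment
) as dag:
    ingest = PythonOperator(task_id='ingest', python_callable=collect_pages)
    transform = PythonOperator(task_id='transform', python_callable=clean_and_parse)
    load = PythonOperator(task_id='load', python_callable=write_to_warehouse)

    # Each task runs only after the one to its left has succeeded.
    ingest >> transform >> load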

The >> operators declare the dependency chain: transform waits for ingest, load waits for transform. If ingest fails, nothing downstream runs, which is exactly the behavior you want.
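Explicit dependencies contain failures, but retries are only safe if the tasks themselves are idempotent. One common way to get there for the load step is to key each row and overwrite on conflict, sketched below with SQLite standing in for the warehouse. The `daily_prices` table and its `(sku, day)` key are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE daily_prices (
        sku   TEXT,
        day   TEXT,
        price REAL,
        PRIMARY KEY (sku, day)
    )
    """
)


def load(conn, rows):
    # The primary key plus INSERT OR REPLACE means a re-run overwrites
    # the same rows instead of appending duplicates.
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "INSERT OR REPLACE INTO daily_prices (sku, day, price) "
            "VALUES (?, ?, ?)",
            rows,
        )


batch = [("A1", "2026-01-01", 20.0), ("C3", "2026-01-01", 5.0)]
load(conn, batch)
load(conn, batch)  # simulated retry of the same run

row_count, total = conn.execute(
    "SELECT COUNT(*), SUM(price) FROM daily_prices"
).fetchone()
```

After the second call the table still holds two rows and the same total, so a retried run cannot double-count.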

Monitoring and data quality

A pipeline you cannot observe is a pipeline you cannot trust. Monitoring splits into two questions that are easy to conflate. The first is operational: did the job run, when did it start and stop, what was its runtime, did it exit cleanly, and what did the errors say. This is the same discipline you apply to any production system, and without it you have no way to know the pipeline is even alive.
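The operational questions map directly onto a small wrapper around each task. The sketch below records start time, runtime, exit status, and error text into an in-memory list; the `monitored` decorator and `run_log` are hypothetical names, and in practice these records would go to your logging or metrics system instead.

```python
import time
from functools import wraps

run_log = []  # stand-in for a real metrics or logging backend


def monitored(task):
    @wraps(task)
    def wrapper(*args, **kwargs):
        record = {"task": task.__name__, "started_at": time.time()}
        started = time.monotonic()
        try:
            result = task(*args, **kwargs)
            record["status"] = "success"
            return result
        except Exception as exc:
            record["status"] = "failed"
            record["error"] = repr(exc)
            raise  # still fail the task; monitoring must not swallow errors
        finally:
            record["runtime_s"] = time.monotonic() - started
            run_log.append(record)

    return wrapper


@monitored
def ingest():
    return 3


@monitored
def transform():
    raise ValueError("bad schema")


ingest()
try:
    transform()
except ValueError:
    pass  # the failure is recorded in run_log and would normally propagate
```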

The second question is harder and more important: is the data correct? A job can exit zero and still produce garbage. Data-quality checks belong inside the pipeline as gates, not as a post-hoc dashboard. Assert row counts fall in an expected range, that key columns are non-null, that values match expected formats, and that today's volume has not silently collapsed to a tenth of yesterday's. When a check fails, the pipeline should stop and alert rather than load bad data and let it propagate into every downstream report.
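A gate of this kind can be as simple as a function that sits between transform and load and raises when an assertion fails. The sketch below checks a row-count range, a non-null key column, and a date format; the column names (`sku`, `day`), the thresholds, and the `DataQualityError` type are assumptions made for the example.

```python
import re


class DataQualityError(Exception):
    """Raised to stop the pipeline before bad data is loaded."""


ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def quality_gate(rows, min_rows, max_rows):
    problems = []
    # Row count must fall inside the expected range.
    if not (min_rows <= len(rows) <= max_rows):
        problems.append(
            f"row count {len(rows)} outside [{min_rows}, {max_rows}]"
        )
    for i, row in enumerate(rows):
        # Key column must be present and non-null.
        if row.get("sku") is None:
            problems.append(f"row {i}: sku is null")
        # Values must match the expected format.
        if not ISO_DATE.match(str(row.get("day", ""))):
            problems.append(f"row {i}: day {row.get('day')!r} is not YYYY-MM-DD")
    if problems:
        raise DataQualityError("; ".join(problems))
    return rows  # unchanged, so the gate can sit inline before the load


good_rows = [
    {"sku": "A1", "day": "2026-01-01"},
    {"sku": "C3", "day": "2026-01-01"},
]
checked = quality_gate(good_rows, min_rows=1, max_rows=10)
```

Because the gate raises rather than logging a warning, an orchestrator treats a quality failure like any other task failure and skips the load.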

For scraped sources, data-quality monitoring doubles as collection monitoring. A sudden drop in parsed records usually means the source site changed its markup or started blocking you, not that the world ran out of data. Treating a volume drop as a first-class alert turns a silent failure into an actionable one.
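One way to turn a volume drop into an alert is to compare today's parsed-record count against a baseline from recent runs. The sketch below uses the median of recent counts and a 50% threshold; both choices, along with the function name, are assumptions you would tune per source.

```python
from statistics import median


def volume_collapsed(today_count, recent_counts, min_ratio=0.5):
    """Return True when today's volume is suspiciously low.

    recent_counts holds parsed-record counts from previous runs.
    """
    if not recent_counts:
        return False  # no history yet, so there is nothing to compare against
    baseline = median(recent_counts)  # robust to a single odd day
    return today_count < baseline * min_ratio


history = [1000, 980, 1020]
blocked_day = volume_collapsed(95, history)   # parsed records fell by ~90%
normal_day = volume_collapsed(990, history)   # within normal variation
```

When the check returns True for a scraped source, the first things to inspect are the collector's response codes and a sample of the raw pages, since a markup change and a block look identical from the row count alone.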

Where web-scraped data enters the pipeline

Web data is one of the richest external sources you can feed a pipeline, including prices, listings, reviews, and public market signals, but it is also the most operationally hostile. Internal databases and partner APIs hand you clean, stable structures. The open web hands you rendered HTML behind anti-bot defenses that change without notice. That hostility lives entirely in the ingest stage, so the reliability of your whole pipeline often comes down to how robust your collection layer is.

Trying to build that layer yourself means running headless browsers to render JavaScript-heavy pages, maintaining a pool of residential proxies so you are not blocked on the first request, solving CAPTCHAs, and keeping all of it healthy as targets evolve. That is a standing system to operate, and it has nothing to do with your actual transforms. The pragmatic move is to treat collection as a managed service so the ingest stage hands your pipeline clean data and you spend your engineering time downstream. For the general playbook on staying collectable, the guide on how to scrape websites without getting blocked covers the failure modes in depth.

This is where Crawlbase fits as the ingest layer. The Crawling API takes a URL plus an optional JavaScript token, renders the page in a real browser behind a rotating residential IP, and returns the finished HTML or parsed JSON, so a client-side-rendered store or marketplace comes back fully populated in a single call. It can also return structured fields for common page types, so you can skip writing parsers. For raw HTTP routing you control directly, the Smart AI Proxy exposes the same rotating-IP backbone as a standard proxy endpoint.

A minimal ingest task using the Crawling API is a single call that returns rendered HTML ready for your transform step:

python
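from crawlbase import CrawlingAPI

# Use the JavaScript token so the page is rendered in a browser before it is returned.
api = CrawlingAPI({'token': 'YOUR_CRAWLBASE_JS_TOKEN'})

def collect_page(url):
    # ajax_wait and page_wait give client-side content time to load; page_wait is in milliseconds.
    response = api.get(url, {'ajax_wait': True, 'page_wait': 5000})
    if response['status_code'] == 200:
        return response['body']  # rendered HTML, ready to parse
    # Raise so the orchestrator marks the task failed and can retry it.
    raise RuntimeError(f'collect failed: {response["status_code"]}')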

Scaling ingest with an async crawler

A synchronous call per URL is fine for hundreds of pages. Once you are collecting tens or hundreds of thousands of URLs on a schedule, blocking your DAG on each request stops making sense. This is the threshold where you move from a synchronous API to an asynchronous one.

The Crawler is built for this scale. You push large batches of URLs into a queue and the service crawls them asynchronously in the background, then delivers each result to a callback (webhook) endpoint you control as it completes, rather than making you hold a connection open per page. Your ingest stage becomes "enqueue the URLs and move on," and a separate handler writes results into staging as the callbacks land. That decoupling is exactly the batch model applied to collection, and it keeps a massive crawl from becoming a single brittle, long-running job. For org-scale collection with dedicated throughput and support, the enterprise tier extends the same model.
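The "separate handler" half of that pattern can be sketched as a function that writes each delivered result into staging. Everything here is an assumption made for illustration: the payload shape (a request id plus a raw body), the `stage_response` name, and the local directory standing in for a staging store. Check the Crawler documentation for the actual callback format before wiring this to a real endpoint.

```python
import hashlib
import tempfile
from pathlib import Path


def stage_response(staging_dir: Path, request_id: str, body: bytes) -> Path:
    """Write one callback result to staging, untouched and exactly once."""
    # Hash the id so arbitrary request ids become safe file names.
    safe_name = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    target = staging_dir / f"{safe_name}.raw"
    if not target.exists():  # duplicate deliveries become no-ops
        tmp = target.with_suffix(".tmp")
        tmp.write_bytes(body)  # no parsing here: collection does not parse
        tmp.replace(target)    # rename so readers never see a partial file
    return target


with tempfile.TemporaryDirectory() as tmp_dir:
    staging = Path(tmp_dir)
    first = stage_response(staging, "req-123", b"<html>v1</html>")
    second = stage_response(staging, "req-123", b"<html>v1</html>")  # redelivery
    staged_files = sorted(p.name for p in staging.iterdir())
    staged_body = first.read_bytes()
```

Keeping the handler this thin matters: it only has to succeed at writing bytes, so a parsing bug can never cause a delivered page to be lost.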

For staging, Crawlbase Storage can hold crawled responses so collection and parsing stay decoupled: the crawler writes raw responses to storage, and your transform step reads from there on its own schedule. That separation is the ELT pattern again, with raw data preserved so you can reparse later without re-crawling. The economics matter here too, since collection is usually the most expensive stage of a web-data pipeline. For ways to keep that cost in check, see the guide on ecommerce web scraping, which works through a high-volume collection scenario end to end.

A reference architecture for web-data pipelines

Putting the pieces together, a robust pipeline for web-sourced data tends to look like this. An orchestrator runs on a schedule and enqueues target URLs to the async crawler. The crawler collects asynchronously and writes raw responses to a staging store, untouched. A transform step reads raw responses, parses them into structured rows, applies data-quality gates, and loads the clean output into the warehouse. Serving tools then read from the warehouse, never from the collector.

The discipline that makes this hold up is keeping each concern in its own stage. Collection does not parse; transform does not collect; serving does not touch raw data. When a target site changes its markup, you fix the transform and reprocess from staging, with no re-crawl. When you need a faster cadence, you change the schedule, not the code. And because raw data is preserved, a parsing bug discovered in month six is a reprocess, not a data-loss event. If proxies and IP rotation are new territory, the primer on what a proxy server is covers the layer underneath your ingest stage.
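The "fix the transform and reprocess from staging" step can be sketched as follows. The staged HTML snippets, the two parser versions, and the regular expressions are all hypothetical; a real transform would use a proper HTML parser, but the reprocessing shape is the same.

```python
import re

# Raw responses as they were collected, keyed by URL. The second page
# uses newer markup that the original parser did not anticipate.
staging = {
    "https://example.com/a": '<span class="price">$20.00</span>',
    "https://example.com/b": '<div data-price="5.00"></div>',
}


def parse_v1(html):
    # Original parser: only understands the old markup.
    match = re.search(r'class="price">\$([0-9.]+)<', html)
    return float(match.group(1)) if match else None


def parse_v2(html):
    # Fixed parser: handles the new markup, falls back to the old one.
    match = re.search(r'data-price="([0-9.]+)"', html)
    return float(match.group(1)) if match else parse_v1(html)


def reprocess(staging, parser):
    # Reads only from staging; no request ever goes back to the source.
    return {url: parser(html) for url, html in staging.items()}


before_fix = reprocess(staging, parse_v1)
after_fix = reprocess(staging, parse_v2)
```

The missing price in `before_fix` is recovered in `after_fix` purely by rerunning the transform over the staged responses.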

Key takeaways

  • The stages are universal. Ingest, process, store, and serve, in that order, with orchestration and monitoring wrapping all of them. The tools change; the sequence does not.
  • Choose batch unless freshness is the product. Streaming is powerful and expensive; write down the latency the business actually needs before reaching for it.
  • Prefer ELT and keep raw data. Landing raw data first lets you re-derive tables when requirements change, which is critical when re-collecting from source is costly or impossible.
  • Orchestrate with explicit dependencies. A DAG with idempotent, retryable tasks contains failures instead of feeding garbage downstream.
  • Monitor data quality, not just job status. A job can exit clean and still produce bad data; gate on row counts, nulls, and formats inside the pipeline.
  • Treat ingest as a managed concern. Web collection is the most hostile stage; using the Crawling API or async Crawler for it keeps the rest of your pipeline simple.

Frequently Asked Questions (FAQs)

What is data pipeline architecture in simple terms?

It is the set of components and the order they run in that moves data from where it is created to where it is used. The canonical flow is ingest, then process and transform, then store, then serve or analyze, with orchestration deciding when each step runs and monitoring confirming each step worked. The architecture is the contract around that flow: what each stage guarantees and how failures are handled.

What is the difference between ETL and ELT?

Both extract, load, and transform data; the difference is the order. ETL transforms data in a dedicated layer before loading the finished result into the warehouse. ELT loads raw data into the warehouse first and transforms it there with SQL. ELT is the modern default because cheap storage makes keeping raw data worthwhile: you can re-derive any table when requirements change instead of re-collecting from source.

When should I use a streaming pipeline instead of batch?

Use streaming only when freshness is the product, for example fraud detection, live pricing, or real-time monitoring where a one-hour-old answer is wrong. For the large majority of analytics, batch is simpler, cheaper, easier to reprocess, and correct. Decide by writing down the latency the business actually requires; most teams over-estimate how fresh their data needs to be.

How does web-scraped data fit into a data pipeline?

Scraped data enters at the ingest stage, the same place as databases and APIs, but it is the most operationally hostile source because the open web defends against bots and changes its markup without notice. The reliable pattern is to treat collection as a managed service that hands your pipeline clean HTML or JSON, then run normal transform, store, and serve stages on it. That keeps the web's instability contained to one stage.

How does Crawlbase work as the ingest layer?

The Crawling API takes a URL plus an optional JavaScript token, renders the page in a real browser behind a rotating residential IP, and returns finished HTML or parsed JSON in one call, so even client-side-rendered pages come back populated. For large or scheduled collection, the async Crawler queues batches of URLs, crawls them in the background, and delivers results to a callback endpoint, with Storage available to stage raw responses so collection and parsing stay decoupled.

Why do I need monitoring beyond checking that jobs ran?

Because a job can exit successfully and still produce wrong data. Operational monitoring tells you whether the job ran and when; data-quality monitoring tells you whether the output is correct. Gate the pipeline on assertions like expected row counts, non-null key columns, valid formats, and stable volume, so it stops and alerts on bad data instead of loading it into every downstream report. For scraped sources, a volume drop is often the first sign the collector got blocked.
