When you crawl at volume, the slowest part of the job is waiting. A synchronous scraper sends a request, blocks until a heavily defended page renders and returns, parses it, and only then moves to the next URL. Stack thousands of URLs behind that and you spend most of your runtime idle. The asynchronous pattern flips it: you hand a batch of URLs to a crawler, it does the slow rendering work on its own infrastructure, and it posts each finished result back to a webhook you control. This guide builds exactly that, a Flask callback server that receives results from the Crawlbase async Crawler and stores them in MySQL.

The example target is LinkedIn, but read the caution directly below first, because what you store matters more than how you store it. This walkthrough is deliberately scoped to public, non-personal data: company-page fields and public job-posting text, not individual member profiles. The real teaching value here is the architecture, async crawl plus callback server plus database, and that pattern works the same whatever public source you point it at.

Read this before you build

LinkedIn's User Agreement strongly restricts automated access, and most LinkedIn data is sensitive personal data about identifiable people. This tutorial collects only PUBLIC, non-personal fields (company name, public company description, public job-posting text), never member profiles, connections, or anything behind a login. If personal data is involved, GDPR and CCPA apply: you need a lawful basis and must honor deletion. For any real or commercial use, the sanctioned path is LinkedIn's official APIs and partner programs, not a scraper. See the full legality section near the end before you point this at anything live.

What you will build

A small asynchronous crawling system in Python with three moving parts and a shared MySQL database. Instead of one blocking script, work is split so the slow crawl happens off your machine and results arrive as they finish:

  • An async Crawler request script that pushes a list of public URLs to the Crawlbase async Crawler and records a request id (RID) for each.
  • A Flask callback server that receives each finished crawl as an HTTP POST, decompresses it, and saves the raw payload.
  • A processor that reads saved payloads on a schedule, extracts the public fields, and writes structured rows.
  • A MySQL schema with a request-tracking table and tables for the public fields you keep.

Notice what is deliberately absent from the schema: no person names, no headlines, no individual profile summaries, no connection data. We store a company's public identity and public postings, which is impersonal information a company publishes about itself.

Why asynchronous, and why a callback server

A LinkedIn company page or public job posting renders client-side and sits behind aggressive bot defenses, so a single fetch is slow and often challenged. Doing that synchronously for a long list of URLs means your script blocks on every one in turn. The async Crawler accepts your URL, returns immediately with a request id, then does the slow rendering and retrying on its own infrastructure. When it finally gets a clean response, it pushes that result to your webhook with a POST.

That push model is why you need a callback server. Your endpoint does not poll and does not wait; it simply sits ready, and each result lands when it is done, in whatever order the crawls complete. The Crawler engine sends the body gzip-compressed, so your endpoint has to decompress before it can read anything. Decoupling the request, the receipt, and the processing into three scripts is what lets the system absorb a large batch without any single step blocking the others. If you want the broader background on the engine itself, see our guide on how to extract data using the Crawlbase Crawler.
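
To make the decompression step concrete, here is a minimal standard-library sketch of reading a gzip-compressed JSON body. The function name and the sample payload are illustrative assumptions, not the Crawler's actual response format.

```python
import gzip
import json
from typing import Optional

def decode_callback_body(raw: bytes, content_encoding: Optional[str]) -> dict:
    """Decompress a callback body if it is gzip-encoded, then parse it as JSON."""
    if content_encoding == 'gzip':
        raw = gzip.decompress(raw)
    return json.loads(raw.decode('utf-8'))

# Simulate what a sender would put on the wire (hypothetical payload).
sample = {'name': 'Example Robotics', 'industry': 'Industrial Automation'}
wire_bytes = gzip.compress(json.dumps(sample).encode('utf-8'))

decoded = decode_callback_body(wire_bytes, 'gzip')
print(decoded['name'])
```

The compressed bytes are unreadable until they pass through gzip.decompress, which is why the callback route in step 4 checks the Content-Encoding header before touching the body.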

Prerequisites

A few things to have in place. None take long.

Python 3.8 or later. Confirm with python3 --version. If you do not have it, install it from python.org.

MySQL 8. A running MySQL server you can connect to locally. The official installation manual covers every platform.

A Crawlbase account and a Normal (TCP) token. Sign up, open your dashboard, and copy your token. LinkedIn is served by the Normal request Crawler, so use the TCP token here, not the JavaScript one. Treat the token like a password and keep it out of version control.

A way to expose localhost. The Crawler posts to a public URL, so during development you need a tunnel such as ngrok to reach your local Flask app.

Set up the project

Create an isolated virtual environment, then install the libraries the system needs.

bash
python3 -m venv .venv
source .venv/bin/activate

pip install Flask mysql-connector-python pyyaml requests SQLAlchemy

On Windows, activate with .venv\Scripts\activate instead of the source line. Five packages do the work: Flask is the webhook server, SQLAlchemy with mysql-connector-python handles the database, requests sends the crawl requests, and pyyaml reads your token from a settings file. Create a settings.yml alongside your scripts to hold the token and your Crawler name.

yaml
token: YOUR_CRAWLBASE_TOKEN
crawler: linkedin-public-crawler
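
If you would rather not keep the real token in a file at all, one option is to let an environment variable override the value from settings.yml. This is a sketch under assumptions: the CRAWLBASE_TOKEN and CRAWLBASE_CRAWLER variable names and the resolve_setting helper are hypothetical, not something Crawlbase defines.

```python
import os
from typing import Mapping

PLACEHOLDER = 'YOUR_CRAWLBASE_TOKEN'

def resolve_setting(name: str, file_settings: Mapping[str, str],
                    env: Mapping[str, str] = os.environ) -> str:
    """Prefer CRAWLBASE_<NAME> from the environment, then fall back to the file value."""
    value = env.get(f'CRAWLBASE_{name.upper()}') or file_settings.get(name)
    if not value or value == PLACEHOLDER:
        raise ValueError(f'Missing setting: {name}')
    return value

# file_settings stands in for the dict that yaml.safe_load would return.
file_settings = {'token': PLACEHOLDER, 'crawler': 'linkedin-public-crawler'}
fake_env = {'CRAWLBASE_TOKEN': 'secret-from-env'}

token = resolve_setting('token', file_settings, env=fake_env)
crawler = resolve_setting('crawler', file_settings, env=fake_env)
print(crawler)
```

With this approach the committed settings.yml can keep the placeholder, and the real token lives only in the shell that runs the scripts.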

Step 1: Design the MySQL schema

The schema has two jobs: track every crawl request through its lifecycle, and hold the public fields you keep. Create a user, a database, and the tables. Run these in the MySQL command-line client.

sql
CREATE USER 'linkedincrawler'@'localhost' IDENTIFIED BY 'linked1nS3cret';
CREATE DATABASE linkedin_crawler_db;
GRANT ALL PRIVILEGES ON linkedin_crawler_db.* TO 'linkedincrawler'@'localhost';
USE linkedin_crawler_db;

Now the tables. The crawl_requests table is the control table for the whole asynchronous process: every URL you push gets one row, tracked by its status as it moves through waiting, then received, then processed. The crawlbase_rid column ties a row back to the request id the Crawler returns, which is the only key you have to match an incoming callback to the request that triggered it.

sql
CREATE TABLE IF NOT EXISTS `crawl_requests` (
  `id` INT AUTO_INCREMENT PRIMARY KEY,
  `url` TEXT NOT NULL,
  `status` VARCHAR(30) NOT NULL,
  `crawlbase_rid` VARCHAR(255) NOT NULL
);

CREATE INDEX `idx_crawl_requests_status` ON `crawl_requests` (`status`);
CREATE INDEX `idx_crawl_requests_rid` ON `crawl_requests` (`crawlbase_rid`);
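
The status column is effectively a small state machine. A sketch of the three-step lifecycle described above, with a guard against out-of-order updates, might look like this; the helper name and the transition table are illustrative and are not used by the scripts later in this guide.

```python
# Each status maps to the set of statuses it is allowed to move to.
ALLOWED_TRANSITIONS = {
    'waiting': {'received'},
    'received': {'processed'},
    'processed': set(),
}

def next_status(current: str, new: str) -> str:
    """Return the new status if the move is legal, otherwise raise ValueError."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f'Illegal transition: {current} -> {new}')
    return new

status = 'waiting'
status = next_status(status, 'received')
status = next_status(status, 'processed')
print(status)
```

Making the legal moves explicit documents why the callback server only touches rows in waiting and the processor only touches rows in received.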

The destination tables hold only public, non-personal company data. One row per company page, plus a child table for the public job postings that page links to. There is no column anywhere for a person's name, title, or profile text. That is the privacy boundary made concrete in the schema itself.

sql
CREATE TABLE IF NOT EXISTS `company_pages` (
  `id` INT AUTO_INCREMENT PRIMARY KEY,
  `crawl_request_id` INT NOT NULL,
  `company_name` VARCHAR(255),
  `industry` VARCHAR(255),
  `description` TEXT,
  FOREIGN KEY (`crawl_request_id`) REFERENCES `crawl_requests`(`id`)
);

CREATE TABLE IF NOT EXISTS `company_job_postings` (
  `id` INT AUTO_INCREMENT PRIMARY KEY,
  `company_page_id` INT NOT NULL,
  `title` VARCHAR(255),
  `location` VARCHAR(255),
  `description` TEXT,
  FOREIGN KEY (`company_page_id`) REFERENCES `company_pages`(`id`)
);
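
If you want to sanity-check the parent and child relationship before touching MySQL, a quick in-memory experiment can help. The sketch below swaps in Python's built-in sqlite3 module purely for convenience, so the DDL is adapted (SQLite spells auto-increment differently) and it is not a substitute for running the MySQL statements above.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('PRAGMA foreign_keys = ON')  # SQLite leaves FK enforcement off by default
conn.executescript("""
CREATE TABLE crawl_requests (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  url TEXT NOT NULL,
  status VARCHAR(30) NOT NULL,
  crawlbase_rid VARCHAR(255) NOT NULL
);
CREATE TABLE company_pages (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  crawl_request_id INT NOT NULL,
  company_name VARCHAR(255),
  FOREIGN KEY (crawl_request_id) REFERENCES crawl_requests(id)
);
""")

cur = conn.execute(
    'INSERT INTO crawl_requests (url, status, crawlbase_rid) VALUES (?, ?, ?)',
    ('https://example.com/company/example', 'waiting', 'rid-1'),
)
conn.execute(
    'INSERT INTO company_pages (crawl_request_id, company_name) VALUES (?, ?)',
    (cur.lastrowid, 'Example Robotics'),
)

# A company page that points at a non-existent request should be refused.
try:
    conn.execute(
        'INSERT INTO company_pages (crawl_request_id, company_name) VALUES (?, ?)',
        (9999, 'Orphan'),
    )
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(orphan_rejected)
```

The foreign key means every stored company page can be traced back to the crawl request that produced it.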

Step 2: Define the ORM

Map those tables to Python classes with SQLAlchemy so the rest of the code works with objects, not raw SQL. Save this as lib/database.py. The classes mirror the schema exactly: a CrawlRequest for tracking, a CompanyPage for the public company fields, and a JobPosting child for each public posting.

python
from typing import List, Optional

from sqlalchemy import ForeignKey, create_engine
from sqlalchemy.orm import DeclarativeBase, Session, Mapped, mapped_column, relationship

class Base(DeclarativeBase):
    pass

class CrawlRequest(Base):
    """Control row: one per pushed URL, tracked by its status."""
    __tablename__ = 'crawl_requests'
    id: Mapped[int] = mapped_column(primary_key=True)
    url: Mapped[str]
    status: Mapped[str]
    crawlbase_rid: Mapped[str]
    company_page: Mapped['CompanyPage'] = relationship(back_populates='crawl_request')

class CompanyPage(Base):
    """Public company fields. Optional[...] matches the nullable columns in the schema."""
    __tablename__ = 'company_pages'
    id: Mapped[int] = mapped_column(primary_key=True)
    company_name: Mapped[Optional[str]]
    industry: Mapped[Optional[str]]
    description: Mapped[Optional[str]]
    crawl_request_id: Mapped[int] = mapped_column(ForeignKey('crawl_requests.id'))
    crawl_request: Mapped['CrawlRequest'] = relationship(back_populates='company_page')
    job_postings: Mapped[List['JobPosting']] = relationship(back_populates='company_page')

class JobPosting(Base):
    """One public job posting linked from a company page."""
    __tablename__ = 'company_job_postings'
    id: Mapped[int] = mapped_column(primary_key=True)
    title: Mapped[Optional[str]]
    location: Mapped[Optional[str]]
    description: Mapped[Optional[str]]
    company_page_id: Mapped[int] = mapped_column(ForeignKey('company_pages.id'))
    company_page: Mapped['CompanyPage'] = relationship(back_populates='job_postings')

def create_database_session():
    url = 'mysql+mysqlconnector://linkedincrawler:linked1nS3cret@localhost:3306/linkedin_crawler_db'
    # echo=True logs every SQL statement, which is useful while learning; turn it off later.
    engine = create_engine(url, echo=True)
    return Session(engine)

Every other script imports create_database_session and calls it to get a session. The connection string carries the user, password, host, and database you set up in step 1; change them here if yours differ.
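
If your credentials differ between machines, one option is to assemble the connection string from environment variables instead of editing the file. This is a sketch: the DB_USER, DB_PASSWORD, DB_HOST, DB_PORT, and DB_NAME variable names are assumptions, and the defaults are simply the values from step 1.

```python
import os
from typing import Mapping
from urllib.parse import quote

def build_mysql_url(env: Mapping[str, str] = os.environ) -> str:
    """Build the SQLAlchemy URL, falling back to the values used in step 1."""
    user = env.get('DB_USER', 'linkedincrawler')
    password = env.get('DB_PASSWORD', 'linked1nS3cret')
    host = env.get('DB_HOST', 'localhost')
    port = env.get('DB_PORT', '3306')
    name = env.get('DB_NAME', 'linkedin_crawler_db')
    # Percent-encode the credentials so characters like '@' or '/' cannot break the URL.
    return (
        f"mysql+mysqlconnector://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{name}"
    )

print(build_mysql_url(env={}))
```

The result can be passed to create_engine in place of the hard-coded string. Encoding the password matters as soon as it contains a character that is meaningful inside a URL.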

Step 3: Push URLs to the async Crawler

This script reads a list of public URLs, sends each to the async Crawler, and records the returned RID with a waiting status. The key parameters are callback=true, which tells the Crawler to POST results back instead of returning them inline, and crawler=, which names the Crawler you will create in the dashboard. Save it as crawl.py, and put your public company and job-posting URLs one per line in urls.txt.

python
import json
import sys
import urllib.parse

import requests
import yaml

from lib.database import CrawlRequest, create_database_session

# Load the token and Crawler name; the with-block closes the file promptly.
with open('settings.yml') as f:
    settings = yaml.safe_load(f) or {}
token = settings.get('token')
crawler = settings.get('crawler')

if not token or not crawler:
    sys.exit('Set your token and crawler name in settings.yml')

# One URL per line; blank lines are skipped.
with open('urls.txt') as f:
    urls = [line.strip() for line in f if line.strip()]

api = 'https://api.crawlbase.com?token={0}&callback=true&crawler={1}&url={2}&autoparse=true'
session = create_database_session()

for url in urls:
    # Percent-encode the whole target URL so it survives as a single query value.
    encoded = urllib.parse.quote(url, safe='')
    api_url = api.format(token, crawler, encoded)
    print(f'Requesting crawl for {url}')
    try:
        # The 30-second timeout is an arbitrary choice; without one a stalled call hangs the batch.
        response = requests.get(api_url, timeout=30)
        rid = json.loads(response.text)['rid']
    except requests.RequestException as exc:
        print(f'Request failed for {url}: {exc}')
        continue
    except (json.JSONDecodeError, KeyError):
        # The body was not JSON, or it was JSON without an rid (for example an error message).
        print(f'Could not read an rid from the response for {url}')
        continue
    session.add(CrawlRequest(url=url, crawlbase_rid=str(rid), status='waiting'))
    session.commit()

session.close()
print('Done pushing crawl requests.')

Each call returns a small JSON body like {"rid": 12341234}. The script stores that RID in crawl_requests with status waiting and moves straight to the next URL without blocking on the actual crawl. The autoparse=true parameter asks the Crawler to return structured fields rather than raw HTML, which is what the processor in step 6 reads. That is the whole point of the async model: pushing a hundred URLs takes seconds, and the slow work happens elsewhere.
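
The URL building and RID parsing can also be pulled into two small pure functions, which makes them easy to check without sending a real request. This is a sketch using only the standard library; the function names are hypothetical, and the parameter names are the ones already used in crawl.py.

```python
import json
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlsplit

API_BASE = 'https://api.crawlbase.com'

def build_push_url(token: str, crawler: str, target_url: str) -> str:
    """Assemble the push URL; urlencode percent-encodes every value, including the target."""
    params = {
        'token': token,
        'callback': 'true',
        'crawler': crawler,
        'url': target_url,
        'autoparse': 'true',
    }
    return f'{API_BASE}?{urlencode(params)}'

def extract_rid(response_text: str) -> Optional[str]:
    """Return the rid as a string, or None if the body is not JSON or has no rid."""
    try:
        payload = json.loads(response_text)
    except json.JSONDecodeError:
        return None
    rid = payload.get('rid') if isinstance(payload, dict) else None
    return str(rid) if rid is not None else None

target = 'https://www.linkedin.com/company/example'
push_url = build_push_url('TOKEN', 'linkedin-public-crawler', target)
print(parse_qs(urlsplit(push_url).query)['url'][0])
print(extract_rid('{"rid": 12341234}'))
```

Returning None instead of raising lets the caller decide whether to skip, retry, or log a URL whose push did not produce a RID.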

Step 4: Build the Flask callback server

This is the heart of the system. The Crawler posts each finished result to a single route. Your job is to validate the request, decompress the body, and save the payload so the processor can pick it up. The Crawler sends the RID in a header named rid, and it sends two status headers, PC-Status (the Crawlbase status) and Original-Status (the target site's status). You only keep results where both are 200. Save this as callback_server.py.

python
import gzip
import os
import zlib

from flask import Flask, request

from lib.database import CrawlRequest, create_database_session

app = Flask(__name__)
os.makedirs('./data', exist_ok=True)

def header_status(name):
    """Return the first status code in a header as an int, or None if absent or malformed."""
    value = request.headers.get(name)
    try:
        return int(value.split(',')[0]) if value else None
    except ValueError:
        return None

@app.route('/crawlbase_crawler_callback', methods=['POST'])
def crawlbase_crawler_callback():
    rid = request.headers.get('rid')
    encoding = request.headers.get('Content-Encoding')

    if rid is None:
        return ('', 204)
    if rid == 'dummyrequest':
        print('Callback server is working')
        return ('', 204)
    if header_status('PC-Status') != 200 or header_status('Original-Status') != 200:
        return ('', 204)

    # One session per request: a SQLAlchemy Session is not safe to share across threads.
    with create_database_session() as session:
        crawl_request = session.query(CrawlRequest).filter_by(crawlbase_rid=rid, status='waiting').first()
        if crawl_request is None:
            print(f'No waiting request for rid {rid}')
            return ('', 204)

        body = request.data
        if encoding == 'gzip':
            try:
                body = gzip.decompress(body)
            except (OSError, EOFError, zlib.error):
                # Body was not valid gzip (for example, already decoded upstream); keep it as-is.
                pass

        with open(f'./data/{rid}.json', 'wb') as f:
            f.write(body)

        crawl_request.status = 'received'
        session.commit()

    print(f'Received rid {rid}')
    return ('', 201)

if __name__ == '__main__':
    app.run(port=5000)

Walk through the guards, because each one matters. A missing rid means the request did not come from the Crawler, so it is dropped. A dummyrequest RID is the test ping the platform sends to confirm your endpoint is reachable; you log it and return early. The status check ignores anything that is not a clean 200 on both fronts. Then you look up the RID in crawl_requests with status waiting: if no such row exists, the callback does not correspond to a request you made, and it is ignored. Only after all of that do you decompress and save the body, then flip the row to received. The handler does no parsing: it writes the file, updates one row, and returns, which keeps it responsive even under a flood of callbacks.
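
The header guards can be expressed as a pure function, which makes each rejection path easy to exercise without running Flask. This is an illustrative sketch, not part of callback_server.py: the function names are hypothetical, and it takes a plain dict, so unlike Flask's header object the lookups here are case-sensitive.

```python
from typing import Mapping, Optional

def first_status(headers: Mapping[str, str], name: str) -> Optional[int]:
    """Parse the first comma-separated value of a status header, or return None."""
    value = headers.get(name)
    if not value:
        return None
    try:
        return int(value.split(',')[0])
    except ValueError:
        return None

def rejection_reason(headers: Mapping[str, str]) -> Optional[str]:
    """Return why a callback should be dropped, or None if it passes the header guards."""
    rid = headers.get('rid')
    if rid is None:
        return 'missing rid'
    if rid == 'dummyrequest':
        return 'test ping'
    if first_status(headers, 'PC-Status') != 200:
        return 'bad PC-Status'
    if first_status(headers, 'Original-Status') != 200:
        return 'bad Original-Status'
    return None

good = {'rid': 'abc123', 'PC-Status': '200', 'Original-Status': '200'}
print(rejection_reason(good))
print(rejection_reason({'rid': 'abc123', 'PC-Status': '200', 'Original-Status': '404'}))
```

A callback that passes these checks still has to match a waiting row in crawl_requests before anything is written to disk.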

Protect your webhook

Your callback URL is public while the tunnel is open. Harden it: accept only POST, require a secret token in a custom header or URL parameter that you verify on every request, and confirm the expected rid, PC-Status, and Original-Status headers are present. Avoid IP allowlisting, since the source addresses rotate and can change without notice.
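
A shared-secret check could be sketched as follows. The X-Webhook-Secret header name is an assumption made for illustration; how the secret reaches your endpoint (a custom header or a URL parameter) is up to you.

```python
import hmac
from typing import Mapping

SECRET_HEADER = 'X-Webhook-Secret'  # hypothetical header name, pick your own

def has_valid_secret(headers: Mapping[str, str], expected_secret: str) -> bool:
    """Compare the supplied secret to the expected one in constant time."""
    if not expected_secret:
        # Refuse to treat an unset secret as "anything goes".
        return False
    supplied = headers.get(SECRET_HEADER, '')
    return hmac.compare_digest(supplied.encode('utf-8'), expected_secret.encode('utf-8'))

print(has_valid_secret({SECRET_HEADER: 's3cr3t'}, 's3cr3t'))
print(has_valid_secret({SECRET_HEADER: 'guess'}, 's3cr3t'))
```

hmac.compare_digest is used instead of == so that the comparison time does not leak how many leading characters of a guess were correct. In the Flask route, this check would run before any of the other guards.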

Step 5: Expose the server and register the Crawler

The Crawler needs a public URL to post to. With the Flask app running on port 5000, open a tunnel.

bash
# terminal 1: start the webhook
python callback_server.py

# terminal 2: open the tunnel
ngrok http 5000

ngrok prints a public HTTPS URL. Your full callback route is that URL plus the path, for example https://your-subdomain.ngrok.io/crawlbase_crawler_callback. Confirm the endpoint is alive with a test ping before involving the Crawler at all.

bash
curl -i -X POST 'http://localhost:5000/crawlbase_crawler_callback' \
  -H 'rid: dummyrequest' \
  -H 'Content-Type: gzip/json' \
  -H 'Content-Encoding: gzip'

You should see Callback server is working in the Flask log. Now go to your Crawlbase dashboard, open the Create Crawler page, give the Crawler the same name you put in settings.yml, and paste your full ngrok callback URL. LinkedIn is served by the Normal request (TCP) Crawler, so select that type. Once saved, the Crawler knows where to push results.
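
Beyond the dummy ping, you may want to rehearse a full callback locally before the Crawler sends a real one. The sketch below builds such a request with the standard library; the payload is invented, and the server will only store it if crawl_requests already contains a waiting row with the same RID.

```python
import gzip
import json
import urllib.request

def build_test_callback(base_url: str, rid: str, payload: dict) -> urllib.request.Request:
    """Build a POST that mimics the headers the callback route checks."""
    body = gzip.compress(json.dumps(payload).encode('utf-8'))
    return urllib.request.Request(
        url=f'{base_url}/crawlbase_crawler_callback',
        data=body,
        method='POST',
        headers={
            'rid': rid,
            'PC-Status': '200',
            'Original-Status': '200',
            'Content-Encoding': 'gzip',
        },
    )

req = build_test_callback('http://localhost:5000', 'test-rid-1', {'name': 'Example Robotics'})
print(req.get_method(), req.full_url)

# To send it while callback_server.py is running, uncomment:
# urllib.request.urlopen(req)
```

The request is only built here, not sent, so the snippet is safe to run even when the server is down.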

Step 6: Process received payloads into structured rows

The callback server only saves raw payloads. A separate processor runs on a schedule, picks up everything in received status, extracts the public fields, writes the structured rows, and marks the request processed. Splitting receipt from processing means a slow database write never blocks the webhook. Save this as process.py.

python
import json
import sched
import time

from lib.database import CrawlRequest, CompanyPage, JobPosting, create_database_session

INTERVAL_SECONDS = 60
BATCH_LIMIT = 10

def process():
    # One short-lived session per pass; the with-block closes it when the pass ends.
    with create_database_session() as session:
        received = session.query(CrawlRequest).filter_by(status='received').limit(BATCH_LIMIT).all()

        if not received:
            print('No received requests to process.')
            return

        for req in received:
            try:
                with open(f'./data/{req.crawlbase_rid}.json', encoding='utf-8') as f:
                    data = json.load(f)
            except (OSError, ValueError) as exc:
                # Missing or unparseable payload: log it and move on instead of
                # crashing the scheduler. The row stays in received for inspection.
                print(f'Skipping rid {req.crawlbase_rid}: {exc}')
                continue

            # Key names follow the parsed payload; inspect a saved file and adjust if yours differ.
            page = CompanyPage(
                crawl_request_id=req.id,
                company_name=data.get('name'),
                industry=data.get('industry'),
                description=data.get('description'),
            )
            session.add(page)

            for job in data.get('jobs') or []:
                session.add(JobPosting(
                    title=job.get('title'),
                    location=job.get('location'),
                    description=job.get('description'),
                    company_page=page,
                ))

            req.status = 'processed'
            # Commit per request so one bad payload cannot roll back the rest of the batch.
            session.commit()

def process_and_reschedule():
    process()
    scheduler.enter(INTERVAL_SECONDS, 1, process_and_reschedule)

if __name__ == '__main__':
    scheduler = sched.scheduler(time.monotonic, time.sleep)
    process_and_reschedule()
    scheduler.run()

The processor reads only the impersonal company and job fields from the parsed payload: the company name, industry, public description, and each public posting's title, location, and description. It never touches any person-level field, even if one happened to be present in the payload. Keeping the extraction list this tight is the second half of the privacy boundary, after the schema. The sched loop re-runs process every 60 seconds and drains at most ten requests per pass, which keeps memory flat under a large backlog.
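
The tight extraction list can be made explicit as an allowlist, so that anything not named is dropped by construction. This is a sketch of that idea as a standalone function; the input key names match the ones the processor reads, and the employees key in the sample is a made-up stand-in for person-level data that might appear in a payload.

```python
# Map of output column -> payload key for the company, plus the job fields to keep.
COMPANY_FIELDS = {'company_name': 'name', 'industry': 'industry', 'description': 'description'}
JOB_FIELDS = ('title', 'location', 'description')

def extract_public_fields(data: dict) -> dict:
    """Copy only allowlisted company and job fields; everything else is ignored."""
    company = {column: data.get(key) for column, key in COMPANY_FIELDS.items()}
    company['job_postings'] = [
        {field: job.get(field) for field in JOB_FIELDS}
        for job in (data.get('jobs') or [])
        if isinstance(job, dict)
    ]
    return company

payload = {
    'name': 'Example Robotics',
    'industry': 'Industrial Automation',
    'description': 'We design warehouse automation systems.',
    'employees': [{'name': 'Jane Doe'}],  # hypothetical person-level data, must not survive
    'jobs': [{'title': 'Backend Engineer', 'location': 'Remote, EU',
              'description': 'Build and operate our ingestion services.',
              'poster': 'Jane Doe'}],
}

record = extract_public_fields(payload)
print(sorted(record))
```

Because the function copies named keys rather than deleting unwanted ones, a new person-level field appearing in the payload later cannot leak into the stored record.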

Run the whole pipeline

With the Crawler registered, run the three pieces, each in its own terminal with the virtual environment active. Order matters for one of them: the callback server must be up before you push requests, or early callbacks arrive with nowhere to land. The processor can start at any point, since it only picks up rows already marked received.

bash
# terminal 1: webhook (already running, plus ngrok)
python callback_server.py

# terminal 2: scheduled processor
python process.py

# terminal 3: push the batch
python crawl.py

As crawl.py runs, rows appear in crawl_requests with status waiting. Minutes later, as the Crawler finishes each page, the callback server flips them to received and writes a JSON file under ./data. On its next pass, the processor reads those files, populates company_pages and company_job_postings, and marks the requests processed. You can watch this live from the Crawler's monitoring tab in the dashboard, which shows each request's state in real time.

What the stored data looks like

After a full run, the destination tables hold clean, impersonal company records. A single processed company page looks like this when you read it back as JSON.

json
{
  "company_name": "Example Robotics",
  "industry": "Industrial Automation",
  "description": "We design warehouse automation systems.",
  "job_postings": [
    {
      "title": "Backend Engineer",
      "location": "Remote, EU",
      "description": "Build and operate our ingestion services."
    }
  ]
}

Every field there is something the company publishes about itself. There is no person, no contact, no profile. That is by design, and it is what keeps the dataset defensible.

Scaling and sending extra context

The architecture scales without structural change: a larger urls.txt means more waiting rows, the Crawler absorbs the queue, and callbacks land as crawls finish. To keep payloads matched to your own context, attach data with the callback_headers parameter when you push a request. The Crawler echoes those headers back on the callback, so you can carry, for example, a batch id without storing it in the URL.

python
import urllib.parse

batch_id = 'batch-001'  # example value; use your own identifier
raw_headers = f'BATCH-ID:{batch_id}|SOURCE:public-company-page'
encoded_headers = urllib.parse.quote(raw_headers, safe='')
# append &callback_headers={encoded_headers} to the api url

On the receiving side, read them back as ordinary request headers: request.headers.get('BATCH-ID'). For deeper coverage on keeping large runs healthy against defended targets, see our guides on how to scrape websites without getting blocked and on building a scalable web data pipeline.
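
If you attach several headers, a small helper that builds the value from a dict keeps the NAME:VALUE|NAME:VALUE format in one place. This is a sketch: the helper name is hypothetical, and rejecting ':' and '|' inside names or values is a conservative assumption on my part, since those characters act as separators in the format.

```python
from urllib.parse import quote, unquote

def encode_callback_headers(headers: dict) -> str:
    """Join headers as NAME:VALUE pairs separated by '|', then percent-encode the result."""
    for name, value in headers.items():
        if any(sep in f'{name}{value}' for sep in (':', '|')):
            raise ValueError(f'Separator character in header {name!r}')
    raw = '|'.join(f'{name}:{value}' for name, value in headers.items())
    return quote(raw, safe='')

encoded = encode_callback_headers({'BATCH-ID': '42', 'SOURCE': 'public-company-page'})
print(encoded)
print(unquote(encoded))
```

The encoded string is what goes after callback_headers= in the push URL.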

Legality and compliance

This is the section to settle before you write production code, not after. LinkedIn's User Agreement and its Prohibited Software and Extensions policy expressly forbid scraping and automated data collection, and LinkedIn enforces those terms. That position holds regardless of how careful your tooling is. The code in this guide makes the technical part work; it does not make scraping LinkedIn compliant with LinkedIn's terms. Read the User Agreement and LinkedIn's robots.txt, and treat both as the boundary for what you do.

The data dimension is just as important. Most LinkedIn content is personal data about identifiable people: names, job histories, headlines, connections, and posts. Under the GDPR in Europe and the CCPA in California, processing personal data needs a lawful basis, and people have rights, including the right to have their data deleted. There is real case law here too: in hiQ Labs v. LinkedIn, US courts examined scraping of public profiles under the Computer Fraud and Abuse Act, but that litigation was narrow, jurisdiction-specific, and did not bless scraping in general or override LinkedIn's contract terms or data-protection law. Legality turns on the data, the method, the jurisdiction, and the agreements you are bound by, so treat blanket claims that "public means fair game" with suspicion.

That is why this tutorial is scoped the way it is. It stores only public, non-personal company information: company names, industries, public descriptions, and public job-posting text that a company publishes about itself. It never builds profiles of individuals, never touches anything behind a login, and never collects member personal data. For any real or commercial need, the correct path is LinkedIn's official APIs and partner programs, which provide sanctioned, structured access within LinkedIn's terms. If your project needs member-level data, that route, or a formal data agreement, is the answer, not a scraper. When in doubt about your specific use, get advice from a qualified lawyer. For more on the public-data approach generally, see our overview of how to scrape LinkedIn.

Recap

Key takeaways

  • Async beats synchronous at volume. Pushing URLs to the Crawler returns a RID in seconds; the slow rendering happens off your machine and results arrive as they finish.
  • The callback server is a thin, guarded receiver. Validate the RID and both status headers, decompress the gzip body, save it, and return immediately so the webhook never blocks.
  • Track state in MySQL. The crawl_requests table walks each request through waiting, received, and processed, which is how receipt and processing stay decoupled.
  • Store public, non-personal data only. The schema and the processor both keep to company-page and public job-posting fields, never member profiles or personal data.
  • Prefer the official path for anything real. LinkedIn's terms restrict scraping and most of its data is personal; use LinkedIn's official APIs and partner programs, and respect GDPR and CCPA.

Frequently Asked Questions (FAQs)

Why use an asynchronous crawler instead of a synchronous script?

A synchronous script blocks on every URL while a defended page renders and returns, so a long list runs mostly idle. The async Crawler accepts your URL, returns a request id immediately, and does the slow rendering and retrying on its own infrastructure, then posts the finished result to your webhook. Pushing a large batch takes seconds, and results stream back as they complete rather than one slow request at a time.

What does the Flask callback server actually do?

It exposes one POST route that the Crawler calls with each finished result. The handler reads the rid header, checks that both PC-Status and Original-Status are 200, confirms the RID matches a request still in waiting status, decompresses the gzip body, saves the payload to disk, and flips the request to received. It returns immediately and never blocks, so it stays responsive even under a burst of callbacks.

Why split receiving and processing into two scripts?

So a slow database write never holds up the webhook. The callback server's only job is to receive and save quickly. A separate scheduled processor reads saved payloads in small batches, extracts the public fields, writes the structured rows, and marks each request processed. Decoupling the two lets the system absorb a large volume of callbacks without backpressure on either side.

Do I need the JavaScript token or the Normal token?

The Normal request (TCP) token. LinkedIn is served by the Normal request Crawler, so you select that Crawler type in the dashboard and use your TCP token in settings.yml. The async Crawler still handles rotation and retries behind the scenes; the token type just tells it which request path to use for the target.

How do I keep the webhook secure?

Accept only POST requests, require a secret token in a custom header or URL parameter that you verify on every call, and confirm the expected rid, PC-Status, and Original-Status headers are present before you trust a payload. Avoid IP allowlisting, since the source addresses rotate and can change without notice. The status and RID checks in the example are a starting point, not the whole story.

Is it safe to store LinkedIn data this way?

Only if you keep to public, non-personal data, as this tutorial does: company names, industries, public descriptions, and public job-posting text. Storing member profiles, names, connections, or other personal data brings LinkedIn's User Agreement and laws like the GDPR and CCPA into play, and is out of scope here. For member-level or commercial use, use LinkedIn's official APIs and partner programs rather than a scraper.
