Automate Real Estate Data Extraction

Public property listings are some of the most useful data on the open web. Price, beds, baths, square footage, and address sit right on every results page, and tracking how those numbers move over time tells you where a market is heating up, where rents are softening, and which listings are mispriced. The catch is that a single snapshot is rarely enough. Real estate data is only valuable when it is fresh, which means you have to collect it again and again, on a schedule, without sitting at a terminal to babysit each run.

This guide shows you how to automate real estate data extraction with JavaScript and Node.js. You will build a runnable workflow that pulls public listings through the Crawling API, parses price, beds, baths, sqft, address, and link for each property, and then adds three layers on top: a scheduled cron run for steady collection, the async Crawler with a webhook for high volume, and a simple timestamped store for the results. If you only need a one-off scrape of a single site, the per-site guides linked below are a better fit; this one is about running the job on repeat. Everything here stays scoped to public listing data, and the legality section near the end is not boilerplate, so read it before you point this at any real volume.

What you will build

A Node.js workflow that takes a public property-search URL, retrieves the rendered HTML through the Crawling API, extracts a structured record for every listing on the page, and runs on a schedule. For each property we pull these fields:

Price: the listed price as shown on the card, like "$2,400/mo" or "$525,000".
Beds: the number of bedrooms.
Baths: the number of bathrooms.
Sqft: the floor area in square feet.
Address: the street address shown on the listing.
Link: the URL to the individual listing page.

On top of the parser, you wire three layers of automation: a scheduled run, an async high-volume path, and a JSON store with a timestamp on every batch so you can diff one collection against the next.

Why a plain request fails on real estate sites

If you request a property-search URL with a bare HTTP client, you rarely get the listing grid back. Two things work against you. First, most modern real estate portals render their results in the browser with JavaScript, so the initial HTML is a near-empty shell until the page's scripts run. Second, these sites flag automated traffic aggressively: datacenter IPs and request patterns that do not look like a real browser get challenged with a CAPTCHA, rate-limited, or blocked before they reach the rendered listings.

So a working real estate scraper needs two things in one request: a browser that actually renders the page, and an IP the platform reads as a real visitor. You can assemble that yourself with a headless browser plus a pool of rotating residential proxies, but stitching those together and keeping them healthy is most of the work, and it gets worse once you run on a schedule and the volume climbs. The Crawling API folds both into a single call: you send it the URL, it renders the page behind a trusted IP, and it returns finished HTML for you to parse with cheerio.

Render budget

JavaScript-heavy listing pages need the JavaScript token so the API runs a real browser before returning HTML. Crawlbase gives you up to 20,000 free requests to start, you pay only for successful requests, and a normal request and a JavaScript request draw different credit amounts. Start on the free tier and confirm the page renders before you scale up.

Prerequisites

You need a few things in place before writing any code. None of them take long.

Basic JavaScript and Node.js. You should be comfortable writing and running a Node script and installing packages with npm. If you can read a function and follow what it returns, you know enough.

Node.js 16 or later. Confirm your version with node --version. If you do not have it, install it from the Node.js website or through a version manager like nvm.

A Crawlbase account and token. Sign up, open your dashboard, and copy your token. The free tier gives you up to 20,000 requests with no card. Treat the token like a password: it authenticates your requests, so keep it out of version control.

Set up the project

Create a project folder, initialize it, and install the libraries the workflow needs.

bash

node --version

mkdir real-estate-automation && cd real-estate-automation
npm init -y

npm install crawlbase cheerio node-cron express

Four dependencies do the work: crawlbase is the official Node client for the Crawling API and the async Crawler, cheerio parses the returned HTML with a jQuery-style API so you can pull out fields by CSS selector, node-cron runs the scrape on a schedule, and express receives the webhook the async Crawler posts back to. Create a file named scraper.js in this folder and add the code from the steps below.

Step 1: Fetch a rendered listings page

Start by getting the finished page. Import the CrawlingAPI class, initialize it with your token, and request a public search URL. Because the page is JavaScript-rendered, pass { ajax_wait: true, page_wait: 3000 } so the API waits for the listing cards to load before it returns. Checking the status code before you parse keeps failures loud instead of silent.

javascript

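const { CrawlingAPI } = require('crawlbase');

// Read the token from the environment so it stays out of version control.
const api = new CrawlingAPI({ token: process.env.CRAWLBASE_TOKEN || 'YOUR_CRAWLBASE_TOKEN' });

// Public search URL to track; replace it with the search you want to follow.
const listingsURL = 'https://www.example-realty.com/homes-for-rent/ca/los-angeles';

api
  // Wait for the JavaScript-rendered listing cards before returning HTML.
  .get(listingsURL, { ajax_wait: true, page_wait: 3000 })
  .then((response) => {
    if (response.statusCode === 200) {
      // Print the first 500 characters to confirm real listing markup came back.
      console.log(response.body.slice(0, 500));
    } else {
      // Log non-200 responses so a failed fetch is never silent.
      console.error(`Request failed with status ${response.statusCode}`);
    }
  })
  .catch((error) => console.error('API request error:', error));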

Run the script with node scraper.js and you should see real listing markup at the top of the body, not a stripped-down shell. That confirms rendering works before you write a single selector. Swap listingsURL for whichever public search you want to track: a city, a neighborhood, or a price band. The URL the site shows when you filter a search in the browser is the one you crawl.
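
If you would rather check this programmatically than by eye, a small string test can tell a rendered page from an empty shell. The sketch below is one possible sanity check and not part of the Crawlbase client: countMarker and looksRendered are hypothetical helper names, and the default marker assumes the same data-testid value used by the parser in Step 2, so adjust it to the site you target.

```javascript
// Hypothetical helper: count how often a marker string appears in the HTML.
function countMarker(html, marker) {
  if (!html || !marker) return 0;
  return String(html).split(marker).length - 1;
}

// Assumption: a rendered results page contains at least `minCards` listing cards.
function looksRendered(html, marker = 'data-testid="listing-card"', minCards = 1) {
  return countMarker(html, marker) >= minCards;
}

// Two made-up bodies: an unrendered shell and a page with two cards.
const shell = '<html><body><div id="root"></div></body></html>';
const rendered =
  '<ul><li data-testid="listing-card"></li><li data-testid="listing-card"></li></ul>';

console.log(looksRendered(shell)); // false
console.log(looksRendered(rendered)); // true
```

Calling looksRendered(response.body) before parsing gives a scheduled run a cheap way to notice that it received a shell or a block page instead of listings.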

Crawlbase Crawling API

That first request just returned a fully rendered listings page without a headless browser or a proxy on your side. The Crawling API runs the page in a real browser, rotates through residential IPs server-side, and handles the CAPTCHAs real estate portals throw at scrapers, so you get finished HTML from one call, and the same call holds up when a cron job fires it every morning. Point it at a public search on the free tier first.


Step 2: Parse each listing with cheerio

With rendered HTML in hand, load it into cheerio and walk the listing cards. A results page lays each property out in a repeating container, so you select every card, then read price, beds, baths, sqft, address, and link from inside it. The exact selectors below come from a typical card layout; you will adjust them to the site you target by inspecting one card in your browser's dev tools. Reading each field defensively keeps one missing value from crashing the run, and parsing the price into a number gives you something to sort and compare on.

javascript

const cheerio = require('cheerio');

function parseListings(html) {
  const $ = cheerio.load(html);
  const properties = [];

  $('li[data-testid="listing-card"]').each((i, el) => {
    const card = $(el);

    const price = card.find('span.listing-card-price').text().trim();
    const priceValue = parseFloat(price.replace(/[^0-9.]/g, ''));

    const beds = card.find('p:contains("Beds") strong').first().text().trim();
    const baths = card.find('p:contains("Baths") strong').first().text().trim();
    const sqft = card.find('p:contains("Sq Ft") strong').first().text().trim();

    const address = card.find('a.listing-card-address').text().trim();
    const href = card.find('a.listing-card-address').attr('href');
    const link = href
      ? new URL(href, 'https://www.example-realty.com').href
      : '';

    if (price && address) {
      properties.push({ price, priceValue, beds, baths, sqft, address, link });
    }
  });

  return properties;
}

A few details keep this faithful to the page. Each card sits inside a repeating li container, the price comes from the price span and is also parsed into a numeric priceValue so you can sort cheapest first, and beds, baths, and sqft are read from labeled blocks with a :contains() selector that survives small reorderings. The address anchor doubles as the listing link, so one find gives you both, and the href is resolved to an absolute URL so it works outside the page. Only cards with a price and an address get pushed, which drops the promo tiles real estate sites mix into a results grid.
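
The parser turns only the price into a number; beds, baths, and sqft stay as display strings such as "1,320". If you want to filter or compare on those fields too, a small normalizing step can sit between the parser and the store. This is a minimal sketch under that assumption: toNumber and normalizeListing are hypothetical helpers, and the bedsValue, bathsValue, and sqftValue field names are illustrative additions, not part of the record shape used elsewhere in this guide.

```javascript
// Hypothetical helper: pull the first number out of a display string.
// Returns null when the text contains no digits (for example "Studio").
function toNumber(text) {
  const match = String(text ?? '').replace(/,/g, '').match(/\d+(\.\d+)?/);
  return match ? Number(match[0]) : null;
}

// Returns a copy of a parsed record with numeric fields added next to the raw strings.
function normalizeListing(record) {
  return {
    ...record,
    bedsValue: toNumber(record.beds),
    bathsValue: toNumber(record.baths),
    sqftValue: toNumber(record.sqft),
  };
}

// Made-up record in the shape parseListings produces.
const sample = {
  price: '$3,150/mo',
  beds: '3',
  baths: '2.5',
  sqft: '1,320',
  address: '88 Maple Ave, Los Angeles, CA 90042',
};

console.log(normalizeListing(sample));
```

Keeping the raw strings alongside the numbers means a parsing mistake never destroys what the page actually showed.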

Selectors drift

Listing-card class names and data-testid values are generated and change without notice. Treat the selectors above as a starting template, not a contract. When a field comes back empty, re-inspect the live page in your browser's dev tools and update the selector. Periodic selector maintenance is normal for any production scraper, not a sign something is broken.
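
On an unattended schedule nobody is watching for empty fields, so it can help to measure them. The sketch below computes, for each field, the share of records in a batch that have a value, and lists the fields that fall below a threshold. fieldCoverage and driftedFields are hypothetical helper names, and the 0.5 threshold is an arbitrary assumption to tune against your own data.

```javascript
const FIELDS = ['price', 'beds', 'baths', 'sqft', 'address', 'link'];

// Share of records (0 to 1) that have a non-empty value for each field.
function fieldCoverage(records, fields = FIELDS) {
  const coverage = {};
  for (const field of fields) {
    const filled = records.filter((r) => String(r[field] ?? '').trim() !== '').length;
    coverage[field] = records.length ? filled / records.length : 0;
  }
  return coverage;
}

// Fields whose coverage falls below the threshold (assumed default: 0.5).
function driftedFields(records, threshold = 0.5) {
  const coverage = fieldCoverage(records);
  return Object.keys(coverage).filter((field) => coverage[field] < threshold);
}

// Made-up batch where the sqft selector has stopped matching.
const batch = [
  { price: '$2,400/mo', beds: '2', baths: '1', sqft: '', address: '1234 Sunset Blvd', link: 'https://www.example-realty.com/property/1234-sunset-blvd' },
  { price: '$3,150/mo', beds: '3', baths: '2', sqft: '', address: '88 Maple Ave', link: 'https://www.example-realty.com/property/88-maple-ave' },
];

console.log(driftedFields(batch)); // [ 'sqft' ]
```

Logging the drifted fields after each run points you at the selector to re-inspect instead of leaving you to discover the gap weeks later.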

Step 3: Assemble the scrape and store the results

Now wire the fetch and the parse into one function that returns clean records, then write each batch to disk with a timestamp. Keeping every run in its own timestamped file is what lets you diff one collection against the next and watch prices move.

javascript

const fs = require('fs');

async function scrape(url) {
  const response = await api.get(url, { ajax_wait: true, page_wait: 3000 });
  if (response.statusCode !== 200) {
    console.error(`Request failed: ${response.statusCode}`);
    return [];
  }
  return parseListings(response.body);
}

function save(properties) {
  properties.sort((a, b) => a.priceValue - b.priceValue);
  const timestamp = new Date().toISOString().replace(/[:.]/g, '-');
  const file = `listings_${timestamp}.json`;
  fs.writeFileSync(file, JSON.stringify(properties, null, 2));
  console.log(`Saved ${properties.length} properties to ${file}`);
  return file;
}

async function runOnce() {
  const url = 'https://www.example-realty.com/homes-for-rent/ca/los-angeles';
  const properties = await scrape(url);
  if (properties.length) save(properties);
}

module.exports = { scrape, save, runOnce, parseListings };

Paste the parseListings function from Step 2 and the API setup from Step 1 into the same file so scrape can call them. The setup is the require line and the api instance only; leave out the test request from Step 1, which would otherwise fire every time the module is loaded. Run node -e "require('./scraper').runOnce()" and you get a sorted, timestamped JSON file of every public listing on the page. That is the unit of work the automation layers below schedule and repeat.
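
Once you track more than one search, it helps if each file records where it came from and when. The sketch below is one possible variation on save, not a replacement the rest of the guide depends on: buildBatch and comparePrice are hypothetical helpers, the source, collectedAt, and count fields are assumed names, and the comparator places listings whose price did not parse to a number at the end rather than letting NaN disturb the sort.

```javascript
// Hypothetical comparator: cheapest first, listings with an unparsed price last.
function comparePrice(a, b) {
  const pa = Number.isFinite(a.priceValue) ? a.priceValue : Infinity;
  const pb = Number.isFinite(b.priceValue) ? b.priceValue : Infinity;
  if (pa === pb) return 0;
  return pa < pb ? -1 : 1;
}

// Builds the file name and payload for one batch without touching the disk.
function buildBatch(url, properties, now = new Date()) {
  // Turn the URL path into a file-safe slug, e.g. homes-for-rent-ca-los-angeles.
  const slug = new URL(url).pathname.replace(/[^a-z0-9]+/gi, '-').replace(/^-+|-+$/g, '');
  const stamp = now.toISOString().replace(/[:.]/g, '-');
  return {
    file: `listings_${slug}_${stamp}.json`,
    payload: {
      source: url,
      collectedAt: now.toISOString(),
      count: properties.length,
      listings: [...properties].sort(comparePrice), // copy, so the input is not mutated
    },
  };
}

// Made-up records and a fixed date so the result is reproducible.
const batch = buildBatch(
  'https://www.example-realty.com/homes-for-rent/ca/los-angeles',
  [
    { price: 'Contact agent', priceValue: NaN, address: '5 Oak St' },
    { price: '$2,400/mo', priceValue: 2400, address: '1234 Sunset Blvd' },
  ],
  new Date('2024-01-02T07:00:00.000Z')
);

console.log(batch.file);
```

Writing the result is then a single fs.writeFileSync(batch.file, JSON.stringify(batch.payload, null, 2)) call. Note that JSON has no NaN, so an unparsed priceValue is stored as null.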

Automate it with a schedule

A one-off scrape captures a single moment. Real estate data is only useful when it is current, so the first automation layer is a recurring run. With node-cron you keep the process alive and fire runOnce on a cron expression. The example below runs every morning at 7am in the time zone of the machine running it.

javascript

const cron = require('node-cron');
const { runOnce } = require('./scraper');

// Minute Hour DayOfMonth Month DayOfWeek
cron.schedule('0 7 * * *', async () => {
  console.log(`Scheduled run at ${new Date().toISOString()}`);
  try {
    await runOnce();
  } catch (error) {
    console.error('Scheduled run failed:', error.message);
  }
});

console.log('Scheduler started. Waiting for the next run...');

Save this as schedule.js next to scraper.js, start it with node schedule.js, and leave it running on a small server or a container. Each morning it scrapes the search and drops a fresh timestamped file, building a history you can diff for price changes, new listings, and properties that dropped off the market. If you prefer not to keep a process alive, the same runOnce call works from a system cron entry or any job runner; node-cron is just the in-process option. The pattern is identical to the one in the guide on how to automate Amazon scraping, where a schedule turns a single scrape into a tracking pipeline.
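
If you shorten the interval or lengthen the list of searches, a run can still be in progress when the next tick fires. One way to guard against that, sketched below, is to wrap the task so a tick is skipped while the previous run is active. createGuardedRunner is a hypothetical helper, not something node-cron provides, and the 'skipped' and 'completed' return values are illustrative.

```javascript
// Hypothetical helper: wrap an async task so overlapping calls are skipped.
function createGuardedRunner(task) {
  let running = false;
  return async function run() {
    if (running) return 'skipped'; // previous run has not finished yet
    running = true;
    try {
      await task();
      return 'completed';
    } finally {
      running = false; // release the guard even if the task throws
    }
  };
}

// Demonstration with a stand-in task that only counts its invocations.
let calls = 0;
const guarded = createGuardedRunner(async () => {
  calls += 1;
});

guarded(); // starts the task
guarded(); // skipped: the first call is still pending

console.log(calls); // 1
```

In the scheduler you would pass createGuardedRunner(runOnce) to the cron callback in place of calling runOnce directly.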

Scale up with the async Crawler and a webhook

A scheduled loop is fine for a handful of searches. Once you track dozens of cities or thousands of listing pages, waiting on each synchronous request in turn gets slow, and a long-running process is a fragile place to hold that much work. The async Crawler is built for this: you push URLs to it, Crawlbase fetches and renders them on its own infrastructure, and it posts each finished page back to a webhook you host. Your code stops waiting on requests and just handles results as they arrive.

First, stand up a small endpoint that receives the callbacks. The Crawler posts the rendered HTML to it, so you parse and store right there in the handler.

javascript

const express = require('express');
const { parseListings, save } = require('./scraper');

const app = express();
app.use(express.text({ type: '*/*', limit: '10mb' }));

app.post('/crawlbase-webhook', (req, res) => {
  const html = req.body;
  const properties = parseListings(html);
  if (properties.length) save(properties);
  res.sendStatus(200);
});

app.listen(3000, () => console.log('Webhook listening on :3000'));

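Then push your search URLs to the Crawler, naming the webhook as the callback. The Crawler queues each one, renders it, and calls your endpoint with the result, so you can submit a large batch and let the responses stream back. Treat the push call below as a sketch and confirm the class name and parameters against the Crawler documentation for your client version, since that interface is the part most likely to differ.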

javascript

const { Crawler } = require('crawlbase');

const crawler = new Crawler({ token: 'YOUR_CRAWLBASE_TOKEN' });

const searches = [
  'https://www.example-realty.com/homes-for-rent/ca/los-angeles',
  'https://www.example-realty.com/homes-for-rent/ca/san-diego',
  'https://www.example-realty.com/homes-for-rent/ca/san-francisco',
];

for (const url of searches) {
  crawler.post(
    url,
    { callback: 'true', callback_url: 'https://your-server.com/crawlbase-webhook' },
    { ajax_wait: true, page_wait: 3000 }
  );
}

The split is the point. The Crawler absorbs the slow, blocking part, the rendering and the retries, on Crawlbase's side, and your webhook only ever runs the fast parse-and-store step. That decoupling is what lets the same workflow go from three searches to three thousand without your process holding every request open. Your webhook does need a public URL during development; a tunneling tool exposes localhost:3000 so the Crawler can reach it.

What the output looks like

Whether a record comes from the scheduled run or the async webhook, every batch is the same shape: one object per listing, sorted cheapest first, with the price, beds, baths, sqft, address, and link.

json

[
  {
    "price": "$2,400/mo",
    "priceValue": 2400,
    "beds": "2",
    "baths": "1",
    "sqft": "850",
    "address": "1234 Sunset Blvd, Los Angeles, CA 90026",
    "link": "https://www.example-realty.com/property/1234-sunset-blvd"
  },
  {
    "price": "$3,150/mo",
    "priceValue": 3150,
    "beds": "3",
    "baths": "2",
    "sqft": "1,320",
    "address": "88 Maple Ave, Los Angeles, CA 90042",
    "link": "https://www.example-realty.com/property/88-maple-ave"
  }
]

Because each run lands in its own timestamped file, comparing two batches is a set difference on the link field for new and removed listings, and a join on link with a priceValue compare for price changes. That diff is the whole reason to automate: a single scrape tells you the market today, a scheduled history tells you where it is going. If you want the same records in a spreadsheet, the legacy version of this workflow wrote straight to Excel with ExcelJS, and adding that export back is a few lines on top of the JSON store.
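
The comparison described above can be sketched in a few lines. diffBatches is a hypothetical helper that takes two arrays in the record shape shown earlier and keys them on link; it assumes link is unique within a batch and reports a price change only when both values are real numbers.

```javascript
// Compare two batches of listing records, keyed by the listing link.
function diffBatches(previous, current) {
  const before = new Map(previous.map((p) => [p.link, p]));
  const after = new Map(current.map((p) => [p.link, p]));

  // Set difference on link: present now but not before, and the reverse.
  const added = current.filter((p) => !before.has(p.link));
  const removed = previous.filter((p) => !after.has(p.link));

  // Join on link, then keep only listings whose numeric price moved.
  const priceChanges = [];
  for (const p of current) {
    const old = before.get(p.link);
    if (!old) continue;
    const bothNumeric = Number.isFinite(old.priceValue) && Number.isFinite(p.priceValue);
    if (bothNumeric && old.priceValue !== p.priceValue) {
      priceChanges.push({ link: p.link, from: old.priceValue, to: p.priceValue });
    }
  }

  return { added, removed, priceChanges };
}

// Made-up batches: one price drop, one listing gone, one new listing.
const base = 'https://www.example-realty.com/property/';
const yesterday = [
  { link: `${base}1234-sunset-blvd`, priceValue: 2400 },
  { link: `${base}88-maple-ave`, priceValue: 3150 },
];
const today = [
  { link: `${base}1234-sunset-blvd`, priceValue: 2350 },
  { link: `${base}5-oak-st`, priceValue: 2900 },
];

const diff = diffBatches(yesterday, today);
console.log(diff.added.length, diff.removed.length, diff.priceChanges);
```

To run it against stored batches, load each file with JSON.parse(fs.readFileSync(file, 'utf8')) and pass the two arrays in chronological order.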

Staying unblocked at volume

Even with rendering handled, real estate portals watch for scraper-shaped traffic, and a schedule that fires every day makes patterns easy to spot. A few habits keep a run healthy.

Pace your requests. Spread fetches out rather than hammering pages in a tight loop. When you scrape many searches, add a delay between them or lean on the async Crawler, which queues and paces the work for you.
Lean on rotation. A pool of residential IPs spreads requests across many real-user addresses so no single one trips a limit or a CAPTCHA. The Crawling API and the async Crawler handle this for you; if you roll your own stack, this is the part to get right.
Read the status codes. A run that starts returning challenges or non-200 responses is telling you the current rate or IP tier is no longer enough. Treat that as signal to back off, not noise to ignore.
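
Backing off on non-200 responses can be sketched as an exponential delay between retries. The helpers below are illustrative, not part of the Crawlbase client: backoffDelay, sleep, and fetchWithBackoff are hypothetical names, fetchPage stands for any async function that resolves to an object with a statusCode, and the one-second base, one-minute cap, and four attempts are assumed defaults.

```javascript
// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at maxMs.
function backoffDelay(attempt, baseMs = 1000, maxMs = 60000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Promise-based pause.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// `fetchPage` is a hypothetical async function resolving to { statusCode, body }.
async function fetchWithBackoff(fetchPage, url, maxAttempts = 4, baseMs = 1000) {
  for (let attempt = 0; attempt < maxAttempts; attempt += 1) {
    const response = await fetchPage(url);
    if (response.statusCode === 200) return response;
    // Wait longer after each failure, but not after the final attempt.
    if (attempt < maxAttempts - 1) await sleep(backoffDelay(attempt, baseMs));
  }
  return null; // every attempt returned a non-200 status
}

console.log([0, 1, 2, 3].map((n) => backoffDelay(n))); // [ 1000, 2000, 4000, 8000 ]
```

With the api instance from Step 1, the call would look like fetchWithBackoff((u) => api.get(u, { ajax_wait: true, page_wait: 3000 }), url). A null result is the signal to stop and review your rate rather than retry harder.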

For the broader playbook, see how to scrape websites without getting blocked. If you want a single-site walkthrough instead of this automation focus, the dedicated guides on how to scrape Zillow and how to scrape Redfin cover those portals' specific card layouts and pagination.

Is it legal to scrape real estate data?

Whether scraping a real estate site is allowed depends on that site's terms of service, your jurisdiction, and what you do with the data. Most portals restrict automated access in their terms, so scraping can run against those terms regardless of how careful your tooling is. None of the code here changes that; it just makes the technical part work. Read the site's Terms of Use and its robots.txt, and treat both as the boundary for what you collect and how often you request it. A schedule makes rate discipline more important, not less.

Keep the work to public listing data only: the price, beds, baths, sqft, address, and listing link that anyone can see on a public results page without an account. Do not collect personal data about agents, owners, or buyers beyond what a public business listing already shows, and do not build profiles of individuals from it. GDPR and CCPA apply the moment personal data enters the picture, and a public street address attached to a named person can qualify, so lean toward the property facts and away from the people. Do not redistribute a portal's copyrighted media, such as listing photos, as if it were your own, and do not touch anything behind a login.

One point specific to this industry: much of the richest property data comes from the MLS, and MLS feeds are almost always licensed, not free for the taking. If your project needs comprehensive, accurate, redistributable listing data, the right path is a licensed feed or an official API, not a scraper. Several large portals run partner programs or developer APIs for exactly this reason. Use those when you need volume, guaranteed structure, or commercial rights. This guide is deliberately scoped to public listings on public search pages because that is the line that keeps the work defensible.

Recap

Key takeaways

Real estate data is only valuable when it is fresh. A single scrape is a snapshot; automating the run on a schedule turns it into a history you can diff for price moves and new listings.
Render behind a trusted IP before you parse. Portals render listings client-side and block hard, so a plain request returns an empty shell or a CAPTCHA; the Crawling API renders the page and rotates residential IPs in one call.
cheerio extracts the fields. Select every listing card, then read price, beds, baths, sqft, address, and link, parsing the price into a number so you can sort and compare; expect generated class names to drift.
Scale with the async Crawler and a webhook. Push URLs to the Crawler, let it render on Crawlbase's side, and have it post finished pages to your endpoint so the workflow goes from three searches to thousands without holding requests open.
Stay on public data. Respect each site's ToS and robots.txt, keep to public property facts and away from personal data, and prefer a licensed MLS feed or an official API for comprehensive or commercial use.

Frequently Asked Questions (FAQs)

How do I automate real estate data extraction on a schedule?

Wrap your scrape in a function and call it from a scheduler. The simplest in-process option is node-cron: give it a cron expression like 0 7 * * * and it fires your runOnce function every morning. Each run drops a fresh timestamped file, so you accumulate a history you can diff. If you would rather not keep a Node process alive, the same function works from a system cron entry or any job runner.

When should I use the async Crawler instead of the Crawling API?

Use the synchronous Crawling API when you scrape a handful of searches and want the result back in the same call. Switch to the async Crawler when you track dozens of cities or thousands of listing pages: you push the URLs, Crawlbase renders them on its own infrastructure, and it posts each finished page to a webhook you host. That decoupling keeps your process from waiting on every slow request in turn.

Why does a plain request return incomplete data from real estate sites?

Because most portals render their listing grid client-side with JavaScript and challenge automated traffic with CAPTCHAs. A raw HTTP request from a datacenter IP usually returns an empty shell or a block page rather than the property cards. To get a complete page you have to render it behind a trusted IP, which is what the Crawling API handles for you when you pass the JavaScript options.

What fields can I extract from a public property listing?

The public facts on a results card: price, number of beds, number of baths, square footage, the street address, and the link to the full listing page. This guide parses exactly those. Stay away from personal data about agents, owners, or buyers, and away from copyrighted media like listing photos, both of which carry legal and licensing constraints public property facts do not.

My selectors return empty values. What changed?

Almost certainly the site's markup. Listing-card class names and data-testid values are generated and change without notice, so selectors that worked last month can break, especially on a schedule that runs unattended. Re-inspect a live card in your browser's dev tools and update the selector. Periodic selector maintenance is normal for any production scraper.

Is it better to scrape or to use an MLS feed?

For comprehensive, accurate, redistributable listing data, a licensed MLS feed or an official portal API is the right tool, since MLS data is almost always licensed rather than free to take. Scraping public search pages is appropriate for tracking public listing facts at modest volume, research, and price-movement analysis, scoped to public data and within each site's terms. Match the source to the use: public facts and light volume favor a scraper, comprehensive or commercial use favors a licensed feed.

Henry Obinna

Freelance Content Writer

Freelance content writer who contributed web scraping and open-source tooling guides to the Crawlbase blog.
