Trulia is one of the busiest real estate marketplaces in the United States, and its search results carry exactly the structured data that drives price tracking, market research, and investment analysis: the asking price, beds, baths, square footage, the street address, and a link to each property's detail page. For anyone watching a local market, those listing pages are the raw material. The catch is that Trulia renders its results client-side and defends hard against automated traffic, so a plain HTTP request hands you a near-empty shell instead of the listings you came for.
This guide shows you how to scrape Trulia with Python the reliable way. You build a small, runnable scraper that fetches a rendered search results page through the Crawlbase Crawling API, parses each listing with BeautifulSoup, handles pagination, and exports the data to JSON and CSV. The whole walkthrough stays scoped to public property listings, and the legality section near the end is not boilerplate, so read it before you point this at any real volume.
What you will build
A Python script that takes a public Trulia search URL (for example, properties for sale in Los Angeles, CA), retrieves the rendered HTML through the Crawling API, and extracts a structured record for every listing on the page. We pull these fields from each property card:
- Price: the asking price shown on the listing.
- Address: the street address of the property.
- Beds: the number of bedrooms.
- Baths: the number of bathrooms.
- Size: the floor space in square feet.
- Link: the URL of the property's detail page.
Why a plain request fails on Trulia
If you request a Trulia search URL with a bare HTTP client, you get a response with status 200 and almost none of the listing data in the body. Two things work against you. First, Trulia renders much of its results content in the browser with JavaScript, so the initial HTML is a thin shell that only fills in after the page's scripts run. Second, the site flags automated traffic quickly: datacenter IPs and request patterns that do not look like a real browser get challenged, rate limited, or served a captcha before they ever reach the rendered listings.
So a working Trulia scraper needs two things in one request: a browser that actually renders the page, and an IP the platform reads as a real visitor. You can assemble that yourself with a headless browser plus a pool of rotating residential proxies, but stitching those together and keeping them healthy is most of the work. The Crawling API folds both into a single call: you send it the URL with a JavaScript token, it renders the page behind a trusted IP, and it returns finished HTML for you to parse.
Crawlbase offers two token types. The normal token fetches static HTML; the JavaScript (JS) token renders the page in a real browser first. Trulia fills its listing cards client-side, so you need the JS token here. Using the normal token returns the same empty shell a plain fetch would, and there is nothing useful to parse out of it.
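One quick way to tell an empty shell from a rendered page is to look for the results container before parsing anything. The snippet below is a minimal sketch: `looks_rendered` is a hypothetical helper name, the two HTML strings are made-up stand-ins rather than real Trulia responses, and the marker is the `data-testid` value this guide selects on later, which Trulia may change.

```python
# Marker this guide selects on later; Trulia may change it without notice.
LIST_MARKER = 'data-testid="search-result-list-container"'


def looks_rendered(html):
    """Return True when the HTML appears to contain the results list."""
    return bool(html) and LIST_MARKER in html


# Made-up stand-ins for a bare shell and a rendered page.
shell = "<html><body><div id='root'></div></body></html>"
rendered = '<ul data-testid="search-result-list-container"><li>card</li></ul>'

print(looks_rendered(shell))     # False
print(looks_rendered(rendered))  # True
```

A check like this is cheap to run on every response, so a wrong token type or a too-short wait shows up as a clear False instead of an empty result list further down the pipeline.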
Prerequisites
You need a few things in place before writing any code. None of them take long.
Basic Python. You should be comfortable writing and running a Python script and installing packages with pip. If you are new to the language, the Python web scraping guide covers the level this tutorial assumes.
Python 3.8 or later. Confirm your version with python --version, and check that pip is present with pip --version. If you do not have Python, install it from python.org based on your operating system.
A Crawlbase account and JS token. Sign up to get your first 1,000 requests, open your dashboard, and copy your JavaScript (JS) token from the account docs page. Treat the token like a password: it authenticates your requests, so keep it out of version control.
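One common way to keep the token out of version control is to read it from an environment variable instead of hard-coding it. This is a sketch under assumptions: the variable name `CRAWLBASE_JS_TOKEN` and the helper `load_token` are illustrative choices, not something Crawlbase prescribes.

```python
import os


def load_token(env_var="CRAWLBASE_JS_TOKEN"):
    """Read the token from the environment; fail loudly when it is unset."""
    # The variable name is an assumption for this sketch; pick your own.
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return token
```

With this in place, the client in the later steps could be created as `CrawlingAPI({"token": load_token()})` in place of the literal placeholder string.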
Set up the project
Create a virtual environment so project dependencies stay isolated, then install the three libraries the scraper needs.
    python --version
    python -m venv trulia_env
    source trulia_env/bin/activate
    pip install crawlbase beautifulsoup4 pandas
On Windows, activate the environment with trulia_env\Scripts\activate instead of the source line. Three dependencies do the work: crawlbase is the official client for the Crawling API, beautifulsoup4 parses the returned HTML so you can pull out fields by CSS selector, and pandas handles the CSV export at the end. If you have not used the parser before, the BeautifulSoup guide is a good companion to this tutorial.
Step 1: Fetch the rendered search page
Start by getting the finished page. Import the CrawlingAPI class, initialize it with your JS token, and request the search URL. Because Trulia loads its cards asynchronously, pass ajax_wait and page_wait so the API holds until the listings are present. Checking the status before you parse keeps failures loud instead of silent.
    from crawlbase import CrawlingAPI

    # Use your JavaScript (JS) token here, not the normal token.
    api = CrawlingAPI({"token": "YOUR_CRAWLBASE_TOKEN"})

    # Wait for async content, then hold 8000 ms before the page is captured.
    options = {"ajax_wait": "true", "page_wait": 8000}


    def crawl(page_url):
        response = api.get(page_url, options)
        # pc_status reflects the crawl outcome reported by the Crawling API.
        if response["headers"]["pc_status"] == "200":
            return response["body"].decode("utf-8")
        print(f"Request failed. pc_status: {response['headers']['pc_status']}")
        return None


    if __name__ == "__main__":
        search_url = "https://www.trulia.com/CA/Los_Angeles/"
        html = crawl(search_url)
        print(html[:500] if html else "No HTML returned")
The two wait options matter for a client-rendered target like this. ajax_wait tells the API to wait for asynchronous content to finish loading, and page_wait holds for a fixed number of milliseconds after load so late-rendering cards appear before the page is captured. Eight seconds is a reasonable start for Trulia; raise it if the listings come back empty. The Crawling API returns a pc_status header that reflects the crawl outcome, so check it rather than the raw HTTP code. Save the code as trulia_scraper.py, run it with python trulia_scraper.py, and you should see real listing markup, not the empty shell a plain request returns. That confirms rendering works before you write a single selector.
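The advice to raise page_wait when listings come back empty can be automated. The sketch below assumes a `fetch(url, options)` callable that returns HTML or None, for example a variant of `crawl` that accepts the options dict as a parameter; `fetch_with_longer_waits`, the wait values, and `fake_fetch` are all illustrative, and the fake stands in for the network so the example runs offline.

```python
LIST_MARKER = 'data-testid="search-result-list-container"'


def fetch_with_longer_waits(fetch, url, waits=(8000, 12000, 16000)):
    """Try each page_wait value in turn until the results list shows up.

    Returns (html, wait_ms) on success, or (None, None) if every wait fails.
    """
    for wait_ms in waits:
        options = {"ajax_wait": "true", "page_wait": wait_ms}
        html = fetch(url, options)
        if html and LIST_MARKER in html:
            return html, wait_ms
    return None, None


def fake_fetch(url, options):
    # Stand-in for the real API call: cards only appear at 12000 ms or more.
    if options["page_wait"] >= 12000:
        return '<ul data-testid="search-result-list-container"><li></li></ul>'
    return "<html></html>"


html, used = fetch_with_longer_waits(
    fake_fetch, "https://www.trulia.com/CA/Los_Angeles/"
)
print(used)  # 12000
```

Each extra attempt is another API request, so a short list of wait values is usually enough; if the longest wait still returns no cards, the selector is a more likely culprit than the timing.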
Trulia needs a rendered page behind a trusted IP, in one call, and the ajax_wait plus page_wait options you just set are how you wait out its client-side loading. The Crawling API takes a JS token, runs the page in a real browser, rotates through residential IPs server-side, and hands you finished HTML, so you skip running a headless fleet and a proxy pool yourself. Point it at a public search page on the free tier first.
Step 2: Collect the listing cards
Before pulling individual fields, you need the set of property cards on the page. On Trulia, every listing sits inside an li element, and all of those li elements live inside a ul with the attribute data-testid="search-result-list-container". Selecting that container's direct children gives you one node per property.
    from bs4 import BeautifulSoup


    def get_listings(html):
        soup = BeautifulSoup(html, "html.parser")
        return soup.select('ul[data-testid="search-result-list-container"] > li')
This returns a list of card elements. Each one is a self-contained scope you can query for that property's price, address, and the rest, which keeps the per-field selectors simple and avoids mixing data between listings.
Step 3: Parse the fields from each card
With a card in hand, pull each field by its data-testid attribute. Trulia is consistent about these attributes across listings, which makes them more stable to target than visual class names. Wrap each lookup so a missing element returns None instead of throwing, since not every listing carries every field (a land-only listing, for instance, may have no bed or bath count).
    def text_at(listing, selector):
        el = listing.select_one(selector)
        return el.get_text(strip=True) if el else None


    def parse_listing(listing):
        link_el = listing.select_one('a[data-testid="property-card-link"]')
        link = "https://www.trulia.com" + link_el["href"] if link_el else None
        return {
            "price": text_at(listing, 'div[data-testid="property-price"]'),
            "address": text_at(listing, 'div[data-testid="property-address"]'),
            "beds": text_at(listing, 'div[data-testid="property-beds"]'),
            "baths": text_at(listing, 'div[data-testid="property-baths"]'),
            "size": text_at(listing, 'div[data-testid="property-floorSpace"]'),
            "link": link,
        }
The text_at helper does the repetitive part: it queries an element and returns its stripped text, or None when the element is absent, so one missing field never crashes the run. Price lives in property-price, the street address in property-address, beds and baths in property-beds and property-baths, and the floor space in property-floorSpace. The detail-page link sits on an a with data-testid="property-card-link", and because that href is relative, you prefix the Trulia origin to get an absolute URL.
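Plain string concatenation works as long as every href is relative. A slightly more defensive option, sketched here, is `urllib.parse.urljoin` from the standard library, which leaves an already absolute href untouched. `absolute_link` is a hypothetical helper and the example paths are invented.

```python
from urllib.parse import urljoin

TRULIA_ORIGIN = "https://www.trulia.com"


def absolute_link(href):
    """Resolve a card href against the Trulia origin; None stays None."""
    return urljoin(TRULIA_ORIGIN, href) if href else None


# Invented example paths, not real listings.
print(absolute_link("/p/ca/los-angeles/example"))
# https://www.trulia.com/p/ca/los-angeles/example
print(absolute_link("https://www.trulia.com/p/ca/los-angeles/example"))
# https://www.trulia.com/p/ca/los-angeles/example
print(absolute_link(None))
# None
```

In `parse_listing`, this would replace the concatenation with something like `absolute_link(link_el.get("href")) if link_el else None`, which also tolerates an anchor that has no href attribute.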
Trulia's data-testid values are stable today but not guaranteed. When a field comes back as None across every card, re-inspect a live listing in your browser's dev tools and update the selector. Periodic selector maintenance is normal for any production scraper, not a sign something is broken.
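The "None across every card" signal is easy to check in code after each run. This is a minimal sketch with a hypothetical helper name, `fields_missing_everywhere`, run against invented records shaped like the ones `parse_listing` returns.

```python
def fields_missing_everywhere(records):
    """Return field names whose value is None in every record."""
    if not records:
        return []
    fields = list(records[0].keys())
    return [f for f in fields if all(r.get(f) is None for r in records)]


# Invented records: 'size' is None on every card, 'beds' only on one.
records = [
    {"price": "$500,000", "beds": "3bd", "size": None},
    {"price": "$750,000", "beds": None, "size": None},
]
print(fields_missing_everywhere(records))  # ['size']
```

A field that is missing on some cards is normal; a field that is missing on all of them is the one whose selector deserves a second look.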
Step 4: Assemble the full script
Now wire the fetch, the card collection, and the field parsing into one runnable script. Fetch the rendered HTML, iterate the cards, parse each into a record, and print the results as JSON.
    import json

    from bs4 import BeautifulSoup
    from crawlbase import CrawlingAPI

    # Use your JavaScript (JS) token here, not the normal token.
    api = CrawlingAPI({"token": "YOUR_CRAWLBASE_TOKEN"})
    options = {"ajax_wait": "true", "page_wait": 8000}


    def crawl(page_url):
        """Fetch the rendered HTML, or return None when the crawl fails."""
        response = api.get(page_url, options)
        if response["headers"]["pc_status"] == "200":
            return response["body"].decode("utf-8")
        print(f"Request failed. pc_status: {response['headers']['pc_status']}")
        return None


    def get_listings(html):
        """Return one li element per property card on the page."""
        soup = BeautifulSoup(html, "html.parser")
        return soup.select('ul[data-testid="search-result-list-container"] > li')


    def text_at(listing, selector):
        """Return the stripped text of the first match, or None if absent."""
        el = listing.select_one(selector)
        return el.get_text(strip=True) if el else None


    def parse_listing(listing):
        """Map one card to a record; missing fields come back as None."""
        link_el = listing.select_one('a[data-testid="property-card-link"]')
        link = "https://www.trulia.com" + link_el["href"] if link_el else None
        return {
            "price": text_at(listing, 'div[data-testid="property-price"]'),
            "address": text_at(listing, 'div[data-testid="property-address"]'),
            "beds": text_at(listing, 'div[data-testid="property-beds"]'),
            "baths": text_at(listing, 'div[data-testid="property-baths"]'),
            "size": text_at(listing, 'div[data-testid="property-floorSpace"]'),
            "link": link,
        }


    def main():
        search_url = "https://www.trulia.com/CA/Los_Angeles/"
        html = crawl(search_url)
        if not html:
            return
        listings = get_listings(html)
        results = [parse_listing(li) for li in listings]
        print(json.dumps(results, indent=2))


    if __name__ == "__main__":
        main()
What the output looks like
Run the full script with python trulia_scraper.py and you get a clean list of structured records, one per listing on the page, ready to write to JSON, CSV, or a database.
    [
      {
        "price": "$4,750,000",
        "address": "9240 W National Blvd, Los Angeles, CA 90034",
        "beds": "9bd",
        "baths": "9ba",
        "size": "6,045 sqft",
        "link": "https://www.trulia.com/p/ca/los-angeles/..."
      },
      {
        "price": "$1,499,999",
        "address": "245 Windward Ave, Venice, CA 90291",
        "beds": "4bd",
        "baths": "3ba",
        "size": "1,332 sqft",
        "link": "https://www.trulia.com/p/ca/venice/..."
      }
    ]
Listings with missing data come back with null in those fields rather than failing, which is why a land-only or pre-construction listing might show no beds, baths, or size. That is expected, and downstream code should treat any field as optional.
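Because every field is a display string or None, downstream analysis usually needs a normalization pass. The following is a sketch, not part of the scraper above: `to_number` and `normalize` are hypothetical helpers, and the sample record is adapted from the output shown earlier with baths set to None to exercise the optional case.

```python
import re


def to_number(text):
    """Pull the first number out of strings like '$4,750,000' or '9bd'."""
    if not text:
        return None
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if not match:
        return None
    value = float(match.group().replace(",", ""))
    return int(value) if value.is_integer() else value


def normalize(record):
    """Return a copy with numeric fields parsed; missing values stay None."""
    clean = dict(record)
    for field in ("price", "beds", "baths", "size"):
        clean[field] = to_number(record.get(field))
    return clean


sample = {
    "price": "$4,750,000",
    "address": "9240 W National Blvd, Los Angeles, CA 90034",
    "beds": "9bd",
    "baths": None,  # set to None here to show the optional case
    "size": "6,045 sqft",
    "link": None,
}
clean = normalize(sample)
print(clean["price"])  # 4750000
print(clean["beds"])   # 9
print(clean["baths"])  # None
print(clean["size"])   # 6045
```

The regex only reads the first plain number it finds, so abbreviated values such as "$1.2M" or text such as "Studio" would need their own handling if they appear in your data.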
Handling pagination and exporting data
One page is a demo; a real job runs over a whole city. Trulia paginates its search results with a path-based scheme: it appends a sequential page segment to the search URL, so the first page is /1_p/, the second is /2_p/, and so on. Iterating that number walks the result set, and you reuse the same crawl and parsing functions on every page.
    # Add this to trulia_scraper.py from Step 4: it reuses crawl(),
    # get_listings(), and parse_listing(), and replaces the __main__ block.
    import json
    import time

    import pandas as pd


    def scrape_pages(base_url, num_pages):
        results = []
        for page in range(1, num_pages + 1):
            page_url = f"{base_url}/{page}_p/"
            html = crawl(page_url)
            if not html:
                print(f"Skipping page {page}: no HTML.")
                continue
            listings = get_listings(html)
            if not listings:
                # An empty page means we have run past the last result page.
                break
            results.extend(parse_listing(li) for li in listings)
            time.sleep(2)  # pace the run between pages
        return results


    def export(results):
        with open("trulia_listings.json", "w") as f:
            json.dump(results, f, indent=2)
        pd.DataFrame(results).to_csv("trulia_listings.csv", index=False)
        print(f"Saved {len(results)} listings to JSON and CSV.")


    if __name__ == "__main__":
        # No trailing slash: scrape_pages adds the separator itself.
        base = "https://www.trulia.com/CA/Los_Angeles"
        data = scrape_pages(base, num_pages=3)
        export(data)
The time.sleep(2) between pages is deliberate: it paces the run so you are not hammering the site, which is one of the most effective habits for staying unblocked. The loop stops at the first page that returns no cards, so it does not keep requesting pages beyond the end of the results; a page that fails to fetch is skipped rather than treated as the end. The export function writes both trulia_listings.json and trulia_listings.csv; pandas turns the list of dicts into a flat table where each field becomes a column. Adjust the page count and the city slug in base to fit your target market, and keep base free of a trailing slash so the page segment joins cleanly.
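The path-based page scheme is simple enough to isolate in one small function, which also guards against a doubled slash when the base URL already ends with one. `build_page_url` is a hypothetical helper sketched for illustration.

```python
def build_page_url(base_url, page):
    """Build '<base>/<page>_p/' whether or not base ends with a slash."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return f"{base_url.rstrip('/')}/{page}_p/"


print(build_page_url("https://www.trulia.com/CA/Los_Angeles", 1))
# https://www.trulia.com/CA/Los_Angeles/1_p/
print(build_page_url("https://www.trulia.com/CA/Los_Angeles/", 2))
# https://www.trulia.com/CA/Los_Angeles/2_p/
```

Inside `scrape_pages`, the f-string that builds `page_url` could call this helper instead, so either form of the base URL produces the same request.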
Staying unblocked
Even with rendering handled, Trulia watches for scraper-shaped traffic. A few habits keep a run healthy, and they apply to any hard commercial target.
- Pace your requests. Hammering pages in a tight loop is the fastest way to get throttled or served a captcha. Spread requests out, as the sleep above does, and avoid crawling one path at full speed.
- Lean on rotation. A pool of residential IPs spreads requests across many real-user addresses so no single one trips a rate limit. The Crawling API handles this for you; if you roll your own stack, this is the part to get right.
- Read the status codes. A run that starts returning challenges or non-200 pc_status values is telling you the current rate or IP tier is no longer enough. Treat that as a signal to back off, not noise to ignore.
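Backing off can be made mechanical with exponential delays plus jitter. The sketch below assumes a `fetch(url)` callable that returns HTML or None, which matches how `crawl` behaves on a non-200 pc_status; the helper names, retry count, and delay values are illustrative defaults, not recommendations from Crawlbase or Trulia.

```python
import random
import time


def backoff_delays(retries=4, base=2.0, cap=60.0, rng=random.random):
    """Yield exponentially growing delays in seconds, with jitter, capped."""
    for attempt in range(retries):
        delay = min(cap, base * (2 ** attempt))
        # Jitter: wait between half and all of the computed delay.
        yield delay * (0.5 + rng() / 2)


def fetch_with_backoff(fetch, url, sleep=time.sleep, retries=4):
    """Call fetch(url); after each miss, sleep longer and try again.

    Makes at most retries + 1 attempts and returns None if all of them fail.
    """
    html = fetch(url)
    for delay in backoff_delays(retries):
        if html:
            return html
        sleep(delay)
        html = fetch(url)
    return html or None
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting, and the jitter avoids retrying on a fixed rhythm.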
For the broader playbook, see how to scrape websites without getting blocked. If your target sites lean heavily on JavaScript, the guide on crawling JavaScript websites covers the rendering side in more depth.
Is it legal to scrape Trulia?
Whether scraping Trulia is allowed depends on Trulia's terms of service, your jurisdiction, and what you do with the data. Trulia's terms restrict automated access, so scraping can run against those terms regardless of how careful your tooling is. None of the code here changes that; it just makes the technical part work. Read the Trulia Terms of Service and its robots.txt, respect its rate expectations, and treat both as the boundary for what you collect.
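Checking a path against robots.txt rules can be done with the standard library's `urllib.robotparser`. This is a sketch only: the rules and the example.com URLs below are invented for illustration and are not Trulia's actual robots.txt, and `is_allowed` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, url, agent="*"):
    """Check a URL against robots.txt rules supplied as a list of lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, url)


# Invented rules for illustration; fetch and read the real file yourself.
sample_rules = [
    "User-agent: *",
    "Disallow: /account/",
]

print(is_allowed(sample_rules, "https://example.com/CA/Los_Angeles/"))  # True
print(is_allowed(sample_rules, "https://example.com/account/saved"))    # False
```

A passing robots.txt check covers only that one file; the terms of service still apply on their own.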
A few lines worth holding to. Collect only public property listing data: the price, address, beds, baths, square footage, and the listing link that anyone can see without an account. Avoid anything tied to identifiable individuals, including the contact details of agents, brokers, or owners shown on a card, which fall outside the public-listing scope this guide covers. One detail specific to real estate is worth flagging: much of the underlying property data on sites like Trulia originates from Multiple Listing Service (MLS) feeds, which are typically licensed and carry their own usage restrictions. Republishing that data in bulk can run into those licenses even when the page itself is public.
This guide is deliberately scoped to public search and listing pages because that is the line that keeps the work defensible. It does not cover anything behind a login, saved-search or account data, the personal contact details of individuals, or any attempt to bypass authentication. If your project needs more than public listing fields, the right path is a licensed real estate data feed or an official agreement, not a cleverer scraper. Where a site offers an official API or data partnership, prefer it; it gives you cleaner data and a clear license at the same time.
Key takeaways
- Trulia is client-side rendered. A plain request returns an empty shell, so you must render the page before you parse it.
- You need rendering and a trusted IP together. The Crawling API with a JS token does both in one call; ajax_wait and page_wait control how long it waits for the cards to load.
- Target the stable attributes. Trulia's data-testid values (property-price, property-address, property-beds, property-baths, property-floorSpace) drive the per-field extraction, with each card scoped to one li.
- Paginate by path and export both formats. Trulia uses /N_p/ page segments; loop them, parse each card, and write the result to JSON and CSV with pandas.
- Stay on public data. Respect Trulia's ToS and robots.txt, collect only public listing fields, mind that MLS data is often licensed, and never touch accounts, logins, or the personal contact details of individuals.
Frequently Asked Questions (FAQs)
Why does a plain request return no data from Trulia?
Because Trulia renders its search results client-side with JavaScript. The initial HTML is a shell that only fills in after the page's scripts run in a browser, so a raw HTTP request returns status 200 with the price, beds, baths, and address fields blank. To get real data you have to render the page first, which is what the Crawling API's JS token handles for you.
Do I need the normal token or the JS token for Trulia?
The JS token. The normal token fetches static HTML, which on Trulia is the same empty shell a plain request returns. The JS token renders the page in a real browser before handing back the HTML, so the listing cards are present when BeautifulSoup parses them.
What data can I scrape from a Trulia listing?
Public listing fields: the asking price, the street address, the number of beds and baths, the floor space in square feet, and the link to the detail page. Stay on data that is visible to any visitor without an account, and avoid the personal contact details of agents, brokers, or owners, which fall outside the public-listing scope this guide covers.
How does pagination work on Trulia?
Trulia uses a path-based scheme, appending a sequential page segment to the search URL: /1_p/ for the first page, /2_p/ for the second, and so on. The scrape_pages function above loops that number, fetches each page through the Crawling API, parses the cards, and stops when a page returns no listings.
My selectors return None on every card. What changed?
Almost certainly Trulia's markup. The data-testid values that this scraper targets can change without notice, so selectors that worked last month can break. Re-inspect a live listing in your browser's dev tools and update the selectors. Periodic selector maintenance is normal for any production scraper.
How do I scrape other real estate sites the same way?
The same pattern carries over: render the page, collect the listing cards, and map each public field to a selector. The specifics differ per site, so see the companion guides on how to scrape Zillow and how to scrape Realtor.com, or the rental-focused Apartments.com walkthrough, which reuse this exact fetch-and-parse structure.
