Introduction

The digital landscape offers vast amounts of information, often nestled within the depths of websites. The capacity to expertly navigate these sites and extract pertinent data can significantly aid analysis, research, and strategic business decisions. Enter deep crawling: an advanced form of web scraping. This comprehensive guide will equip you with the skills to proficiently use deep crawling with Java Spring Boot, harnessing the capabilities of the Crawlbase Crawler and its Java library.

Consider the possibility of traversing beyond a website’s primary pages, unveiling information that’s hidden or less accessible. Deep crawling grants this capability. It parallels the thrill of unearthing concealed treasures in a game. Through deep crawling, even the most secluded sections of a website become accessible, revealing data that might otherwise go unnoticed.

What’s even more remarkable is that we’re not just talking theory – we will show you how to do it. Using Java Spring Boot and the Crawlbase Java library, we’ll teach you how to make deep crawling a reality. We’ll help you set up your tools, explain the difference between shallow and deep crawling (it’s not as complicated as it sounds!), and show you how to extract information from different website pages and store it on your side.

Whether you’re a developer, someone who loves working with data, or just curious about the amazing things you can do with web data, this guide is perfect for you. To understand the coding part, you must have a basic understanding of Java Spring Boot and MySQL database.

Table of Contents:

  1. Understanding Deep Crawling: The Gateway to Web Data
  • Unveiling the Power of Deep Crawling
  • Differentiating Between Shallow and Deep Crawling
  • Exploring the Scope and Significance of Deep Crawling
  2. Understanding Crawlbase Crawler: How It Works
  • What Is Crawlbase Crawler?
  • The Workflow: How Crawlbase Crawler Operates
  • The Benefits: Why Choose Crawlbase Crawler
  3. Setting the Stage: Preparing Your Environment
  • Installing Java on Ubuntu and Windows
  • Installing Spring Tool Suite (STS) on Ubuntu and Windows
  • Installing MySQL on Ubuntu and Windows
  4. Simplifying Spring Boot Project Setup with Spring Initializr
  5. Importing the Starter Project into Spring Tool Suite
  6. Understanding Your Project’s Blueprint: A Peek into Project Structure
  • The Heart of the Project: Maven and pom.xml
  • Meet the Libraries: Dependencies Unleashed
  • Understand the Project Structure
  7. Starting the Coding Journey
  • Integrating Crawlbase Dependency
  • Integrating JSoup Dependency
  • Setting Up the Database
  • Planning the Models
  • Designing the Model Files
  • Establishing Repositories for Both Models
  • Planning APIs and Request Body Mapper Classes
  • Creating a ThreadPool to Optimize the Webhook
  • Creating the Controllers and their Services
  • Updating application.properties File
  8. Running the Project and Initiating Deep Crawling
  • Making the Webhook Publicly Accessible
  • Creating the Crawlbase Crawler
  • Initiating Deep Crawling by Pushing URLs
  9. Analyzing Output in the Database: Unveiling Insights
  10. Conclusion
  11. Frequently Asked Questions

1. Understanding Deep Crawling: The Gateway to Web Data

In the world of the internet, where information is abundant and diverse, extracting valuable insights from websites has become essential for individuals and businesses. Deep crawling, an advanced form of web scraping, has emerged as the gateway to unlocking a wealth of data buried within the intricate web of online platforms. In this section, we’ll delve into the essence of deep crawling, its fundamental distinctions from shallow crawling, and the broader scope of its significance in the world of data extraction.

Unveiling the Power of Deep Crawling

At its core, deep crawling is the sophisticated process of systematically browsing through websites and extracting specific information from multiple levels of interconnected pages. Unlike shallow crawling, which typically limits itself to skimming the surface-level content, deep crawling traverses the complex hierarchy of links, uncovering valuable data that resides deeper within websites. This process enables us to harvest diverse information, from product prices and user reviews to financial data and news articles.

Deep crawling empowers us to access a treasure trove of structured and unstructured data that would otherwise remain hidden. By diligently navigating through the intricate pathways of the web, we can collect data that holds the potential to inform business decisions, fuel research endeavors, and drive innovation.

Differentiating Between Shallow and Deep Crawling

To fully appreciate the power of deep crawling, one must understand the difference between shallow and deep crawling. Shallow crawling is like skimming the surface of a pond, capturing only the visible elements. It primarily focuses on indexing a limited portion of a website’s content, often limited to the homepage or a few top-level pages. While shallow crawling offers a snapshot of a website’s basic features, it fails to reach the vast amount of data hidden deeper within its structure.

On the other hand, deep crawling is akin to diving deep into the ocean, exploring its intricate ecosystems. It involves thoroughly examining the entire website, following links from page to page, and unearthing valuable information that might be buried within sub-directories or behind authentication barriers. Deep crawling’s ability to extract comprehensive data makes it an invaluable tool for businesses seeking to gather competitive intelligence, researchers aiming to analyze trends, and developers building data-driven applications.

Shallow vs Deep Crawl

Exploring the Scope and Significance of Deep Crawling

The scope of deep crawling extends far beyond data extraction; it’s a gateway to understanding the web’s dynamics and uncovering insights that drive decision-making. From e-commerce platforms that want to monitor product prices across competitors’ sites to news organizations aiming to analyze sentiment across articles, the applications of deep crawling are as diverse as the data it reveals.

In the realm of research, deep crawling serves as a foundation for data-driven analyses that shed light on emerging trends, user behaviors, and patterns in online content consumption. Its significance also stretches to legal and regulatory compliance, as organizations must navigate the ethical considerations of data extraction and respect websites’ terms of use.

2. Understanding Crawlbase Crawler: How It Works

At the heart of efficient web data extraction lies the Crawlbase Crawler, a tool designed to transform how you gather and utilize website information. But what exactly is Crawlbase Crawler, and how does it work its magic?

What Is Crawlbase Crawler?

Crawlbase Crawler is a dynamic web data extraction tool that offers a modern and intelligent approach to collecting valuable information from websites. Unlike traditional scraping methods that involve constant polling, Crawlbase Crawler operates asynchronously. This means it can independently process requests to extract data, delivering it in real-time without the need for manual monitoring.

The Workflow: How Crawlbase Crawler Operates

Crawlbase Crawler operates on a seamless and efficient workflow that can be summarized in a few key steps:

  1. URLs Submission: As a user, you initiate the process by submitting URLs to the Crawlbase Crawler using the Crawling API.
  2. Request Processing: The Crawler receives these requests and processes them asynchronously. This means it can handle multiple requests simultaneously without any manual intervention.
  3. Data Extraction: The Crawler visits the specified URLs, extracts the requested data, and packages it for delivery.
  4. Webhook Integration: Crawlbase Crawler integrates with webhook instead of requiring manual polling. This webhook serves as a messenger that delivers the extracted data directly to your server’s endpoint in real time.
  5. Real-Time Delivery: The extracted data is delivered to your server’s webhook endpoint as soon as it’s available, enabling immediate access without delays.
  6. Fresh Insights: By receiving data in real-time, you gain a competitive edge in making informed decisions based on the latest web content.
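
To make this workflow concrete, here is a minimal, hypothetical sketch of steps 1 and 2 using the Crawlbase Java SDK that we integrate later in this guide. The token, crawler name, and URL below are placeholders, not working values.

import java.util.HashMap;

import com.crawlbase.API;

public class PushUrlSketch {

    public static void main(String[] args) throws Exception {
        // Placeholder token – use your own from the Crawlbase dashboard
        API api = new API("YOUR_CRAWLBASE_TOKEN");

        HashMap<String, Object> options = new HashMap<>();
        options.put("crawler", "my-crawler"); // placeholder: the crawler created in the dashboard
        options.put("callback", "true");      // deliver the crawled output to the crawler's webhook

        // Submit the URL; the Crawler responds immediately with a request ID (rid)
        api.get("https://example.com/", options);
        System.out.println(api.getBody()); // e.g. {"rid":"..."}
    }

}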

The Benefits: Why Choose Crawlbase Crawler

Crawlbase Crawler’s distinctive approach offers several key benefits:

  1. Efficiency: Asynchronous processing eliminates the need for continuous monitoring, freeing up your resources for other tasks.
  2. Real-Time Insights: Receive data as soon as it’s available, allowing you to stay ahead of trends and changes.
  3. Streamlined Workflow: Webhook integration replaces manual polling, simplifying the data delivery process.
  4. Timely Decision-Making: Instant access to freshly extracted data empowers timely and data-driven decision-making.

To harness the prowess of the Crawler, you must first create it within your Crawlbase account dashboard. You can opt for the TCP or JavaScript Crawler based on your specific needs. The TCP Crawler is ideal for static pages, while the JavaScript Crawler suits content generated via JavaScript, such as pages built with JavaScript frameworks or content rendered dynamically in the browser. You can read more about the Crawlbase Crawler here.

During creation, you will be asked for your webhook address, so we will create the crawler after we have successfully built a webhook in our Spring Boot project. In the upcoming sections, we’ll set up our environment and develop the components required to complete the project.

3. Setting the Stage: Preparing Your Environment

Before we embark on our journey into deep crawling, it’s important to set the stage for success. This section guides you through the essential steps to ensure your development environment is ready to tackle the exciting challenges ahead.

Installing Java on Ubuntu and Windows

Java is the backbone of our development process, and we have to make sure that it’s available on our system. If you don’t have Java installed on your system, you can follow the steps below as per your operating system.

Installing Java on Ubuntu:

  1. Open the Terminal by pressing Ctrl + Alt + T.
  2. Run the following command to update the package list:
sudo apt update
  3. Install the Java Development Kit (JDK) by running:
sudo apt install default-jdk
  4. Verify the JDK installation by typing:
java -version

Installing Java on Windows:

  1. Visit the official Oracle website and download the latest Java Development Kit (JDK).
  2. Follow the installation wizard’s prompts to complete the installation. Once installed, you can verify it by opening the Command Prompt and typing:
java -version

Installing Spring Tool Suite (STS) on Ubuntu and Windows:

Spring Tool Suite (STS) is an integrated development environment (IDE) specifically designed for developing applications using the Spring Framework, a popular Java framework for building enterprise-level applications. STS provides tools, features, and plugins that enhance the development experience when working with Spring-based projects; follow the steps below to install it.

  1. Visit the official Spring Tool Suite website at spring.io/tools.
  2. Download the appropriate version of Spring Tool Suite for your operating system (Ubuntu or Windows).

On Ubuntu:

  1. After downloading, navigate to the directory where the downloaded file is located in the Terminal.
  2. Extract the downloaded archive:
# Replace <version> and <platform> as per the archive name
tar -xvf spring-tool-suite-<version>-<platform>.tar.gz
  3. Move the extracted directory to a location of your choice:
# Replace <bundle> as per the extracted folder name
mv sts-<bundle> /your_desired_path/

On Windows:

  1. Run the downloaded installer and follow the on-screen instructions to complete the installation.

Installing MySQL on Ubuntu and Windows

Setting up a reliable database management system is paramount to kick-start your journey into deep crawling and web data extraction. MySQL, a popular open-source relational database, provides the foundation for securely storing and managing the data you’ll gather through your crawling efforts. Here’s a step-by-step guide on how to install MySQL on both Ubuntu and Windows platforms:

Installing MySQL on Ubuntu:

  1. Open a terminal and run the following commands to ensure your system is up-to-date:
sudo apt update
sudo apt upgrade
  2. Run the following command to install the MySQL server package:
sudo apt install mysql-server
  3. After installation, start the MySQL service:
sudo systemctl start mysql.service
  4. Check if MySQL is running with the command:
sudo systemctl status mysql

Installing MySQL on Windows:

  1. Visit the official MySQL website and download the MySQL Installer for Windows.
  2. Run the downloaded installer and choose the “Developer Default” setup type. This will install MySQL Server and other related tools.
  3. During installation, you’ll be asked to configure MySQL Server. Set a strong root password and remember it.
  4. Follow the installer’s prompts to complete the installation.
  5. After installation, MySQL should start automatically. You can also start it manually from the Windows “Services” application.

Verifying MySQL Installation:

Regardless of your platform, you can verify the MySQL installation by opening a terminal or command prompt and entering the following command:

mysql -u root -p

You’ll be prompted to enter the MySQL root password you set during installation. If the connection is successful, you’ll be greeted with the MySQL command-line interface.

Now that you have Java, STS, and MySQL ready, you’re all set for the next phase of your deep crawling adventure. In the upcoming step, we’ll guide you through creating a Spring Boot starter project, setting the stage for your deep crawling endeavors. Let’s dive into this exciting phase of the journey!

4. Simplifying Spring Boot Project Setup with Spring Initializr

Setting up a Spring Boot project manually can feel like navigating a tricky maze of settings. But don’t worry, Spring Initializr is here to help! It’s like having a smart helper online that makes the process way easier, smoothing things out right from the start. Follow the steps below to create a Spring Boot project with Spring Initializr.

  1. Go to the Spring Initializr Website

Open your web browser and go to the Spring Initializr website. You can find it at start.spring.io.

  2. Choose Your Project Details

Here’s where you make the important choices for your project. Choose the project type and language: select Maven as the project type and Java as the language. For the Spring Boot version, go for a stable release (like 3.1.2). Then add details about your project, such as its name and description. It’s easy – just follow the example in the picture.

  3. Add the Cool Stuff

Time to add special features to your project! It’s like giving it superpowers. Include Spring Web (that’s important for Spring Boot projects), Spring Data JPA, and the MySQL Driver if you’re going to use a database. Don’t forget Lombok – it’s like a magic tool that saves time. We’ll talk more about these in the next parts of the blog.

  4. Get Your Project

After picking all the good stuff, click “GENERATE.” Your Starter project will download as a zip file. Once it’s done, open the zip file to see the beginning of your project.

Spring Initializr Settings

By following these steps, you’re ensuring your deep crawling adventure starts smoothly. Spring Initializr is like a trusty guide that helps you set up. In the upcoming section, we’ll guide you through importing your project into the Spring Tool Suite you’ve installed. Get ready to kick-start this exciting phase of your deep crawling journey!

5. Importing the Starter Project into Spring Tool Suite

Alright, now that you’ve got your Spring Boot starter project all set up and ready to roll, the next step is to import it into Spring Tool Suite (STS). It’s like inviting your project into a cozy workspace where you can work your magic. Here’s how you do it:

  1. Open Spring Tool Suite (STS)

First things first, fire up your Spring Tool Suite. It’s your creative hub where all the coding and crafting will happen.

  2. Import the Project

Navigate to the “File” menu and choose “Import.” A window will pop up with various options – select “Existing Maven Projects” and click “Next.”

  3. Choose Project Directory

Click the “Browse” button and locate the directory where you unzipped your Starter project. Select the project’s root directory and hit “Finish.”

  4. Watch the Magic

Spring Tool Suite will work its magic and import your project. It appears in the “Project Explorer” on the left side of your workspace.

  5. Ready to Roll

That’s it! Your Starter project is now comfortably settled in Spring Tool Suite. You’re all set to start building, coding, and exploring.

Import in STS

Bringing your project into Spring Tool Suite is like opening the door to endless possibilities. Now you have the tools and space to make your project amazing. The following section will delve into the project’s structure, peeling back the layers to reveal its components and inner workings. Get ready to embark on a journey of discovery as we unravel what lies within!

6. Understanding Your Project’s Blueprint: A Peek into Project Structure

Now that your Spring Boot starter project is comfortably nestled within Spring Tool Suite (STS), let’s take a tour of its inner workings. It’s like getting to know the layout of your new home before you start decorating it.

The Heart of the Project: Maven and pom.xml

At the core of your project lies a powerful tool called Maven. Think of Maven as your project’s organizer – it manages libraries, dependencies, and builds. The file named pom.xml is where all the project-related magic happens. It’s like the blueprint that tells Maven what to do and what your project needs. In our case, the pom.xml currently looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<parent>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-parent</artifactId>
<version>3.1.2</version>
<relativePath/> <!-- lookup parent from repository -->
</parent>
<groupId>com.example</groupId>
<artifactId>crawlbase</artifactId>
<version>0.0.1-SNAPSHOT</version>
<name>Crawlbase Crawler With Spring Boot</name>
<description>Demo of using Crawlbase Crawler with Spring Boot and how to do Deep Crawling</description>
<properties>
<java.version>17</java.version>
</properties>
<dependencies>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-data-jpa</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-web</artifactId>
</dependency>
<dependency>
<groupId>com.mysql</groupId>
<artifactId>mysql-connector-j</artifactId>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
<optional>true</optional>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-test</artifactId>
<scope>test</scope>
</dependency>
</dependencies>

<build>
<plugins>
<plugin>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-maven-plugin</artifactId>
<configuration>
<excludes>
<exclude>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
</exclude>
</excludes>
</configuration>
</plugin>
</plugins>
</build>
</project>

Meet the Libraries: Dependencies Unleashed

Remember those special features you added when creating the project? They’re called dependencies, like magical tools that make your project more powerful. You were actually adding these libraries when you included Spring Web, Spring Data JPA, MySQL Driver, and Lombok from the Spring Initializr. You can see those in the pom.xml above. They bring pre-built functionality to your project, saving you time and effort.

  • Spring Web: This library is your ticket to building Spring Boot web applications. It helps with things like handling requests and creating web controllers.
  • Spring Data JPA: This library is your ally if you’re dealing with databases. It simplifies database interactions and management, letting you focus on your project’s logic.
  • MySQL Driver: When you’re using MySQL as your database, this driver helps your project communicate with the database effectively.
  • Lombok: Say goodbye to repetitive code! Lombok reduces the boilerplate code you usually have to write, making your project cleaner and more concise.

Understand the Project Structure

As you explore your project’s folders, you’ll notice how everything is neatly organized. Your Java code goes into the src/main/java directory, while resources like configuration files and static assets reside in the src/main/resources directory. You’ll also find the application.properties file here – it’s like the control center of your project, where you can configure settings.

Project Structure

In the src/main/java directory, we will find a package containing a Java class with a main function. This class acts as the entry point when the Spring Boot project is executed. In our case, it is the CrawlbaseApplication.java file with the following code.

package com.example.crawlbase;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.scheduling.annotation.EnableAsync;

@SpringBootApplication
// Enables asynchronous method execution across the project
@EnableAsync
public class CrawlbaseApplication {

public static void main(String[] args) {
SpringApplication.run(CrawlbaseApplication.class, args);
}

}

Now that you’re familiar with the essentials, you can confidently navigate your project’s landscape. Having already covered how the Crawlbase Crawler works, we’re ready to start the coding journey and put its power to use in our project.

7. Starting the Coding Journey

With your development environment set up, it’s time to dive into coding. This section outlines the essential steps: creating controllers, services, and repositories, and updating the application.properties file. Before getting into the nitty-gritty of coding, we need to lay the groundwork and introduce key dependencies that will empower our project.

Integrating Crawlbase Dependency

As we’re harnessing the power of Crawlbase Crawler, we need to make sure we can seamlessly access it in our Java project. Thankfully, Crawlbase offers the Crawlbase Java library, simplifying the integration process. To incorporate it into our project, all it takes is adding the corresponding Maven dependency in the project’s pom.xml file:

<dependency>
<groupId>com.crawlbase</groupId>
<artifactId>crawlbase-java-sdk-pom</artifactId>
<version>1.0</version>
</dependency>

After adding this dependency, a quick Maven Install will ensure that the Crawlbase Java library is downloaded from the Maven repository and ready for action.

Integrating JSoup Dependency

Given that we’ll be diving deep into HTML content, having a powerful HTML parser at our disposal is crucial. Enter JSoup, a robust and versatile HTML parser for Java. It offers convenient methods for navigating and manipulating HTML structures. To leverage its capabilities, we need to include the JSoup library in our project through another Maven dependency:

<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.16.1</version>
</dependency>
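
Before moving on, here is a small, self-contained sketch of the kind of work JSoup will do for us during deep crawling: parsing HTML and resolving anchor links against a base URL. The HTML fragment and URLs are made up purely for illustration.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupSketch {

    public static void main(String[] args) {
        // A made-up HTML fragment for illustration
        String html = "<html><body>"
                + "<a href='/products'>Products</a>"
                + "<a href='https://example.com/about'>About</a>"
                + "</body></html>";

        // Parse the HTML and resolve relative links against the base URL
        Document document = Jsoup.parse(html, "https://example.com");
        for (Element link : document.getElementsByTag("a")) {
            // attr("abs:href") returns the absolute form of the href attribute
            System.out.println(link.attr("abs:href"));
        }
    }
}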

Setting Up the Database

Before we proceed further, let’s lay the foundation for our project by creating a database. Follow these steps to create a MySQL database:

  1. Open the MySQL Console: If you’re using Ubuntu, launch a terminal window. On Windows, open the MySQL Command Line Client or MySQL Shell.
  2. Log In to MySQL: Enter the following command and input your MySQL root password when prompted:
mysql -u root -p
  3. Create a New Database: Once logged in, create a new database with the desired name:
# Replace database_name with your chosen name
CREATE DATABASE database_name;

Planning the Models

Before diving headfirst into model planning, let’s understand what the crawler returns when URLs are pushed to it and what response we receive at our webhook. When we send URLs to the crawler, it responds with a Request ID, like this:

{ "rid": "1e92e8bff32c31c2728714d4" }

Once the crawler has effectively crawled the HTML content, it forwards the output to our webhook. The response will look like this:

Headers:
"Content-Type" => "text/plain"
"Content-Encoding" => "gzip"
"Original-Status" => 200
"PC-Status" => 200
"rid" => "The RID you received in the push call"
"url" => "The URL which was crawled"

Body:
The HTML of the page

// Body will be gzip encoded

So, taking this into account, we can consider the following database structure.

Database Schema

We don’t need to create the database tables manually; our Spring Boot project will initialize them automatically when we run it, with Hibernate handling the schema generation for us.

Designing the Model Files

With the groundwork laid in the previous section, let’s delve into the creation of our model files. In the com.example.crawlbase.models package, we’ll craft two essential models: CrawlerRequest.java and CrawlerResponse.java. These models encapsulate the structure of our database tables, and to ensure efficiency, we’ll employ Lombok to reduce boilerplate code.

CrawlerRequest Model:

package com.example.crawlbase.models;

import jakarta.persistence.CascadeType;
import jakarta.persistence.Entity;
import jakarta.persistence.FetchType;
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.Id;
import jakarta.persistence.OneToOne;
import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Data;
import lombok.NoArgsConstructor;

@Entity
@Data
@NoArgsConstructor
@AllArgsConstructor
@Builder(toBuilder = true)
public class CrawlerRequest {

@Id
@GeneratedValue
private Long id;

private String url;
private String type;
private Integer status;
private String rid;

@OneToOne(mappedBy = "crawlerRequest", cascade = CascadeType.ALL, fetch = FetchType.LAZY)
private CrawlerResponse crawlerResponse;

}

CrawlerResponse Model:

package com.example.crawlbase.models;

import jakarta.persistence.Column;
import jakarta.persistence.Entity;
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.Id;
import jakarta.persistence.JoinColumn;
import jakarta.persistence.OneToOne;
import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Data;
import lombok.NoArgsConstructor;

@Entity
@Data
@NoArgsConstructor
@AllArgsConstructor
@Builder(toBuilder = true)
public class CrawlerResponse {

@Id
@GeneratedValue
private Long id;

private Integer pcStatus;
private Integer originalStatus;

@Column(columnDefinition = "LONGTEXT")
private String pageHtml;

@OneToOne
@JoinColumn(name = "request_id")
private CrawlerRequest crawlerRequest;

}

Establishing Repositories for Both Models

Following the creation of our models, the next step is to establish repositories for seamless interaction between our project and the database. These repository interfaces serve as essential connectors, leveraging the JpaRepository interface to provide fundamental functions for data access. Hibernate, a powerful ORM tool, handles the underlying mapping between Java objects and database tables.

Create a package com.example.crawlbase.repositories and within it, create two repository interfaces, CrawlerRequestRepository.java and CrawlerResponseRepository.java.

CrawlerRequestRepository Interface:

package com.example.crawlbase.repositories;

import java.util.List;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.repository.query.Param;

import com.example.crawlbase.models.CrawlerRequest;

public interface CrawlerRequestRepository extends JpaRepository<CrawlerRequest, Long> {

// Find by column Name and value
List<CrawlerRequest> findByRid(String value);
}

CrawlerResponseRepository Interface:

package com.example.crawlbase.repositories;

import org.springframework.data.jpa.repository.JpaRepository;
import com.example.crawlbase.models.CrawlerResponse;

public interface CrawlerResponseRepository extends JpaRepository<CrawlerResponse, Long> {

}

Planning APIs and Request Body Mapper Classes

Harnessing the Crawlbase Crawler involves designing two crucial APIs: one for pushing URLs to the crawler and another serving as a webhook. To begin, let’s plan the request body structures for these APIs.

Push URL request body:

{
"urls": [
"http://www.3bfluidpower.com/",
.....
]
}

As for the webhook API’s request body, it must align with the Crawler’s response structure, as discussed earlier. You can read more about it here.

In line with this planning, we’ll create two request mapping classes in the com.example.crawlbase.requests package:

CrawlerWebhookRequest Class:

package com.example.crawlbase.requests;

import lombok.Builder;
import lombok.Data;

@Data
@Builder
public class CrawlerWebhookRequest {

private Integer pc_status;
private Integer original_status;
private String rid;
private String url;
private String body;

}

ScrapeUrlRequest Class:

package com.example.crawlbase.requests;

import java.util.List;

import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Data;
import lombok.NoArgsConstructor;

@Data
@Builder
@NoArgsConstructor
@AllArgsConstructor
public class ScrapeUrlRequest {

private List<String> urls;

}

Creating a ThreadPool to Optimize the Webhook

If we don’t optimize our webhook to handle a large volume of requests, it can become a hidden bottleneck. This is where multi-threading helps. In Spring, ThreadPoolTaskExecutor manages a pool of worker threads that execute asynchronous tasks concurrently, which is particularly useful when tasks can run independently and in parallel.

Create a new package com.example.crawlbase.config and add a ThreadPoolTaskExecutorConfig.java file to it.

ThreadPoolTaskExecutorConfig Class:

package com.example.crawlbase.config;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

@Configuration
public class ThreadPoolTaskExecutorConfig {

@Bean(name = "taskExecutor")
public ThreadPoolTaskExecutor taskExecutor() {
int cores = Runtime.getRuntime().availableProcessors();
ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
executor.setCorePoolSize(cores);
executor.setMaxPoolSize(cores);
executor.setQueueCapacity(Integer.MAX_VALUE);
executor.setThreadNamePrefix("Async-");
executor.initialize();
return executor;
}
}

Creating the Controllers and their Services

Since we need two APIs whose business logic is quite different, we will implement them in separate controllers, and separate controllers mean separate services. Let’s first create MainController.java and its service, MainService.java. This controller will implement the API used to push URLs to the Crawler.

Create a new package com.example.crawlbase.controllers for controllers and com.example.crawlbase.services for services in the project.

MainController Class:

package com.example.crawlbase.controllers;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

import com.example.crawlbase.requests.ScrapeUrlRequest;
import com.example.crawlbase.services.MainService;

import lombok.extern.slf4j.Slf4j;

@RestController
@RequestMapping("/scrape")
@Slf4j
public class MainController {

@Autowired
private MainService mainService;

@PostMapping("/push-urls")
public ResponseEntity<Void> pushUrlsToCrawler(@RequestBody ScrapeUrlRequest request) {
try {
if(!request.getUrls().isEmpty()) {
// Asynchronously Process The Request
mainService.pushUrlsToCrawler(request.getUrls(), "parent");
}
return ResponseEntity.status(HttpStatus.OK).build();
} catch (Exception e) {
log.error("Error in pushUrlsToCrawler function: " + e.getMessage());
return ResponseEntity.status(HttpStatus.BAD_REQUEST).build();
}
}

}

As you can see above, we have created a RESTful API, “@POST /scrape/push-urls”, which is responsible for handling requests to push URLs to the Crawler.

MainService Class:

package com.example.crawlbase.services;

import java.util.*;
import com.crawlbase.*;
import com.example.crawlbase.models.CrawlerRequest;
import com.example.crawlbase.repositories.CrawlerRequestRepository;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import lombok.extern.slf4j.Slf4j;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.scheduling.annotation.Async;
import org.springframework.stereotype.Service;

@Slf4j
@Service
public class MainService {

@Autowired
private CrawlerRequestRepository crawlerRequestRepository;

// Inject the values from the properties file
@Value("${crawlbase.token}")
private String crawlbaseToken;
@Value("${crawlbase.crawler}")
private String crawlbaseCrawlerName;

private final ObjectMapper objectMapper = new ObjectMapper();

@Async
public void pushUrlsToCrawler(List<String> urls, String type) {
HashMap<String, Object> options = new HashMap<String, Object>();
options.put("callback", "true");
options.put("crawler", crawlbaseCrawlerName);
options.put("callback_headers", "type:" + type);

API api = null;
CrawlerRequest req = null;
JsonNode jsonNode = null;
String rid = null;

for(String url: urls) {
try {
api = new API(crawlbaseToken);
api.get(url, options);
jsonNode = objectMapper.readTree(api.getBody());
rid = jsonNode.get("rid").asText();
if(rid != null) {
req = CrawlerRequest.builder().url(url).type(type).
status(api.getStatusCode()).rid(rid).build();
crawlerRequestRepository.save(req);
}
} catch(Exception e) {
log.error("Error in pushUrlsToCrawler function: " + e.getMessage());
}
}
}

}

In the above service, we created an @Async method to process requests asynchronously. The pushUrlsToCrawler function uses the Crawlbase library to push URLs to the Crawler and then saves the received RID and other attributes into the crawler_request table. To push URLs to the Crawler, we must use the “crawler” and “callback” parameters. We also use “callback_headers” to send a custom header, “type”, which tells us whether a URL is one we submitted ourselves or one discovered during deep crawling. You can read more about these and many other parameters here.

Now we have to implement the API we will use as a webhook. For this, create WebhookController.java in the com.example.crawlbase.controllers package and WebhookService.java in the com.example.crawlbase.services package.

WebhookController Class:

package com.example.crawlbase.controllers;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.http.HttpHeaders;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestHeader;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

import com.example.crawlbase.services.WebhookService;

import lombok.extern.slf4j.Slf4j;

@RestController
@RequestMapping("/webhook")
@Slf4j
public class WebhookController {

@Autowired
private WebhookService webhookService;

@PostMapping("/crawlbase")
public ResponseEntity<Void> crawlbaseCrawlerResponse(@RequestHeader HttpHeaders headers, @RequestBody byte[] compressedBody) {
try {
if(!headers.getFirst(HttpHeaders.USER_AGENT).equalsIgnoreCase("Crawlbase Monitoring Bot 1.0") &&
"gzip".equalsIgnoreCase(headers.getFirst(HttpHeaders.CONTENT_ENCODING)) &&
headers.getFirst("pc_status").equals("200")) {
// Asynchronously Process The Request
webhookService.handleWebhookResponse(headers, compressedBody);
}
return ResponseEntity.status(HttpStatus.OK).build();
} catch (Exception e) {
log.error("Error in crawlbaseCrawlerResponse function: " + e.getMessage());
return ResponseEntity.status(HttpStatus.BAD_REQUEST).build();
}
}

}

In the above code, you can see that we have created a RESTful API, “@POST /webhook/crawlbase”, which is responsible for receiving the crawled output from the Crawler. Notice that we ignore calls whose USER_AGENT is “Crawlbase Monitoring Bot 1.0”, because the Crawlbase monitoring bot uses this user agent to check whether the callback is live and accessible. There is no need to process these requests; we simply return a successful response to the Crawler.

While working with the Crawlbase Crawler, your server webhook should:

  • Be publicly reachable from Crawlbase servers
  • Be ready to receive POST calls and respond to them within 200ms
  • Respond with a status code 200, 201 or 204 without content

WebhookService Class:

package com.example.crawlbase.services;

import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.GZIPInputStream;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.http.HttpHeaders;
import org.springframework.scheduling.annotation.Async;
import org.springframework.stereotype.Service;

import com.example.crawlbase.models.CrawlerRequest;
import com.example.crawlbase.models.CrawlerResponse;
import com.example.crawlbase.repositories.CrawlerRequestRepository;
import com.example.crawlbase.repositories.CrawlerResponseRepository;
import com.example.crawlbase.requests.CrawlerWebhookRequest;

import lombok.extern.slf4j.Slf4j;

@Slf4j
@Service
public class WebhookService {

@Autowired
private CrawlerRequestRepository crawlerRequestRepository;
@Autowired
private CrawlerResponseRepository crawlerResponseRepository;
@Autowired
private MainService mainService;

@Async("taskExecutor")
public void handleWebhookResponse(HttpHeaders headers, byte[] compressedBody) {
try {
// Unzip the gziped body
GZIPInputStream gzipInputStream = new GZIPInputStream(new ByteArrayInputStream(compressedBody));
InputStreamReader reader = new InputStreamReader(gzipInputStream);

// Process the uncompressed HTML content
StringBuilder htmlContent = new StringBuilder();
char[] buffer = new char[1024];
int bytesRead;
while ((bytesRead = reader.read(buffer)) != -1) {
htmlContent.append(buffer, 0, bytesRead);
}

// The HTML String
String htmlString = htmlContent.toString();

// Create the request object
CrawlerWebhookRequest request = CrawlerWebhookRequest.builder()
.original_status(Integer.valueOf(headers.getFirst("original_status")))
.pc_status(Integer.valueOf(headers.getFirst("pc_status")))
.rid(headers.getFirst("rid"))
.url(headers.getFirst("url"))
.body(htmlString).build();

// Save CrawlerResponse Model
List<CrawlerRequest> results = crawlerRequestRepository.findByRid(request.getRid());
CrawlerRequest crawlerRequest = !results.isEmpty() ? results.get(0) : null;
if(crawlerRequest != null) {
// Build CrawlerResponse Model
CrawlerResponse crawlerResponse = CrawlerResponse.builder().pcStatus(request.getPc_status())
.originalStatus(request.getOriginal_status()).pageHtml(request.getBody()).crawlerRequest(crawlerRequest).build();
crawlerResponseRepository.save(crawlerResponse);
}

// Only Deep Crawl Parent Url
if(headers.getFirst("type").equalsIgnoreCase("parent")) {
deepCrawlParentResponse(request.getBody(), request.getUrl());
}
} catch (Exception e) {
log.error("Error in handleWebhookResponse function: " + e.getMessage());
}

}

private void deepCrawlParentResponse(String html, String baseUrl) {
Document document = Jsoup.parse(html);
Elements hyperLinks = document.getElementsByTag("a");
List<String> links = new ArrayList<String>();

String url = null;
for (Element hyperLink : hyperLinks) {
url = processUrl(hyperLink.attr("href"), baseUrl);
if(url != null) {
links.add(url);
}
}

mainService.pushUrlsToCrawler(links, "child");
}

private String processUrl(String href, String baseUrl) {
try {
if (href != null && !href.isEmpty()) {
baseUrl = normalizeUrl(baseUrl);
String processedUrl = normalizeUrl(href.startsWith("/") ? baseUrl + href : href);
if (isValidUrl(processedUrl) &&
!processedUrl.replace("http://", "").replace("https://", "").equals(baseUrl.replace("http://", "").replace("https://", "")) &&
// Only considering the URLs with same hostname
Objects.equals(new URI(processedUrl).getHost(), new URI(baseUrl).getHost())) {

return processedUrl;
}
}
} catch (Exception e) {
log.error("Error in processUrl function: " + e.getMessage());
}
return null;
}

private boolean isValidUrl(String string) {
String urlRegex = "((http|https)://)(www.)?"
+ "[a-zA-Z0-9@:%._\\+~#?&//=]"
+ "{2,256}\\.[a-z]"
+ "{2,6}\\b([-a-zA-Z0-9@:%"
+ "._\\+~#?&//=]*)";
Pattern pattern = Pattern.compile(urlRegex);
Matcher matcher = pattern.matcher(string);
return matcher.matches();
}

private String normalizeUrl(String url) throws URISyntaxException {
url = url.replace("//www.", "//");
url = url.split("#")[0];
url = url.endsWith("/") ? url.substring(0, url.length() - 1) : url;
return url;
}
}

The WebhookService class serves a crucial role in efficiently handling webhook responses and orchestrating the process of deep crawling. When a webhook response is received, the handleWebhookResponse method is invoked asynchronously from the WebhookController’s crawlbaseCrawlerResponse function. This method starts by unzipping the compressed HTML content and extracting the necessary metadata and HTML data. The extracted data is then used to construct a CrawlerWebhookRequest object containing details like status, request ID (rid), URL, and HTML content.

Next, the service checks if there’s an existing CrawlerRequest associated with the request ID. If found, it constructs a CrawlerResponse object to encapsulate the pertinent response details. This CrawlerResponse instance is then persisted in the database through the CrawlerResponseRepository.

However, what sets this service apart is its ability to facilitate deep crawling. If the webhook response type indicates a “parent” URL, the service invokes the deepCrawlParentResponse method. In this method, the HTML content is parsed using the Jsoup library to identify hyperlinks within the page. These hyperlinks, representing child URLs, are processed and validated. Only URLs belonging to the same hostname and adhering to a specific format are retained.

The MainService is then employed to push these valid child URLs into the crawling pipeline, using the “child” type as a flag. This initiates a recursive process of deep crawling, where child URLs are further crawled, expanding the exploration to multiple levels of interconnected pages. In essence, the WebhookService coordinates the intricate dance of handling webhook responses, capturing and preserving relevant data, and orchestrating the complicated process of deep crawling by intelligently identifying and navigating through parent and child URLs.

Updating application.properties File

In the final step, we will configure the application.properties file to define essential properties and settings for our project. This file serves as a central hub for configuring various aspects of our application. Here, we need to specify database-related properties, Hibernate settings, Crawlbase integration details, and logging preferences.

Ensure that your application.properties file includes the following properties:

# Database Configuration
spring.datasource.url=jdbc:mysql://localhost:3306/<database_name>
spring.datasource.username=<MySQL_username>
spring.datasource.password=<MySQL_password>

spring.datasource.driver-class-name=com.mysql.cj.jdbc.Driver
spring.jpa.hibernate.ddl-auto=update

# Crawlbase Crawler Integration
crawlbase.token=<Your_Crawlbase_Normal_Token>
crawlbase.crawler=<Your_TCP_Crawler_Name>

logging.file.name=logs/<log-file-name>.log

You can find your Crawlbase TCP (normal) token here. Remember to replace the placeholders in the above configuration with your actual values, as determined in the previous sections. This configuration is vital for establishing database connections, synchronizing Hibernate operations, integrating with the Crawlbase API, and managing logging for your application. By carefully adjusting these properties, you’ll ensure seamless communication between the different components and services within your project.

8. Running the Project and Initiating Deep Crawling

With the coding phase complete, the next step is to set the project in motion. Spring Boot ships with an embedded Apache Tomcat server, which makes the transition from development to production smooth and integrates seamlessly with prominent platforms-as-a-service. Running the project within Spring Tool Suite (STS) is straightforward:

  • Right-click the project in the STS project structure tree.
  • Navigate to the “Run As” menu, and
  • Select “Spring Boot App”.

This action triggers the project to launch on localhost, port 8080.

Spring Boot Server Running

Making the Webhook Publicly Accessible

Since the webhook we’ve established resides locally on our system at localhost, port 8080, we need to grant it public accessibility. Enter Ngrok, a tool that creates secure tunnels, granting remote access without the need to manipulate network settings or router ports. Ngrok is executed on port 8080 to render our webhook publicly reachable.
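
Assuming ngrok is already installed on your machine, exposing the local server takes a single command (8080 being the port our application runs on):

ngrok http 8080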

Ngrok Server

Ngrok conveniently provides a public Forwarding URL, which we will later utilize with Crawlbase Crawler.

Creating the Crawlbase Crawler

Recall our earlier discussion on Crawlbase Crawler creation via the Crawlbase dashboard. Armed with a publicly accessible webhook through Ngrok, crafting the crawler becomes effortless.

Create New Crawler

In the depicted instance, the ngrok forwarding URL is combined with the webhook path “/webhook/crawlbase” to form the callback, giving us a fully public webhook address. We name our crawler “test-crawler”, the name that will go into the project’s application.properties file, and we select the TCP Crawler, as planned. Upon hitting the “Create Crawler” button, the crawler is created with the specified configuration.

Initiating Deep Crawling by Pushing URLs

Following the creation of the crawler and the incorporation of its name into the application.properties file, we’re poised to interact with the “@POST /scrape/push-urls” API. Through this API, we send URLs to the crawler, triggering the deep crawl process. Let’s exemplify this by pushing the URL http://www.3bfluidpower.com/.

Postman Request

With this proactive approach, we set the wheels of deep crawling in motion, utilizing the power of Crawlbase Crawler to delve into the digital landscape and unearth valuable insights.
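
If you prefer code over Postman, the same request can be sent with Java’s built-in HTTP client. This is only a convenience sketch; the endpoint, port, and request body mirror the defaults used above.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PushUrlsClient {

    public static void main(String[] args) throws Exception {
        // JSON body matching the ScrapeUrlRequest mapper: {"urls": [...]}
        String body = "{\"urls\": [\"http://www.3bfluidpower.com/\"]}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/scrape/push-urls"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        // The controller returns an empty 200 response on success
        HttpResponse<Void> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.discarding());
        System.out.println("Status: " + response.statusCode());
    }
}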

9. Analyzing Output in the Database: Unveiling Insights

Upon pushing a URL to the Crawler, a Request ID (RID) is returned, as discussed earlier, marking the start of the page’s crawling process on the Crawler’s end. This approach eliminates the wait time typically associated with crawling, enhancing the efficiency and effectiveness of data acquisition. Once the Crawler finishes crawling, it transmits the output to our webhook.

The Custom Headers parameter, particularly the “type” parameter, proves instrumental in our endeavor. Its presence allows us to distinguish between the URLs we pushed and those discovered during deep crawling. When the type is designated as “parent,” the URL stems from our submission, prompting us to extract fresh URLs from the crawled HTML and subsequently funnel them back into the Crawler—this time categorized as “child.” This strategy ensures that only the URLs we introduced undergo deep crawling, streamlining the process.

In our current scenario, considering a singular URL submission to the Crawler, the workflow unfolds as follows: upon receiving the crawled HTML, the webhook service stores it in the crawler_response table. Subsequently, the deep crawling of this HTML takes place, yielding newly discovered URLs that are then pushed to the Crawler.

crawler_request Table:

Crawler Request Table

As you can see above, our webhook found 16 new URLs in the HTML of the page whose URL we pushed to the Crawler in the previous section (that URL is saved in the database with type “parent”). We push all the newly found URLs back to the Crawler as type “child” to deep crawl the given site. The Crawler crawls each of them and delivers the output to our webhook, and we save the crawled HTML in the crawler_response table.

crawler_response Table:

Crawler Response Table

As you can see in the table view above, all the information received at our webhook is saved in the table. Once the HTML reaches the webhook, we can scrape any information we want from it. This end-to-end process shows how deep crawling works, allowing us to surface valuable information from web content.
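
For example, here is a minimal, hypothetical sketch of how the stored HTML could be scraped further with JSoup. The HtmlAnalysisService class below is not part of the project above; it only illustrates that, once the page HTML sits in the crawler_response table, extracting a specific piece of information (here, the page title) takes just a few lines of code.

package com.example.crawlbase.services;

import org.jsoup.Jsoup;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

import com.example.crawlbase.models.CrawlerResponse;
import com.example.crawlbase.repositories.CrawlerResponseRepository;

// Hypothetical service, shown only to illustrate scraping the stored HTML
@Service
public class HtmlAnalysisService {

    @Autowired
    private CrawlerResponseRepository crawlerResponseRepository;

    // Returns the <title> of a stored page, or null if the row or HTML is missing
    public String extractTitle(Long responseId) {
        CrawlerResponse response = crawlerResponseRepository.findById(responseId).orElse(null);
        if (response == null || response.getPageHtml() == null) {
            return null;
        }
        return Jsoup.parse(response.getPageHtml()).title();
    }
}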

10. Conclusion

Our expedition into the world of web data extraction has reached its destination, and as we wrap up this blog, let’s reflect on the remarkable tools and insights we’ve uncovered.

Starting with the Spring Boot starter project, we’ve paved the way for efficient development using Spring Initializr. Navigating Spring Tool Suite, we’ve gained confidence in crafting our coding endeavors. Armed with libraries like Jsoup and Lombok, we’ve set the stage for seamless data manipulation.

However, the true game-changer has been the Crawlbase Crawler. This dynamic tool has revolutionized web scraping. The Crawler has made data collection efficient and precise by working asynchronously and delivering real-time data through webhooks. With its workflow, we’re empowered to make informed decisions using the freshest insights from the digital realm.

As we conclude, we stand ready for innovation. Equipped with Spring Boot, Crawlbase Crawler, and efficient coding, we’re armed to turn data into a strategic asset. Whether it’s e-commerce insights, market analysis, or staying current with dynamic content, these tools are our allies in success.

In the fast-evolving digital world, our approaches to data must evolve as well. The wisdom gained here will guide us to new horizons. With the Spring Boot, the ingenuity of Crawlbase Crawler, and coding skills for efficiency, we’re poised for excellent results.

Thank you for joining us on this journey. You can find the full source code of the project on GitHub here. May your web data endeavors be as transformative as the tools and knowledge you’ve gained here. As the digital landscape continues to unfold, remember that the power to innovate is in your hands.

11. Frequently Asked Questions

Q: What is Crawlbase Crawler, and how does it work?

Crawlbase Crawler is a dynamic web data extraction tool that operates asynchronously, transforming the way you gather information from websites. Its unique approach allows it to process requests independently and deliver real-time data through webhooks, eliminating the need for constant manual oversight.

Crawlbase Crawler follows a seamless workflow. Users submit URLs via the Crawling API, which are then processed asynchronously. The Crawler extracts the requested data and delivers it to your server’s endpoint using webhooks, ensuring you receive freshly extracted information in real-time.

Q: What are the benefits of using Crawlbase Crawler?

Crawlbase Crawler presents a revolutionary approach to web data extraction, leveraging asynchronous processing to maximize efficiency. Seamlessly integrating with webhooks eliminates the need for constant monitoring, ensuring real-time data delivery. This novel workflow provides a competitive edge, allowing immediate access to freshly extracted data for swift decision-making.

With Crawlbase Crawler, you unlock the potential for streamlined operations, gaining insights into the dynamic web landscape. This tool’s distinct advantages include efficient resource utilization, real-time data availability, simplified workflows through webhook integration, and the capacity to harness the latest web content for informed choices.

Q: How does Crawlbase Crawler support deep crawling?

Crawlbase Crawler supports deep crawling by returning crawled HTML to the user. Users can then extract new URLs from the HTML and submit them back to the Crawler for further exploration. This iterative process allows you to progressively navigate through linked pages and gather comprehensive data from multiple layers of a website, enabling a thorough understanding of its content structure and connections.

Q: Do I need to use Java to use the Crawler?

No, you do not need to use Java exclusively to use the Crawlbase Crawler. The Crawler provides multiple libraries for various programming languages, enabling users to interact with it using their preferred language. Whether you are comfortable with Python, JavaScript, Java, Ruby, or other programming languages, Crawlbase has you covered. Additionally, Crawlbase offers APIs that allow users to access the Crawler’s capabilities without relying on specific libraries, making it accessible to a wide range of developers with different language preferences and technical backgrounds. This flexibility ensures that you can seamlessly integrate the Crawler into your projects and workflows using the language that best suits your needs.