Ever needed to extract data from a website that doesn't offer an API? You're not alone. Whether you're gathering data for a project, monitoring prices, or just satisfying your curiosity, web scraping is an essential developer skill.
For Java developers, there's a fantastic tool that makes this process incredibly simple: Jsoup.
Jsoup is an open-source Java library designed to parse, manipulate, and clean HTML. It lets you use CSS selectors to find and extract data from a document, much like you would with jQuery in a browser.
In this guide, you'll learn how to connect to a website, select the exact elements you need, and extract their data using Jsoup.
All you need is a Java project. To add Jsoup, include the following dependency for your build tool:
Maven:
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.17.2</version>
</dependency>
Gradle:
implementation 'org.jsoup:jsoup:1.17.2'
First, you need to get the HTML document. Jsoup can fetch a webpage directly from a URL or parse an existing HTML string you already have.
To fetch a live webpage, use the one-liner Jsoup.connect(url).get():
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;

public class JsoupIntro {
    public static void main(String[] args) {
        try {
            // Jsoup connects to the URL and fetches the raw HTML.
            // The .get() method parses it into a Document object.
            Document doc = Jsoup.connect("https://example.com/").get();
            System.out.println(doc.title()); // Outputs: Example Domain
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
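Jsoup.parse() covers the second case: if you already have the HTML as a string (from a file, a cache, or another HTTP client), no network call is needed. Here's a minimal sketch, with a made-up HTML snippet standing in for your own:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupParseString {
    public static void main(String[] args) {
        // A made-up snippet standing in for HTML you already have
        String html = "<html><head><title>Offline Page</title></head>"
                + "<body><p>Hello from a string!</p></body></html>";

        // Jsoup.parse() builds a Document without touching the network
        Document doc = Jsoup.parse(html);

        System.out.println(doc.title()); // Outputs: Offline Page
    }
}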
This is where Jsoup shines. Once you have your Document object, you can use the .select() method with familiar CSS selectors to find elements. It's powerful, intuitive, and the core of nearly every Jsoup operation.
Here are some of the most common selectors:
By Tag: Find all <h1> elements.
Elements h1s = doc.select("h1");
By Class: Find any element with the class article-title.
Elements titles = doc.select(".article-title");
By ID: Find the element with the ID main-content.
Element mainContent = doc.selectFirst("#main-content");
By Attribute: Find all <a> tags that have an href attribute.
Elements links = doc.select("a[href]");
Combining Selectors: Find all <a> tags inside a div with the class content.
Elements contentLinks = doc.select("div.content a");
The .select() method returns an Elements object (a list of Element), while .selectFirst() returns a single Element or null.
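In practice, that means looping over an Elements result and null-checking anything returned by .selectFirst(). A quick sketch (the selectors here are placeholders):

// Iterate over every matching element
Elements headings = doc.select("h2");
for (Element heading : headings) {
    System.out.println(heading.text());
}

// .selectFirst() returns null when nothing matches, so guard it
Element banner = doc.selectFirst(".banner"); // hypothetical class name
if (banner != null) {
    System.out.println(banner.text());
}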
Once you've selected your element(s), you need to pull the data out. There are three main methods for this:
.text(): Gets the visible, human-readable text of the element and its children.
.attr("attributeName"): Gets the value of an attribute (e.g., href from a link, src from an image).
.html(): Gets the inner HTML of the element.
Here's how you'd use them together:
// Assume 'doc' is our Document object from the previous step
Element link = doc.selectFirst("a"); // Select the first link
if (link != null) {
    // Get the visible text
    String linkText = link.text(); // "More information..."

    // Get the value of the 'href' attribute
    String url = link.attr("href"); // "http://www.iana.org/domains/example"

    // Get the inner HTML
    String linkHtml = link.html(); // "More information..."

    System.out.println("Text: " + linkText);
    System.out.println("URL: " + url);
}
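The same pattern works for any attribute. For example, to list the source of every image on a page (assuming it contains <img> tags):

Elements images = doc.select("img[src]");
for (Element img : images) {
    System.out.println(img.attr("src")); // e.g. "images/logo.png"
}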
Let's scrape the titles and URLs of all the books on the first page of books.toscrape.com, a website designed for this purpose.
The goal is to:
1. Connect to the URL.
2. Select the container for each book.
3. For each book, find the <a> tag of its title.
4. Extract the title text and the href attribute.
5. Print the results.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;

public class BookScraper {
    public static void main(String[] args) {
        String url = "https://books.toscrape.com/";
        try {
            Document doc = Jsoup.connect(url).get();

            // Each book is in an <article> with the class "product_pod".
            // The title and link are in an <a> tag within an <h3>.
            Elements books = doc.select("article.product_pod h3 a");
            System.out.println("Found " + books.size() + " books.");

            for (Element book : books) {
                // Get the book title from the 'title' attribute of the link
                String title = book.attr("title");

                // Get the book's URL from the 'href' attribute.
                // .absUrl() resolves the relative URL (e.g., "catalogue/...")
                // into a full URL.
                String bookUrl = book.absUrl("href");

                System.out.println("Title: " + title);
                System.out.println("URL: " + bookUrl);
                System.out.println("---");
            }
        } catch (IOException e) {
            System.err.println("Error fetching the URL: " + e.getMessage());
            e.printStackTrace();
        }
    }
}
Jsoup's primary limitation is that it does not execute JavaScript. It works directly with the raw HTML returned by the server. This makes it perfect and incredibly fast for static or server-rendered websites.
If the content you need is loaded dynamically with JavaScript after the page loads, Jsoup won't see it. For those more complex cases, you'll need a headless browser tool like Selenium, Puppeteer, or (my personal favorite) Playwright.
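Before reaching for a full browser, though, it's worth ruling out simpler problems: some servers reject Jsoup's default user agent or respond slowly. Jsoup.connect() returns a Connection you can configure before calling .get(). A sketch with example values:

Document doc = Jsoup.connect("https://example.com/")
        .userAgent("Mozilla/5.0 (compatible; MyScraper/1.0)") // example UA string
        .timeout(10_000) // 10 seconds, in milliseconds
        .get();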
You've now seen how easy it is to start web scraping in Java. Jsoup provides a clean and powerful API to connect to the web, parse any HTML document, and extract the exact data you need.
The core workflow is always the same:
1. Connect and parse the Document.
2. Select the Element(s) you need with CSS selectors.
3. Extract the data using .text() or .attr().
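Condensed into a single snippet (the URL and selector are placeholders):

Document doc = Jsoup.connect("https://example.com/").get(); // 1. Connect and parse
Element heading = doc.selectFirst("h1");                    // 2. Select
if (heading != null) {
    System.out.println(heading.text());                     // 3. Extract
}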
Now, try modifying the example to scrape prices, or point it at a different website. Happy scraping!