Published: 6 hours, 37 minutes ago

Web scraping using an HTML parser (let's clarify)

Before we start: it's a common point of confusion among newbie web scrapers and laypeople, who often think they can web scrape "with" HTML.

Let's clarify: you don't scrape with HTML, you scrape data from HTML.

Think of HTML as the blueprint of a house. You wouldn't use the blueprint itself to knock down a wall or take out a window. Instead, you'd use tools - a sledgehammer, a crowbar - and follow the blueprint to find the exact window you want to remove. In web scraping, HTML is that blueprint, and we need programming tools to extract the data we need.

What is Web Scraping?

Web scraping is simply the automated process of extracting information from websites. Instead of a human manually copying and pasting data from a webpage into a spreadsheet, a script does it automatically. This is incredibly powerful for gathering data for analysis, price monitoring, lead generation, and much more.

The Role of HTML in Web Scraping

Every webpage you visit is built with HTML. It provides the structure for all the content you see - the headings, paragraphs, images, and links. This structure is a goldmine for a web scraper.

HTML uses tags (like <h1>, <p>, <a>), classes, and IDs to organize content. As a scraper, you use these markers as signposts to locate the exact data you want. For example, if you wanted to scrape the titles of all articles on a blog's homepage, you might tell your script: "Go find every <h2> tag with the class article-title and give me the text inside it."
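That "go find every <h2> with the class article-title" instruction translates almost word-for-word into Beautiful Soup. Here's a minimal, self-contained sketch; the HTML snippet and its class names are made up for illustration:

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML snippet standing in for a blog homepage
html = """
<html>
  <body>
    <h2 class="article-title">First post</h2>
    <h2 class="article-title">Second post</h2>
    <h2 class="sidebar-title">Popular tags</h2>
  </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

# "Find every <h2> tag with the class article-title and give me the text inside it."
# Note: class_ has a trailing underscore because 'class' is a Python keyword.
for h2 in soup.find_all('h2', class_='article-title'):
    print(h2.get_text())
```

This prints "First post" and "Second post", skipping the sidebar heading, because the class filter only matches the markers we asked for.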

So, while HTML doesn't do anything by itself, understanding it is one of the most critical skills for successful web scraping.

Side note: Another critical skill is knowing when you DON'T need to scrape HTML at all, by finding and scraping a website's internal REST API instead.
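To see why an internal API is so attractive: it hands you structured JSON instead of markup you have to parse. The payload below is made up; in practice you'd find the real endpoint in your browser's Network tab and fetch it with requests.get(url).json():

```python
import json

# Made-up payload, standing in for what a site's internal API might return.
# With a real endpoint you'd get this via requests.get(api_url).json().
payload = '''
{
  "articles": [
    {"title": "Post one", "views": 120},
    {"title": "Post two", "views": 87}
  ]
}
'''

# The data is already structured -- no tags, classes, or IDs to hunt through
data = json.loads(payload)
for article in data['articles']:
    print(article['title'], '-', article['views'], 'views')
```

No HTML parser needed at all: the fields you want are already named keys in a dictionary.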

So, What Tools Do You Actually Use?

To perform web scraping, you need a programming language to act as your "engine."

Since I'm (mostly) a Python developer, I will show you some examples using Python. However, all of these can be done with pretty much any other popular programming language. (You can even build a scraper with C.)

The process generally involves two main steps and two key Python libraries:

  1. Fetching the Page: First, you need to download the raw HTML code of a webpage, just like your browser does. For this, we use a library called requests. It sends an HTTP request to the website's server and fetches the page content. (In a real scenario you would probably use a Web Scraping API to handle the issues and complexities that usually come with web scraping. For the simple example below, the requests library will be enough.)
  2. Parsing the HTML: The raw HTML you get is just a long string of text. It's not easy to work with directly. We need a tool to "parse" it into a structured object we can navigate. The most popular library for this is Beautiful Soup. It turns the messy HTML string into a structured tree of objects that you can easily search.

A Simple 3-Step Example (Python)

Let's see it in action. We'll write a simple script to scrape the title and main heading from a webpage.

Step 1: Install the Libraries

If you don't have them installed, open your terminal and run:

pip install requests beautifulsoup4

Step 2: Fetch & Parse the HTML

Next, we'll write the Python code to get a webpage and parse it with Beautiful Soup.

import requests
from bs4 import BeautifulSoup

# The URL of the page we want to scrape
url = 'https://bestscrapingtools.com/'

# Step 1: Fetch the page content using requests
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Step 2: Parse the HTML content with Beautiful Soup
    soup = BeautifulSoup(response.content, 'html.parser')
    print("Successfully parsed the page!")
else:
    print(f"Error: Failed to retrieve the webpage. Status code: {response.status_code}")

Step 3: Extract the Data

Now that we have our soup object, we can easily find elements within it. Let's grab the page's <title> and its <h1> heading.

# (assuming the code from Step 2 has run successfully)

# Find the <title> tag and get its text
page_title = soup.find('title').get_text()
print(f"Page Title: {page_title}")

# Find the <h1> tag and get its text
main_heading = soup.find('h1').get_text()
print(f"Main Heading: {main_heading}")
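find() returns only the first matching element; when you want every match, use find_all(), just like in the blog-titles example earlier. Here's a self-contained sketch with a made-up snippet (in the script above you'd reuse the existing soup object instead):

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a fetched page
html = '<p>See <a href="/docs">the docs</a> and <a href="/blog">the blog</a>.</p>'
soup = BeautifulSoup(html, 'html.parser')

# find_all() returns a list of every matching tag, not just the first
for link in soup.find_all('a'):
    # Tag attributes like href are accessed with dictionary-style indexing
    print(link.get_text(), '->', link['href'])
```

This prints "the docs -> /docs" and "the blog -> /blog", which is exactly the pattern you'd use to collect every article URL from a listing page.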

Running this full script would give you the following output:

Successfully parsed the page!
Page Title: All Tools for Web Scraping - Find the Best Scraping Tools
Main Heading: Find the best scraping tools for your use-case

And just like that, you've performed a basic web scrape!

Conclusion

So, can you perform "web scraping using HTML"? Not quite. But you absolutely cannot perform web scraping without it.

The key takeaway is this: Web scraping involves using programming tools (like Python with Requests and Beautiful Soup) to parse and extract data from a website's HTML structure.

The better you understand HTML, the more powerful your scraping abilities will be. Now that you've got the concept down, you're ready to start building more complex scrapers.