Surprisingly often, I see people on the internet asking whether they can build a web scraper in C.
Why surprisingly?
Because it's not really necessary IMO.
There are excellent web scraping APIs that will get you scraping productively much faster, and most likely more efficiently as well.
The only (commercial) reasons I can think of to build a web scraper in C are:
- Your bottleneck is CPU or memory per request.
- You need a tiny, controllable fetcher at very high QPS.
- You're working in a very constrained environment.
In those cases, C helps. For typical web scraping (especially JavaScript-heavy sites), it's rarely worth it end-to-end.
And even then, I wonder if it wouldn't be easier to call an intermediary API (one that abstracts the scraping mechanics away in, say, Python) from C instead of baking a clunky, hard-to-maintain solution into your C code.
In any case, I think building a tiny scraper with C would be fun, for educational purposes.
Let's get started.
In this example I'm working on Ubuntu 24.04, but any Linux distro should do with almost the same commands.
sudo apt update
sudo apt install build-essential libcurl4-openssl-dev libxml2-dev
Create a file named scraper.c and paste the code below into it. This program fetches the HTML from a URL you provide and uses an XPath query to find the title.
// scraper.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <curl/curl.h>
#include <libxml/HTMLparser.h>
#include <libxml/xpath.h>

// Struct to hold the fetched data from curl
struct MemoryStruct {
    char *memory;
    size_t size;
};

// Callback function for curl to write data into our MemoryStruct
static size_t
WriteMemoryCallback(void *contents, size_t size, size_t nmemb, void *userp) {
    size_t realsize = size * nmemb;
    struct MemoryStruct *mem = (struct MemoryStruct *)userp;

    char *ptr = realloc(mem->memory, mem->size + realsize + 1);
    if(!ptr) {
        fprintf(stderr, "not enough memory (realloc returned NULL)\n");
        return 0;  // returning 0 tells curl to abort the transfer
    }

    mem->memory = ptr;
    memcpy(&(mem->memory[mem->size]), contents, realsize);
    mem->size += realsize;
    mem->memory[mem->size] = 0;  // keep the buffer NUL-terminated

    return realsize;
}

// Function to extract the title using XPath
void find_title(const char *buffer) {
    htmlDocPtr doc = htmlReadMemory(buffer, strlen(buffer), NULL, NULL,
                                    HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
    if (doc == NULL) {
        fprintf(stderr, "Failed to parse document\n");
        return;
    }

    xmlXPathContextPtr context = xmlXPathNewContext(doc);
    xmlXPathObjectPtr result = xmlXPathEvalExpression((xmlChar *)"//title", context);

    if (result && result->nodesetval && result->nodesetval->nodeNr > 0) {
        xmlNodePtr titleNode = result->nodesetval->nodeTab[0];
        xmlChar *titleContent = xmlNodeGetContent(titleNode);
        if (titleContent) {
            printf("Title: %s\n", titleContent);
            xmlFree(titleContent);
        }
    } else {
        printf("Title tag not found.\n");
    }

    xmlXPathFreeObject(result);
    xmlXPathFreeContext(context);
    xmlFreeDoc(doc);
}

int main(int argc, char *argv[]) {
    if (argc != 2) {
        fprintf(stderr, "Usage: %s <url>\n", argv[0]);
        return 1;
    }

    CURL *curl_handle;
    CURLcode res;

    struct MemoryStruct chunk;
    chunk.memory = malloc(1);  // grown as needed by realloc in the callback
    chunk.size = 0;

    curl_global_init(CURL_GLOBAL_ALL);
    curl_handle = curl_easy_init();

    curl_easy_setopt(curl_handle, CURLOPT_URL, argv[1]);
    curl_easy_setopt(curl_handle, CURLOPT_WRITEFUNCTION, WriteMemoryCallback);
    curl_easy_setopt(curl_handle, CURLOPT_WRITEDATA, (void *)&chunk);
    curl_easy_setopt(curl_handle, CURLOPT_USERAGENT, "libcurl-agent/1.0");

    res = curl_easy_perform(curl_handle);
    if(res != CURLE_OK) {
        fprintf(stderr, "curl_easy_perform() failed: %s\n", curl_easy_strerror(res));
    } else {
        find_title(chunk.memory);
    }

    curl_easy_cleanup(curl_handle);
    free(chunk.memory);
    curl_global_cleanup();
    xmlCleanupParser();  // global libxml2 cleanup; call once, at program exit

    return 0;
}
Now, compile the code using gcc. You need to link against the curl and libxml2 libraries; the xml2-config tool supplies the correct compiler and linker flags.
gcc scraper.c -o scraper $(xml2-config --cflags --libs) -lcurl
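If xml2-config isn't available on your system, pkg-config should give you the same flags (assuming the dev packages above are installed):
gcc scraper.c -o scraper $(pkg-config --cflags --libs libxml-2.0) -lcurl
Then run the scraper against a URL: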
./scraper https://bestscrapingtools.com
You should see output like this:
Title: All Tools for Web Scraping - Find the Best Scraping Tools
The above script is meant to be a starting point for building scrapers in C. There is a lot to learn from just this small example!
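For instance, it only takes a small change to the XPath logic in find_title to pull out every link on the page instead of the title. Here's a minimal sketch of such a variant, reusing the same includes as scraper.c; the function name list_links is just an illustrative choice.

// A possible variant of find_title: print the href of every <a> tag.
// Reuses the same headers as scraper.c (libxml/HTMLparser.h, libxml/xpath.h).
void list_links(const char *buffer) {
    htmlDocPtr doc = htmlReadMemory(buffer, strlen(buffer), NULL, NULL,
                                    HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
    if (doc == NULL) {
        fprintf(stderr, "Failed to parse document\n");
        return;
    }

    xmlXPathContextPtr context = xmlXPathNewContext(doc);
    // Select every anchor element that actually has an href attribute
    xmlXPathObjectPtr result = xmlXPathEvalExpression((xmlChar *)"//a[@href]", context);

    if (result && result->nodesetval) {
        for (int i = 0; i < result->nodesetval->nodeNr; i++) {
            xmlNodePtr node = result->nodesetval->nodeTab[i];
            xmlChar *href = xmlGetProp(node, (xmlChar *)"href");
            if (href) {
                printf("Link: %s\n", href);
                xmlFree(href);
            }
        }
    }

    xmlXPathFreeObject(result);
    xmlXPathFreeContext(context);
    xmlFreeDoc(doc);
}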
The cURL and libxml2 combination we used is powerful for fetching and parsing static HTML. It directly requests the HTML content from a server, and that's all it sees.
However, a huge portion of the modern web is dynamic. Websites use JavaScript to load data, render content, and build the page after the initial HTML has been delivered to your browser. This process is often called JavaScript rendering.
Our C scraper doesn't run JavaScript. It's not a browser.
When you use our script on a JavaScript-heavy site (like a Single Page Application built with React, Vue, or Angular), the HTML that cURL downloads might just be a basic skeleton with <script> tags and a "Loading..." message. The actual product information, articles, or user comments you want to scrape aren't in that initial response. They only appear after a real browser executes the JavaScript, makes further API requests, and updates the page.
This is where the argument for using C for general-purpose web scraping really starts to fall apart. The solutions are complex and often involve delegating the hard work to other tools.
Reverse-Engineering Internal APIs: You can use browser developer tools to monitor the network requests a website makes as it loads. Often, the JavaScript code is fetching the data it needs from a hidden API. If you can figure out how that API works, you can use cURL in your C program to request the data directly from that API, often in a clean JSON format. This is the most efficient method but is brittle - if the site developers change their API, your scraper breaks.
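To make that concrete, here is a rough sketch of what calling such an endpoint from C could look like, reusing the MemoryStruct and WriteMemoryCallback from scraper.c. The endpoint, headers, and function name here are made up for illustration; a real site's URL, parameters, and required headers are whatever you observe in the browser's network tab.

// Sketch: fetch JSON from a (hypothetical) internal API discovered via devtools.
// Assumes MemoryStruct and WriteMemoryCallback from scraper.c are in scope.
int fetch_api_json(const char *url) {
    struct MemoryStruct chunk = { .memory = malloc(1), .size = 0 };

    CURL *handle = curl_easy_init();
    if (!handle) { free(chunk.memory); return 1; }

    // Many internal APIs expect browser-like headers; mimic the ones you saw.
    struct curl_slist *headers = NULL;
    headers = curl_slist_append(headers, "Accept: application/json");
    headers = curl_slist_append(headers, "X-Requested-With: XMLHttpRequest");

    curl_easy_setopt(handle, CURLOPT_URL, url);
    curl_easy_setopt(handle, CURLOPT_HTTPHEADER, headers);
    curl_easy_setopt(handle, CURLOPT_WRITEFUNCTION, WriteMemoryCallback);
    curl_easy_setopt(handle, CURLOPT_WRITEDATA, (void *)&chunk);
    curl_easy_setopt(handle, CURLOPT_USERAGENT, "Mozilla/5.0 (compatible; my-scraper/1.0)");

    CURLcode res = curl_easy_perform(handle);
    if (res == CURLE_OK) {
        // The body is raw JSON; hand it to a JSON parser such as cJSON,
        // or just print it for now.
        printf("%s\n", chunk.memory);
    } else {
        fprintf(stderr, "request failed: %s\n", curl_easy_strerror(res));
    }

    curl_slist_free_all(headers);
    curl_easy_cleanup(handle);
    free(chunk.memory);
    return res == CURLE_OK ? 0 : 1;
}

// Example call, with a made-up endpoint:
// fetch_api_json("https://example.com/api/products?page=1");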
Using a Headless Browser: The most common solution for JavaScript rendering is to use a headless browser. This is a real web browser, like Chrome or Firefox, that runs in the background without a graphical user interface. You can automate it to navigate to a page, wait for all the JavaScript to execute and the content to load, and then extract the final, fully rendered HTML.
The problem? Controlling a headless browser from C is not straightforward. The most popular automation tools (like Selenium, Playwright, or Puppeteer) are built for languages like Python, JavaScript, and Java.
To make this work in C, you would typically have to:
1. Install a headless browser and an automation tool built for another language (Node.js with Puppeteer, or Python with Playwright or Selenium).
2. Write a small script in that language that loads the page, waits for the JavaScript to finish, and dumps the fully rendered HTML.
3. Launch that script from your C program, for example via a system() call, and read its output back in.
At this point, you're just using C as a wrapper, which raises the question: why not write the whole web scraper in the language that has the best tools for the job?
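For completeness, here's roughly what that wrapper pattern could look like. The C program shells out to a hypothetical render.js script (which you'd have to write separately with something like Puppeteer) and reads the rendered HTML from its standard output via popen(). Everything here, from the script name to the command line, is illustrative, and error handling is kept minimal.

// Sketch: delegate JavaScript rendering to an external headless-browser script.
// "render.js" is a hypothetical Node.js script (e.g. using Puppeteer) that
// prints the fully rendered HTML of the given URL to stdout.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char *render_page(const char *url) {
    char command[1024];
    // Note: a real program should validate/escape the URL before building a shell command.
    snprintf(command, sizeof(command), "node render.js '%s'", url);

    // popen() runs the command and gives us a stream of its stdout.
    FILE *pipe = popen(command, "r");
    if (!pipe) {
        perror("popen");
        return NULL;
    }

    size_t capacity = 65536, length = 0;
    char *html = malloc(capacity);
    if (!html) { pclose(pipe); return NULL; }

    char buf[4096];
    size_t n;
    while ((n = fread(buf, 1, sizeof(buf), pipe)) > 0) {
        if (length + n + 1 > capacity) {
            capacity *= 2;
            html = realloc(html, capacity);  // realloc failure handling trimmed for brevity
        }
        memcpy(html + length, buf, n);
        length += n;
    }
    html[length] = '\0';

    pclose(pipe);
    return html;  // caller frees; could be passed straight to find_title()
}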
Ultimately, while C gives you unparalleled control over low-level networking and memory management, its ecosystem lacks the high-level abstractions needed for efficiently tackling modern, JavaScript-driven web scraping.
Building a web scraper in C, as we've done, is a great exercise. It teaches you about the fundamentals of HTTP requests with curl, low-level memory management, and how to parse HTML using powerful libraries like libxml2. The code works, and for fetching static HTML content, it's incredibly fast and efficient. You can compile this code in various environments, including on Windows using Visual Studio with a package manager like vcpkg to handle dependencies.
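If you go the Windows/vcpkg route, installing the two dependencies would look roughly like this (how they integrate with Visual Studio depends on your setup):
vcpkg install curl libxml2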
However, the "right tool for the job" principle is key in software development. For the vast majority of web scraping tasks, C is a hammer looking for a very specific nail.
As we saw, handling JavaScript rendering is a major roadblock. But even beyond that, real-world scraping involves challenges that C's ecosystem doesn't easily solve. Think about managing rate limits, handling complex session cookies, or integrating rotating proxies to avoid getting blocked. These features are often built in or a simple library install away in other languages.
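In C, you'd wire all of that up by hand. libcurl does expose the raw building blocks, as the rough illustration below shows, but the proxy URL and cookie file are placeholders, and real rotation, backoff, and block-detection logic would be entirely on you.

// Rough illustration: the raw libcurl knobs you'd combine by hand.
#include <curl/curl.h>
#include <unistd.h>  // for sleep()

void configure_handle(CURL *handle) {
    // Route the request through a proxy; to "rotate" proxies you'd pick a
    // different one for each request yourself.
    curl_easy_setopt(handle, CURLOPT_PROXY, "http://user:pass@proxy.example.com:8080");

    // Persist and replay cookies across requests to keep a session alive.
    curl_easy_setopt(handle, CURLOPT_COOKIEFILE, "cookies.txt");  // read cookies
    curl_easy_setopt(handle, CURLOPT_COOKIEJAR,  "cookies.txt");  // write cookies

    // Follow redirects and set sane timeouts.
    curl_easy_setopt(handle, CURLOPT_FOLLOWLOCATION, 1L);
    curl_easy_setopt(handle, CURLOPT_TIMEOUT, 30L);
}

// The crudest possible rate limiter: a fixed pause between requests.
void polite_delay(void) {
    sleep(2);
}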
So, while it's technically possible to scrape products and other data with C, you'll be reinventing the wheel at every step. For practical, end-to-end data extraction projects, you will almost always be more productive using a language like Python with its rich ecosystem of tools designed specifically for this purpose. Use C for what it excels at: performance-critical systems where every CPU cycle and byte of memory counts. For everything else, choose the path of least resistance.