Parsing HTML is a common task in web scraping and data extraction, for example when you need to pull specific elements such as <span> tags out of a web page. In this tutorial, we’ll explore how to parse HTML spans using C# and the HtmlAgilityPack library. Let’s dive in!

What is HtmlAgilityPack?

HtmlAgilityPack is a .NET library that facilitates parsing and manipulation of HTML documents. It offers a robust set of tools for navigating the HTML DOM (Document Object Model), making it ideal for web scraping and data extraction tasks.

Getting Started

Before we begin, install the HtmlAgilityPack package in your C# project. In Visual Studio, run the following in the NuGet Package Manager Console (with the .NET CLI, the equivalent is dotnet add package HtmlAgilityPack):

Install-Package HtmlAgilityPack

Parsing HTML Spans

Consider the following HTML content:

<!DOCTYPE html>
<html>
<head>
    <title>Sample Page</title>
</head>
<body>
    <div id="content">
        <span class="highlight">Hello</span>
        <span>World</span>
    </div>
</body>
</html>

Let’s extract the text inside the <span> tags.

Step 1: Load the HTML Document

Start by loading the HTML into an HtmlDocument. Here, htmlContent is a string containing the markup shown above:

using HtmlAgilityPack;

// Load HTML document
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(htmlContent);
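
The markup does not have to come from a string literal. As a minimal sketch, the helper below shows two ways to load a document: from a string with LoadHtml, and from a file with Load and a TextReader. The HtmlLoader class and its method names are hypothetical and exist only for this illustration.

```csharp
using System.IO;
using HtmlAgilityPack;

// Hypothetical helper for illustration; not part of HtmlAgilityPack.
public static class HtmlLoader
{
    // Parses markup that is already in memory.
    public static HtmlDocument FromString(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc;
    }

    // Parses markup read from a file on disk.
    public static HtmlDocument FromFile(string path)
    {
        var doc = new HtmlDocument();
        using (var reader = File.OpenText(path))
        {
            doc.Load(reader); // Load(TextReader) consumes the reader's content
        }
        return doc;
    }
}
```

Either method returns an HtmlDocument that can be queried in exactly the same way in the following steps.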

Step 2: Select the Span Elements

Next, select the <span> elements using XPath. The expression //span matches every <span> anywhere in the document:

var spanNodes = htmlDoc.DocumentNode.SelectNodes("//span");
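
The XPath expression can also be narrowed with a predicate. The sketch below selects only the spans whose class attribute is exactly a given value. SpanSelector and SelectByClass are hypothetical names used for illustration, and the exact-match predicate is a simplifying assumption: it will not match an element that carries several classes.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper for illustration; not part of HtmlAgilityPack.
public static class SpanSelector
{
    // Returns the text of every <span> whose class attribute equals className.
    // Assumes className contains no single quote, since it is placed
    // directly inside the XPath string.
    public static List<string> SelectByClass(string html, string className)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var nodes = doc.DocumentNode.SelectNodes("//span[@class='" + className + "']");

        var results = new List<string>();
        if (nodes == null)
        {
            // By default SelectNodes returns null, not an empty collection,
            // when nothing matches.
            return results;
        }

        foreach (var node in nodes)
        {
            results.Add(node.InnerText);
        }
        return results;
    }
}
```

For the sample page above, SpanSelector.SelectByClass(htmlContent, "highlight") returns a list containing only "Hello".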

Step 3: Extract Text from Spans

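Iterate through the selected <span> nodes and extract the text. SelectNodes returns null when nothing matches, so check for that before looping:

if (spanNodes != null)
{
    foreach (var spanNode in spanNodes)
    {
        // InnerText is the text content of the node, without tags
        string text = spanNode.InnerText;
        Console.WriteLine(text);
    }
}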
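
On real pages, span text often carries surrounding whitespace and HTML entities such as &amp;, which InnerText typically returns as written in the markup. A minimal sketch of cleaning the text, using HtmlEntity.DeEntitize from HtmlAgilityPack; the SpanText class and Extract method are hypothetical names for this example.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper for illustration; not part of HtmlAgilityPack.
public static class SpanText
{
    // Returns decoded, trimmed text for every non-empty <span>.
    public static List<string> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var results = new List<string>();
        var nodes = doc.DocumentNode.SelectNodes("//span");
        if (nodes == null)
        {
            return results; // no <span> elements in the document
        }

        foreach (var node in nodes)
        {
            // Decode entities (e.g. &amp; becomes &) and strip outer whitespace.
            string text = HtmlEntity.DeEntitize(node.InnerText).Trim();
            if (text.Length > 0)
            {
                results.Add(text);
            }
        }
        return results;
    }
}
```

Skipping empty results is a design choice for this sketch; keep them if empty spans are meaningful in your data.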

Putting It All Together

Here’s the complete code snippet:

using System;
using HtmlAgilityPack;

class Program
{
    static void Main(string[] args)
    {
        string htmlContent = "<!DOCTYPE html><html><head><title>Sample Page</title></head><body><div id=\"content\"><span class=\"highlight\">Hello</span><span>World</span></div></body></html>";

        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(htmlContent);

        var spanNodes = htmlDoc.DocumentNode.SelectNodes("//span");

        if (spanNodes != null)
        {
            foreach (var spanNode in spanNodes)
            {
                string text = spanNode.InnerText;
                Console.WriteLine(text);
            }
        }
    }
}
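
As an alternative to XPath, the same traversal can be written with LINQ. The sketch below uses Descendants, which yields an empty sequence rather than null when nothing matches, and GetAttributeValue to read each span's class with a fallback. SpanLister and Describe are hypothetical names for this example.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper for illustration; not part of HtmlAgilityPack.
public static class SpanLister
{
    // Returns one "class: text" line per <span>, in document order.
    public static string[] Describe(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
            .Descendants("span")
            .Select(span =>
            {
                // The second argument is the default used when the attribute is absent.
                string cssClass = span.GetAttributeValue("class", "(none)");
                return cssClass + ": " + span.InnerText;
            })
            .ToArray();
    }
}
```

For the sample page, SpanLister.Describe(htmlContent) returns "highlight: Hello" and "(none): World".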

Conclusion

Parsing HTML spans with C# and HtmlAgilityPack takes only a few lines: load the document, select the nodes with XPath, and read each node’s InnerText. Whether you’re scraping data from web pages or performing more complex HTML manipulation tasks, the same load-and-select pattern applies.

Now that you’ve learned how to parse HTML spans, you can apply these techniques to extract other elements and create robust web scraping applications. Happy coding!
