Parsing HTML Span Using C# and HtmlAgilityPack: A Step-by-Step Guide
HTML parsing is essential in web development, especially when you need to extract specific elements like <span> tags from a web page. In this tutorial, we’ll explore how to parse HTML spans using C# and the HtmlAgilityPack library. Let’s dive in!
What is HtmlAgilityPack?
HtmlAgilityPack is a .NET library that facilitates parsing and manipulation of HTML documents. It offers a robust set of tools for navigating the HTML DOM (Document Object Model), making it ideal for web scraping and data extraction tasks.
Getting Started
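Before we begin, make sure the HtmlAgilityPack library is installed in your C# project. You can install it from the NuGet Package Manager Console in Visual Studio (with the .NET CLI, the equivalent command is dotnet add package HtmlAgilityPack):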
Install-Package HtmlAgilityPack
Parsing HTML Spans
Consider the following HTML content:
<!DOCTYPE html>
<html>
<head>
<title>Sample Page</title>
</head>
<body>
<div id="content">
<span class="highlight">Hello</span>
<span>World</span>
</div>
</body>
</html>
Let’s extract the text inside the <span> tags.
Step 1: Load the HTML Document
Start by loading the HTML document using HtmlAgilityPack. Here, htmlContent is a string that holds the markup shown above:
using HtmlAgilityPack;
// Load HTML document
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(htmlContent);
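LoadHtml expects the markup as a string. If the HTML lives in a file instead, HtmlDocument.Load can read it from a path. The following is a minimal sketch under that assumption; the LoadFromFileExample class, the LoadFromFile helper, and the sample-page.html file name are hypothetical and exist only so the example runs on its own.

```csharp
using System;
using System.IO;
using HtmlAgilityPack;

class LoadFromFileExample
{
    // Hypothetical helper: parses the HTML file at the given path.
    public static HtmlDocument LoadFromFile(string path)
    {
        var htmlDoc = new HtmlDocument();
        htmlDoc.Load(path);
        return htmlDoc;
    }

    static void Main()
    {
        // Write a small sample file so the sketch is self-contained.
        string path = Path.Combine(Path.GetTempPath(), "sample-page.html");
        File.WriteAllText(path, "<html><body><span>World</span></body></html>");

        HtmlDocument htmlDoc = LoadFromFile(path);

        // Print the parsed markup to confirm the document was loaded.
        Console.WriteLine(htmlDoc.DocumentNode.OuterHtml);
    }
}
```

For pages served over HTTP, the library also provides an HtmlWeb class whose Load method takes a URL and returns an HtmlDocument.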
Step 2: Select the Span Elements
Next, select the <span> elements using XPath:
var spanNodes = htmlDoc.DocumentNode.SelectNodes("//span");
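The expression //span matches every span in the document. XPath predicates can narrow the selection, for example by attribute value or by parent element. The sketch below illustrates this on the sample markup; the XPathExamples class and the SelectTexts helper are hypothetical names introduced for the example.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

class XPathExamples
{
    // Hypothetical helper: returns the InnerText of every node matching the XPath.
    public static List<string> SelectTexts(string html, string xpath)
    {
        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        var texts = new List<string>();
        var nodes = htmlDoc.DocumentNode.SelectNodes(xpath);

        // By default, SelectNodes returns null (not an empty collection) when nothing matches.
        if (nodes == null)
        {
            return texts;
        }

        foreach (var node in nodes)
        {
            texts.Add(node.InnerText);
        }
        return texts;
    }

    static void Main()
    {
        string html = "<div id=\"content\"><span class=\"highlight\">Hello</span><span>World</span></div>";

        // Spans whose class attribute is exactly "highlight".
        Console.WriteLine(string.Join(", ", SelectTexts(html, "//span[@class='highlight']"))); // Hello

        // Spans that are direct children of the div with id "content".
        Console.WriteLine(string.Join(", ", SelectTexts(html, "//div[@id='content']/span"))); // Hello, World

        // An expression with no matches yields an empty list here.
        Console.WriteLine(SelectTexts(html, "//span[@class='missing']").Count); // 0
    }
}
```

When only the first match is needed, SelectSingleNode takes the same XPath syntax and returns a single node, or null if there is no match.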
Step 3: Extract Text from Spans
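Iterate through the selected <span> nodes and extract the text. SelectNodes returns null when nothing matches, so check for that before looping:
// SelectNodes returns null if the document contains no matching elements
if (spanNodes != null)
{
    foreach (var spanNode in spanNodes)
    {
        string text = spanNode.InnerText;
        Console.WriteLine(text);
    }
}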
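InnerText returns the text as it appears in the markup, so HTML entities such as &amp; are not decoded and surrounding whitespace is kept. The sketch below shows one way to clean the text with HtmlEntity.DeEntitize and to read an attribute with GetAttributeValue. The SpanDetails class, the CleanText helper, and the sample markup are assumptions made for illustration.

```csharp
using System;
using HtmlAgilityPack;

class SpanDetails
{
    // Hypothetical helper: decodes HTML entities and trims surrounding whitespace.
    public static string CleanText(HtmlNode node)
    {
        return HtmlEntity.DeEntitize(node.InnerText).Trim();
    }

    static void Main()
    {
        // Sample markup with an entity and extra whitespace (made up for this example).
        string html = "<div><span class=\"highlight\"> Tom &amp; Jerry </span><span>World</span></div>";

        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        var spanNodes = htmlDoc.DocumentNode.SelectNodes("//span");
        if (spanNodes == null)
        {
            return; // no spans found
        }

        foreach (var spanNode in spanNodes)
        {
            // The second argument is the fallback used when the attribute is absent.
            string cssClass = spanNode.GetAttributeValue("class", "(none)");
            Console.WriteLine($"{cssClass}: {CleanText(spanNode)}");
        }
        // Prints:
        // highlight: Tom & Jerry
        // (none): World
    }
}
```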
Putting It All Together
Here’s the complete code snippet:
using System;
using HtmlAgilityPack;

class Program
{
    static void Main(string[] args)
    {
        string htmlContent = "<!DOCTYPE html><html><head><title>Sample Page</title></head><body><div id=\"content\"><span class=\"highlight\">Hello</span><span>World</span></div></body></html>";

        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(htmlContent);

        var spanNodes = htmlDoc.DocumentNode.SelectNodes("//span");
        if (spanNodes != null)
        {
            foreach (var spanNode in spanNodes)
            {
                string text = spanNode.InnerText;
                Console.WriteLine(text);
            }
        }
    }
}
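As an alternative to XPath, the same result can be sketched with LINQ. Descendants("span") returns an enumerable that is simply empty when the document has no spans, so no null check is needed. The DescendantsExample class and the SpanTexts helper are hypothetical names used for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

class DescendantsExample
{
    // Hypothetical helper: collects the text of every <span> using LINQ instead of XPath.
    public static List<string> SpanTexts(string html)
    {
        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        return htmlDoc.DocumentNode
            .Descendants("span")            // empty sequence when nothing matches
            .Select(span => span.InnerText)
            .ToList();
    }

    static void Main()
    {
        string html = "<div id=\"content\"><span class=\"highlight\">Hello</span><span>World</span></div>";

        foreach (string text in SpanTexts(html))
        {
            Console.WriteLine(text);
        }
        // Prints:
        // Hello
        // World
    }
}
```

XPath tends to be more compact for structural queries, while LINQ can be easier to combine with ordinary C# filtering; which one to use is largely a matter of preference.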
Conclusion
Parsing HTML spans with C# and HtmlAgilityPack takes only a few steps: load the document, select the nodes with XPath, and read each node’s InnerText. The same pattern applies whether you’re scraping data from web pages or performing more complex HTML manipulation tasks.
Now that you’ve learned how to parse HTML spans, you can apply these techniques to extract other elements and create robust web scraping applications. Happy coding!