sazr

Reputation: 25928

Easiest way to extract metatags from downloaded HTML file

I need to parse a webpage for two meta tag values, and I'm unsure of the most efficient way to parse the page's HTML for the meta tag data.

Can I convert the webpage's HTML string to XML and then parse it for tags of type meta?

WebClient wc = new WebClient();
wc.Headers.Set("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.19) Gecko/2010031422 Firefox/3.0.19 ( .NET CLR 3.5.30729; .NET4.0E)");
string html  = wc.DownloadString(String.Format("http://www.geobytes.com/IpLocator.htm?GetLocation&template=php3.txt&IpAddress={0}", ip));
XmlDocument xdoc = new XmlDocument();
xdoc.LoadXml(html);   // ERROR HERE: "The 'meta' start tag on line 23 position 2 does not match the end tag of 'head'. Line 26, position 3"
XmlNodeList interNode = xdoc.DocumentElement.SelectNodes("//meta");

I am unfamiliar with the C# libraries. Is there a better alternative that would make it easier to obtain all the meta tags from the returned HTML?

Also I am getting an error when I attempt to parse the html:

The 'meta' start tag on line 23 position 2 does not match the end tag of 'head'. Line 26, position 3

Upvotes: 1

Views: 7816

Answers (2)

Pat

Reputation: 5282

I'd recommend HTML Agility Pack. It handles malformed HTML well, while giving you the power of XPath to isolate nodes/values.

Your selection would be similar to the following (using .NET 4.0), where doc is an HtmlAgilityPack.HtmlDocument that has already loaded the HTML:

var nodes = doc.DocumentNode.SelectNodes("//meta");
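
For context, a fuller sketch of that selection might look like the following. It assumes the HtmlAgilityPack NuGet package is referenced and that the downloaded page is passed in as a string. MetaTagReader and ReadMetaTags are illustrative names of my own, not part of the library.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class MetaTagReader
{
    // Hypothetical helper: returns name -> content for every <meta> tag
    // that carries both attributes.
    public static Dictionary<string, string> ReadMetaTags(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // an HTML parser, so unclosed <meta> tags are not an error

        var result = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//meta[@name and @content]");
        if (nodes == null)
            return result;

        foreach (HtmlNode node in nodes)
        {
            string name = node.GetAttributeValue("name", "");
            string content = node.GetAttributeValue("content", "");
            result[name] = content; // a repeated name overwrites the earlier value
        }
        return result;
    }
}
```

Calling MetaTagReader.ReadMetaTags(html) on the string from wc.DownloadString should then give a dictionary keyed by meta name. The null check matters because a page without matching tags would otherwise cause a NullReferenceException in the loop.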

Upvotes: 4

Ry-

Reputation: 224903

You could use an HTML parser instead of an XML parser, manipulate the string before parsing it as XML, or just use regular expressions, which are appropriate for this kind of situation. So, assuming System.Text.RegularExpressions and System.Collections.Generic are imported:

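// In a verbatim (@) string a double quote is escaped by doubling it, not with a backslash.
Regex metaTag = new Regex(@"<meta name=""(.+?)"" content=""(.+?)"">");
Dictionary<string, string> metaInformation = new Dictionary<string, string>();

foreach(Match m in metaTag.Matches(html)) {
    // Group 1 is the name attribute, group 2 the content attribute.
    // The indexer overwrites a repeated name; Add would throw on a duplicate key.
    metaInformation[m.Groups[1].Value] = m.Groups[2].Value;
}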

Now, you can just access any metadata as metaInformation["meta name"].
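
The pattern above only matches double-quoted attributes written in exactly that order and case. If the page's markup varies a little, a slightly more tolerant variant could look like the sketch below. It assumes name still comes directly before content, that no other attributes sit between them, and that the values contain no quote characters. MetaTagRegex and Extract are illustrative names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MetaTagRegex
{
    // Accepts single or double quotes (the closing quote must match the opening
    // one via \k<q1> / \k<q2>), any letter case, flexible whitespace, and an
    // optional "/" before the closing ">".
    private static readonly Regex MetaTag = new Regex(
        @"<meta\s+name\s*=\s*(?<q1>[""'])(?<name>[^""']*)\k<q1>\s+content\s*=\s*(?<q2>[""'])(?<content>[^""']*)\k<q2>\s*/?>",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns name -> content for each matching tag.
    public static Dictionary<string, string> Extract(string html)
    {
        var result = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (Match m in MetaTag.Matches(html))
        {
            // The indexer overwrites a repeated name instead of throwing.
            result[m.Groups["name"].Value] = m.Groups["content"].Value;
        }
        return result;
    }
}
```

MetaTagRegex.Extract(html) then returns the same kind of dictionary as the loop above. Anything looser than this (reordered attributes, extra attributes, embedded quotes) is probably a sign that the HTML parser route is the better fit.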

Upvotes: 1
