Reputation: 63
I am new to C#, so the fix may be obvious or it may be beyond me, but I am trying to scrape a web page using the HtmlAgilityPack. My code compiles, but when I write the string to a file I only get one result, and it is always the last li in the ul. The string split is there so that I can eventually output the title and description strings to a .csv file for further use. I am unsure what to do next, so any help, ideas, or suggestions would be appreciated. Thank you!
    private void button1_Click(object sender, EventArgs e)
    {
        List<string> cities = new List<string>();
        //var xpath = "//h2[span/@id='Cities']";
        var xpath = "//h2[span/@id='Cities']" + "/following-sibling::ul[1]" + "/li";
        WebClient web = new WebClient();
        String html = web.DownloadString("http://wikitravel.org/en/Vietnam");
        hap.HtmlDocument doc = new hap.HtmlDocument();
        doc.LoadHtml(html);
        foreach (hap.HtmlNode node in doc.DocumentNode.SelectNodes(xpath))
        {
            string all = node.InnerText;
            //splits text between '—', '-' or ' ' into 2 parts
            string[] split = all.Split(new char[] { '—', ' ', '-' }, StringSplitOptions.None);
            string title;
            string description;
            int nodeCount;
            nodeCount = node.ChildNodes.Count;
            if (nodeCount == 2)
            {
                title = node.ChildNodes[0].InnerText;
                description = node.ChildNodes[1].InnerText;
            }
            else if (nodeCount == 4)
            {
                title = node.ChildNodes[0].InnerText;
                description = node.ChildNodes[1].InnerText + node.ChildNodes[2].InnerText;
            }
            else
            {
                title = "Error";
                description = "The node cound was not 2 or 3. Check the div section.";
            }
            System.IO.StreamWriter write = new System.IO.StreamWriter(@"C:\Users\cbrannin\Desktop\textTest\testText.txt");
            write.WriteLine(all);
            write.Close();
        }
    }
}
Upvotes: 0
Views: 1738
Reputation: 134125
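One problem is that you're overwriting the output file each time through the loop: every new StreamWriter on the same path truncates the file, so only the last line survives. You probably want to open the writer once, outside the loop: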
    using (StreamWriter write = new StreamWriter(@"filename"))
    {
        foreach (hap.HtmlNode node in doc.DocumentNode.SelectNodes(xpath))
        {
            // do your thing
            write.WriteLine(all);
        }
    }
Also, have you single-stepped through this in the debugger to see whether you're getting more than one HtmlNode from your SelectNodes call?
Finally, I don't see where you're doing anything with title or description. Were you planning to use those for something else?
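If the goal is a .csv with title and description columns, something along these lines might be a starting point. It is only a sketch and has not been run against the actual page: the strings in Main are made-up stand-ins for node.InnerText, the names CityCsv, SplitEntry, Quote, Write and cities.csv are hypothetical, and it assumes each entry separates title from description with an em dash or en dash. It splits on the first dash only, so hyphens inside the description are left alone, and it opens the writer once, outside the loop.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class CityCsv
{
    // Split "Title — description" on the FIRST em/en dash only, so any
    // later dashes or hyphens stay inside the description.
    public static (string Title, string Description) SplitEntry(string text)
    {
        string[] parts = text.Split(new[] { '—', '–' }, 2);
        string title = parts[0].Trim();
        string description = parts.Length > 1 ? parts[1].Trim() : "";
        return (title, description);
    }

    // Wrap a field in quotes and double any embedded quotes, so commas
    // and quotes in the text do not break the CSV columns.
    public static string Quote(string field)
    {
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }

    // Write one CSV row per entry to a writer the caller already opened.
    public static void Write(IEnumerable<string> entries, TextWriter writer)
    {
        foreach (string entry in entries)
        {
            var (title, description) = SplitEntry(entry);
            writer.WriteLine(Quote(title) + "," + Quote(description));
        }
    }

    static void Main()
    {
        // Hypothetical sample data standing in for node.InnerText values.
        var entries = new List<string>
        {
            "City A — first sample description",
            "City B — second description, with a comma and a hyphen - here"
        };

        // The writer is created once, so rows accumulate instead of
        // each iteration overwriting the file.
        using (var writer = new StreamWriter("cities.csv"))
        {
            Write(entries, writer);
        }
    }
}
```

In your handler you would pass the InnerText of each li in place of the sample list. Taking a TextWriter rather than a file path also means you can hand it a StringWriter while debugging and inspect the output without touching the disk.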
Upvotes: 2