Reputation: 83
I have this code in my main function and I want to parse only the first row of the table (e.g., Nov 7, 2017 73.78 74.00 72.32 72.71 17,245,947).
I created a node that selects only the first row, but when I start debugging, the node value is null. How can I parse this data and store it, for example, in a string or in separate variables? Is there a way?
WebClient web = new WebClient();
string page = web.DownloadString("https://finance.google.com/finance/historical?q=NYSE:C&ei=7O4nV9GdJcHomAG02L_wCw");
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(page);
var node = doc.DocumentNode.SelectSingleNode("//*[@id=\"prices\"]/table/tbody/tr[2]");
// Named "rows" because a second local called "node" would not compile; needs "using System.Linq;".
List<List<string>> rows = doc.DocumentNode.SelectSingleNode("//*[@id=\"prices\"]/table")
    .Descendants("tr")
    .Skip(1)
    .Where(tr => tr.Elements("td").Count() > 1)
    .Select(tr => tr.Elements("td").Select(td => td.InnerText.Trim()).ToList())
    .ToList();
Upvotes: 1
Views: 589
Reputation: 3269
It seems that your XPath selection string has an error. Since tbody is a generated node, it should not be included in the path:
//*[@id="prices"]/table/tr[2]
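The effect of the extra tbody step can be reproduced without the live page. The sketch below is only an illustration: it uses the standard library's XDocument and XPathSelectElement as a stand-in for HtmlAgilityPack, and the Sample markup is a hypothetical, well-formed miniature of the table (the real page is not well-formed). It assumes the downloaded markup contains no tbody element.

```csharp
using System;
using System.Xml.Linq;
using System.Xml.XPath;

static class TbodyDemo
{
    // Hypothetical stand-in for the downloaded markup: the table has no <tbody>.
    public const string Sample =
        "<div id=\"prices\"><table>" +
        "<tr><th>Date</th><th>Open</th></tr>" +
        "<tr><td>Nov 7, 2017</td><td>73.78</td></tr>" +
        "</table></div>";

    // Runs an XPath query against the sample and returns the first match, or null.
    public static XElement Select(string xpath) =>
        XDocument.Parse(Sample).XPathSelectElement(xpath);

    static void Main()
    {
        // The path with tbody matches nothing, because tbody is not in the markup.
        var withTbody = Select("//*[@id='prices']/table/tbody/tr[2]");
        Console.WriteLine(withTbody == null ? "with tbody: null" : "with tbody: found");

        // Without tbody, the second row is found and its first cell can be read.
        var withoutTbody = Select("//*[@id='prices']/table/tr[2]");
        Console.WriteLine("without tbody: " + withoutTbody?.Element("td")?.Value);
    }
}
```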
While the corrected XPath should read the value, HtmlAgilityPack hits another problem: malformed HTML. None of the <tr> and <td> nodes in the parsed text have corresponding </tr> or </td> closing tags, and HtmlAgilityPack fails to select values from a table with malformed rows. Therefore, the first step is to select the whole table:
//*[@id="prices"]/table
The next step is either to sanitize the HTML by adding the missing </tr> and </td> closing tags and parse the corrected table again, or to parse the extracted string by hand: take lines 10 to 15 of the table string and split each of them on the > character. Raw parsing is shown below. Code is tested and working.
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;

namespace GoogleFinanceDataScraper
{
    class Program
    {
        static void Main(string[] args)
        {
            WebClient web = new WebClient();
            string page = web.DownloadString("https://finance.google.com/finance/historical?q=NYSE:C&ei=7O4nV9GdJcHomAG02L_wCw");
            HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
            doc.LoadHtml(page);

            // Select the whole table; selecting single rows fails on the unclosed <tr>/<td> tags.
            var node = doc.DocumentNode.SelectSingleNode("//div[@id='prices']/table");
            string outerHtml = node.OuterHtml;
            List<string> data = new List<string>();
            using (StringReader reader = new StringReader(outerHtml))
            {
                for (int i = 0; ; i++)
                {
                    var line = reader.ReadLine();
                    if (line == null) break;   // fewer lines than expected: stop instead of throwing
                    if (i < 9) continue;       // skip lines 1 to 9 of the table string
                    else if (i < 15)
                    {
                        // Lines 10 to 15 hold the first row; the cell value is the text after '>'.
                        var dataRawArray = line.Split(new char[] { '>' });
                        var value = dataRawArray[1];
                        data.Add(value);
                    }
                    else break;
                }
            }
            // Prints the collected cells, comma-separated.
            Console.WriteLine(string.Join(", ", data));
        }
    }
}
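Since the question also asks about storing the values in separate variables, the six strings collected above could be converted into typed values. This is a minimal sketch rather than part of the tested code: RowParser and its Parse method are hypothetical names, the column order (date, open, high, low, close, volume) is an assumption based on the sample row in the question, and the tuple syntax needs C# 7 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class RowParser
{
    // Hypothetical helper: turns six cell strings into typed values.
    // Assumed column order: date, open, high, low, close, volume.
    public static (DateTime Date, decimal Open, decimal High, decimal Low, decimal Close, long Volume)
        Parse(IReadOnlyList<string> cells)
    {
        // Invariant culture: "." as decimal separator, "," as thousands separator.
        var inv = CultureInfo.InvariantCulture;
        return (
            DateTime.ParseExact(cells[0].Trim(), "MMM d, yyyy", inv),
            decimal.Parse(cells[1].Trim(), NumberStyles.Number, inv),
            decimal.Parse(cells[2].Trim(), NumberStyles.Number, inv),
            decimal.Parse(cells[3].Trim(), NumberStyles.Number, inv),
            decimal.Parse(cells[4].Trim(), NumberStyles.Number, inv),
            long.Parse(cells[5].Trim(), NumberStyles.AllowThousands, inv));
    }

    static void Main()
    {
        // The sample row from the question, as it would sit in the data list.
        var data = new List<string> { "Nov 7, 2017", "73.78", "74.00", "72.32", "72.71", "17,245,947" };
        var row = Parse(data);
        Console.WriteLine($"{row.Date:yyyy-MM-dd} close={row.Close} volume={row.Volume}");
    }
}
```

Parsing with the invariant culture keeps the result independent of the machine's regional settings, which matters because the page uses US-style dates and number separators.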
Upvotes: 1