Reputation: 477
I would like to get the data from this website and put them into a dictionary.
Basically these are prices and quantities for some financial instruments.
I have this source code for the page (here is just an extract of the whole text):
<tr>
<td class="quotesMaxTime1414148558" id="notation115602071"><span>4,000.00</span></td>
<td><span>0</span></td>
<td class="icon red"><span id="domhandler:8.consumer:VALUE-2CCLASS.comp:PREV.gt:green.eq:ZERO.lt:red.resetLt:.resetGt:.resetEq:ZERO.mdgObj:prices-2Fquote-3FVERSION-3D2-26CODE_SELECTOR_PREVIOUS_LAST-3DLATEST-26ID_TYPE_PERFORMANCE-3D7-26ID_TYPE_PRICE-3D1-26ID_QUALITY_PRICE-3D5-26ID_NOTATION-3D115602071.attr:PERFORMANCE_PCT.wtkm:options_options_snapshot_1">-3.87%</span></td>
<td><span id="domhandler:9.consumer:VALUE-2CCLASS.comp:PREV.gt:green.eq:ZERO.lt:red.resetLt:.resetGt:.resetEq:ZERO.mdgObj:prices-2Fquote-3FVERSION-3D2-26CODE_SELECTOR_PREVIOUS_LAST-3DLATEST-26ID_TYPE_PERFORMANCE-3D7-26ID_TYPE_PRICE-3D1-26ID_QUALITY_PRICE-3D5-26ID_NOTATION-3D115602071.attr:PRICE.wtkm:options_options_snapshot_1">960.40</span></td>
</tr>
Now I would like to extract the following information:
I have tried to use the following to extract the first information (the value 4000):
    string url = "http://www.eurexchange.com/action/exchange-en/4744-19066/19068/quotesSingleViewOption.do?callPut=Put&maturityDate=201411";
    var webGet = new HtmlWeb();
    var document = webGet.Load(url);
    var firstData = from x in document.DocumentNode.Descendants()
                    where x.Name == "td" && x.Attributes.Contains("class")
                    select x.InnerText;
but firstData doesn't contain the value I want (4000). Instead it shows this:
System.Linq.Enumerable+WhereSelectEnumerableIterator`2[HtmlAgilityPack.HtmlNode,System.String]
How can I get this data? I would also need to repeat this task several times, because the page has more than one row containing similar information. Is HTML Agility Pack useful in this context? Thanks.
Upvotes: 0
Views: 1826
Reputation: 4379
This was thrown together quickly and could probably be cleaned up a lot, but it returns all of the values you are looking for from the Prices/Quotes table on that page. Hope it helps.
    // Needs: using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
    var url = "http://www.eurexchange.com/action/exchange-en/4744-19066/19068/quotesSingleViewOption.do?callPut=Put&maturityDate=201411";
    var webGet = new HtmlWeb();
    var document = webGet.Load(url);

    // Find the "toggleTitle" block headed "Prices/Quotes", then the first "dataTable" inside it.
    var pricesAndQuotesDataTable =
        (from elem in
             document.DocumentNode.Descendants()
                 .Where(
                     d =>
                         d.Attributes["class"] != null && d.Attributes["class"].Value == "toggleTitle" &&
                         d.ChildNodes.Any(h => h.InnerText != null && h.InnerText == "Prices/Quotes"))
         select
             elem.Descendants()
                 .FirstOrDefault(
                     d => d.Attributes["class"] != null && d.Attributes["class"].Value == "dataTable")).FirstOrDefault();

    if (pricesAndQuotesDataTable != null)
    {
        // Only rows that sit directly in <tbody> carry data.
        var dataRows = from elem in pricesAndQuotesDataTable.Descendants()
                       where elem.Name == "tr" && elem.ParentNode.Name == "tbody"
                       select elem;

        var dataPoints = new List<object>();
        foreach (var row in dataRows)
        {
            var dataColumns = (from col in row.ChildNodes.Where(n => n.Name == "td")
                               select col).ToList();

            // The column positions (0, 9, 10) depend on the page's table layout.
            dataPoints.Add(
                new
                {
                    StrikePrice = dataColumns[0].InnerText,
                    DifferenceToPreviousDay = dataColumns[9].InnerText,
                    LastPrice = dataColumns[10].InnerText
                });
        }
    }
Upvotes: 1
Reputation: 1339
We did a similar project a few years back to spider all the major online betting websites and build a comparison tool that finds the best prices for each type of event, e.g. listing all the major bookmakers' odds for a particular football game in order of best return.
It turned out to be a complete nightmare: the rendered HTML of the websites kept changing almost daily and was often poorly formed, which could sometimes crash the spider daemon, so we had to maintain the system constantly to keep it working properly.
With these sorts of things it's often more economical to subscribe to a data feed, which requires much less maintenance and is easier to integrate.
Upvotes: 1
Reputation: 5357
You could use the HTML Agility Pack. Unlike XmlDocument or XDocument, the HTML Agility Pack is tolerant of malformed HTML (which exists all over the internet and probably on the site you are trying to parse).
Not all HTML pages can be assumed to be valid XML.
With the HTML Agility Pack you can load your page and query it with XPath or with an object model similar to System.Xml.
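As a rough illustration of the XPath route, here is a minimal sketch that parses an inline copy of the row from the question rather than the live page. It assumes the HtmlAgilityPack NuGet package is referenced; the class name QuoteTableSketch and the method ParseStrikes are made up for this example, and treating every td whose id starts with "notation" as a strike-price cell is a guess based only on the extract in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using HtmlAgilityPack;

public static class QuoteTableSketch
{
    // Hypothetical helper: maps the id of each "notation..." cell to the number in its <span>.
    public static Dictionary<string, decimal> ParseStrikes(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new Dictionary<string, decimal>();

        // SelectNodes returns null, not an empty collection, when nothing matches.
        var cells = doc.DocumentNode.SelectNodes("//td[starts-with(@id, 'notation')]");
        if (cells == null)
            return result;

        foreach (var cell in cells)
        {
            decimal value;
            var text = cell.InnerText.Trim();

            // The page writes numbers as "4,000.00", so parse with the invariant culture.
            if (decimal.TryParse(text, NumberStyles.Number, CultureInfo.InvariantCulture, out value))
                result[cell.GetAttributeValue("id", "")] = value;
        }

        return result;
    }

    public static void Main()
    {
        // Shortened copy of the row shown in the question.
        const string html =
            "<table><tr>" +
            "<td class=\"quotesMaxTime1414148558\" id=\"notation115602071\"><span>4,000.00</span></td>" +
            "<td><span>0</span></td>" +
            "</tr></table>";

        foreach (var pair in ParseStrikes(html))
            Console.WriteLine(pair.Key + " = " + pair.Value);
    }
}
```

For the live page, the same method could be fed the markup loaded through HtmlWeb, as in the question; the XPath expression would probably need adjusting once the full table structure is known.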
Alternatively, since the website you linked offers a PDF export of the same data, you could use a PDF-to-text converter and parse the resulting text file with much better accuracy.
Upvotes: 1
Reputation: 1802
If you are open to using CsQuery, try this:
    static void Main()
    {
        CsQuery.CQ cq = CsQuery.CQ.CreateFromUrl("http://www.eurexchange.com/action/exchange-en/4744-19066/19068/quotesSingleViewOption.do?callPut=Put&maturityDate=201411");
        string str = cq["#notation115602071 span"].Text();
    }
Upvotes: 1
Reputation: 25370
That's because your LINQ query hasn't executed yet. LINQ uses deferred execution, so firstData holds the query itself, not its results. If you expand the Results View in the debugger, the query runs and you'll see all the items, the first being the value you are looking for.
So, this will get you 4,000.00:
    var firstData = (from x in document.DocumentNode.Descendants()
                     where x.Name == "td" && x.Attributes.Contains("class")
                     select x.InnerText).First();
If you want them all, call ToList() instead of First().
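To see deferred execution in isolation, here is a small sketch that needs no HTML at all. The list of strings is made-up sample data standing in for the cells' InnerText values, and DeferredExecutionDemo and BuildQuery are names invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeferredExecutionDemo
{
    // Builds a query over the cell texts; nothing is filtered until it is enumerated.
    public static IEnumerable<string> BuildQuery(List<string> cells)
    {
        return from c in cells
               where c.Contains(".")
               select c;
    }

    public static void Main()
    {
        var cells = new List<string> { "4,000.00", "0", "-3.87%" };
        IEnumerable<string> query = BuildQuery(cells);

        // Prints the iterator's type name, much like the output in the question.
        Console.WriteLine(query);

        // Added after the query was built, yet still picked up, because the query has not run yet.
        cells.Add("960.40");

        string first = query.First();        // runs the query and stops at the first match
        List<string> all = query.ToList();   // runs it again and copies every match

        Console.WriteLine(first);                     // 4,000.00
        Console.WriteLine(string.Join(" | ", all));   // 4,000.00 | -3.87% | 960.40
    }
}
```

Calling ToList() once and reusing the resulting list avoids walking the document again each time the values are needed.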
Upvotes: 1