c# http web-scraping screen-scraping html-agility-pack

Reputation: 477

C# to get data from a website

I would like to get the data from this website and put it into a dictionary.

Basically these are prices and quantities for some financial instruments.

I have this source code for the page (here is just an extract of the whole text):

<tr>
   <td class="quotesMaxTime1414148558" id="notation115602071"><span>4,000.00</span></td>
   <td><span>0</span></td>
   <td class="icon red"><span id="domhandler:8.consumer:VALUE-2CCLASS.comp:PREV.gt:green.eq:ZERO.lt:red.resetLt:.resetGt:.resetEq:ZERO.mdgObj:prices-2Fquote-3FVERSION-3D2-26CODE_SELECTOR_PREVIOUS_LAST-3DLATEST-26ID_TYPE_PERFORMANCE-3D7-26ID_TYPE_PRICE-3D1-26ID_QUALITY_PRICE-3D5-26ID_NOTATION-3D115602071.attr:PERFORMANCE_PCT.wtkm:options_options_snapshot_1">-3.87%</span></td>
   <td><span id="domhandler:9.consumer:VALUE-2CCLASS.comp:PREV.gt:green.eq:ZERO.lt:red.resetLt:.resetGt:.resetEq:ZERO.mdgObj:prices-2Fquote-3FVERSION-3D2-26CODE_SELECTOR_PREVIOUS_LAST-3DLATEST-26ID_TYPE_PERFORMANCE-3D7-26ID_TYPE_PRICE-3D1-26ID_QUALITY_PRICE-3D5-26ID_NOTATION-3D115602071.attr:PRICE.wtkm:options_options_snapshot_1">960.40</span></td>       
</tr>

Now I would like to extract the following information:

The value "4000" from the second line;
The value "-3.87%" from the fourth line;
The value "960.40" from the fifth line.

I have tried to use the following to extract the first information (the value 4000):

        string url = "http://www.eurexchange.com/action/exchange-en/4744-19066/19068/quotesSingleViewOption.do?callPut=Put&maturityDate=201411";

        var webGet = new HtmlWeb();
        var document = webGet.Load(url);

        var firstData = from x in document.DocumentNode.Descendants()
                     where x.Name == "td" && x.Attributes.Contains("class")
                     select x.InnerText;

but firstData doesn't contain the value I want (4000). Instead it contains this:

System.Linq.Enumerable+WhereSelectEnumerableIterator`2[HtmlAgilityPack.HtmlNode,System.String]

How can I get this data? I also need to repeat this task several times, because the page has more than one row containing similar information. Is HTML Agility Pack useful in this context? Thanks.

Upvotes: 0

Answers (5)

mreyeros

Reputation: 4379

This is somewhat ugly, since it was thrown together quickly and could probably be cleaned up considerably, but it returns all of the values you are looking for from the Prices/Quotes table on that page. Hope it helps.

        var url = "http://www.eurexchange.com/action/exchange-en/4744-19066/19068/quotesSingleViewOption.do?callPut=Put&maturityDate=201411";

        var webGet = new HtmlWeb();
        var document = webGet.Load(url);


        var pricesAndQuotesDataTable =
            (from elem in
                document.DocumentNode.Descendants()
                    .Where(
                        d =>
                            d.Attributes["class"] != null && d.Attributes["class"].Value == "toggleTitle" &&
                            d.ChildNodes.Any(h => h.InnerText != null && h.InnerText == "Prices/Quotes"))
                select
                    elem.Descendants()
                        .FirstOrDefault(
                            d => d.Attributes["class"] != null && d.Attributes["class"].Value == "dataTable")).FirstOrDefault();
        if (pricesAndQuotesDataTable != null)
        {
            var dataRows = from elem in pricesAndQuotesDataTable.Descendants()
                where elem.Name == "tr" && elem.ParentNode.Name == "tbody"
                select elem;

            // Collect one anonymous object per table row.
            var dataPoints = new List<object>();
            foreach (var row in dataRows)
            {
                var dataColumns = row.ChildNodes.Where(n => n.Name == "td").ToList();

                // Skip rows that do not have all eleven columns, to avoid an out-of-range index.
                if (dataColumns.Count < 11)
                    continue;

                dataPoints.Add(
                    new
                    {
                        StrikePrice = dataColumns[0].InnerText,
                        DifferenceToPreviousDay = dataColumns[9].InnerText,
                        LastPrice = dataColumns[10].InnerText
                    });
            }
        }
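
Since the question asks for a dictionary, the extracted strings could be converted along these lines. This is only a sketch: the `Quote` class, the `QuoteParser` helper, and the choice of the strike price as the key are assumptions made for illustration, and the number format is assumed to match the "4,000.00" / "-3.87%" / "960.40" samples from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical container for one table row; the name and fields are illustrative.
public sealed class Quote
{
    public decimal ChangePercent { get; set; }
    public decimal LastPrice { get; set; }
}

public static class QuoteParser
{
    // Parses text such as "4,000.00" or "-3.87%" using invariant (English-style) formatting.
    public static decimal ParseNumber(string text)
    {
        var cleaned = text.Trim().TrimEnd('%');
        return decimal.Parse(cleaned, NumberStyles.Number, CultureInfo.InvariantCulture);
    }

    // Builds a dictionary keyed by strike price from { strike, change, last } string triples.
    public static Dictionary<decimal, Quote> ToDictionary(IEnumerable<string[]> rows)
    {
        var result = new Dictionary<decimal, Quote>();
        foreach (var row in rows)
        {
            result[ParseNumber(row[0])] = new Quote
            {
                ChangePercent = ParseNumber(row[1]),
                LastPrice = ParseNumber(row[2])
            };
        }
        return result;
    }
}

public static class Program
{
    public static void Main()
    {
        // Sample values copied from the HTML extract in the question.
        var rows = new List<string[]> { new[] { "4,000.00", "-3.87%", "960.40" } };
        var quotes = QuoteParser.ToDictionary(rows);
        Console.WriteLine(quotes[4000m].LastPrice);
    }
}
```

Cells that are empty or hold non-numeric text would make `decimal.Parse` throw, so `decimal.TryParse` may be the safer choice on the live page.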


Upvotes: 1

Myke Black

Reputation: 1339

We did a similar project a few years back to spider all the major online betting websites and create a comparison tool to get the best prices for each type of event, e.g. displaying all the major bookmakers' betting odds for a particular football game in order of best return.

It turned out to be a complete nightmare: the rendered HTML output of the websites kept changing almost daily, and the sites quite often generated poorly formed HTML that could sometimes crash the spider daemon, so we had to constantly maintain the system to keep it working properly.

With these sorts of things it's often more economical to subscribe to a data feed, which requires much less maintenance and is easier to integrate.

Upvotes: 1

Ryan Mann

Reputation: 5357

You could use the Html Agility Pack. Unlike XmlDocument or XDocument, the Html Agility Pack is tolerant of malformed HTML (which exists all over the internet and probably on the site you are trying to parse).

Not all HTML pages can be assumed to be valid XML.

With the Html Agility Pack you can load your page and parse it with XPath or an object model similar to System.Xml.

Html Agility Pack
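
As a rough sketch of the XPath route, the following loads a shortened copy of the row from the question and reads the `span` text of each cell. The `XPathExample` class and its method names are made up for illustration, the long `id` attributes are omitted for brevity, and the XPath expressions assume the row layout shown in the question rather than the full live page.

```csharp
using System;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

public static class XPathExample
{
    // Shortened copy of the table row quoted in the question.
    public const string SampleHtml =
        "<table><tr>" +
        "<td class=\"quotesMaxTime1414148558\" id=\"notation115602071\"><span>4,000.00</span></td>" +
        "<td><span>0</span></td>" +
        "<td class=\"icon red\"><span>-3.87%</span></td>" +
        "<td><span>960.40</span></td>" +
        "</tr></table>";

    // Returns the text of every <span> that sits directly inside a table cell.
    public static string[] ReadCells(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var spans = doc.DocumentNode.SelectNodes("//tr/td/span");
        if (spans == null)
            return new string[0];

        var cells = new string[spans.Count];
        for (int i = 0; i < spans.Count; i++)
            cells[i] = HtmlEntity.DeEntitize(spans[i].InnerText).Trim();
        return cells;
    }

    // Reads a single cell by the id of its <td>; returns null if the cell is not found.
    public static string ReadCellById(string html, string id)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var span = doc.DocumentNode.SelectSingleNode("//td[@id='" + id + "']/span");
        return span == null ? null : span.InnerText.Trim();
    }

    public static void Main()
    {
        foreach (var cell in ReadCells(SampleHtml))
            Console.WriteLine(cell);
        Console.WriteLine(ReadCellById(SampleHtml, "notation115602071"));
    }
}
```

Against the real page, `new HtmlWeb().Load(url)` would replace `LoadHtml`, and the XPath would likely need to be narrowed to the table of interest.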

Alternatively, you could use a PDF-to-text converter and parse the resulting text file with much better accuracy, since the website you linked offers a PDF export of that same data:

PDF Export Link

Convert PDF to Text

Upvotes: 1

sm.abdullah

Reputation: 1812

If you are open to using CsQuery, then try this:

    static void Main()
    {
        CsQuery.CQ cq = CsQuery.CQ.CreateFromUrl("http://www.eurexchange.com/action/exchange-en/4744-19066/19068/quotesSingleViewOption.do?callPut=Put&maturityDate=201411");

        // CSS selector: the span inside the cell whose id is "notation115602071".
        string str = cq["#notation115602071 span"].Text();
    }

Upvotes: 1

Jonesopolis

Reputation: 25370

That's because your LINQ query hasn't executed yet: firstData is the query itself, not its results. If you expand the Results View in the debugger, the query runs and you'll get all the items, the first being the value you are looking for.

So, this will get you 4,000.00:

var firstData = (from x in document.DocumentNode.Descendants()
                 where x.Name == "td" && x.Attributes.Contains("class")
                 select x.InnerText).First();

If you want them all, call ToList() instead of First().
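
To see the deferred execution in isolation, here is a small sketch that uses plain strings instead of HTML nodes. The `DeferredDemo` class and the sample values are made up for illustration; the point is only that the query does not run until something enumerates it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeferredDemo
{
    public static List<string> Run()
    {
        var cells = new List<string> { "4,000.00", "0", "-3.87%" };

        // Defining the query does not run it; 'query' only describes the work to do.
        IEnumerable<string> query = from c in cells
                                    where c.Length > 1
                                    select c;

        // Printing the query object shows its type name, not the matching items.
        Console.WriteLine(query);

        // An item added after the query was defined is still picked up,
        // because the filter runs only when the query is enumerated.
        cells.Add("960.40");

        // ToList() enumerates the query and materializes the results.
        return query.ToList();
    }

    public static void Main()
    {
        foreach (var item in Run())
            Console.WriteLine(item);
    }
}
```

Calls such as `First()`, `ToList()`, or a `foreach` loop are what actually enumerate the query.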

Upvotes: 1
