AlexH

Reputation: 73

Get data from HTML child class

I’m attempting to create a tool, in C#, which gathers and analyses data from a web page/form. There are basically 2 different types of data: data entered by a user, and data created by the system (which I don’t have access to).

The data created by the user is kept in fields, and the form uses IDs, so GetElementById works for those. The problem I’m running into is obtaining the data created by the system. It shows on the form, but isn’t associated with an ID. I may be reading/interpreting the HTML incorrectly, but it appears to sit in a child class (I don’t have much HTML experience). I’m attempting to get the “Date Submitted” data (near the bottom of the code). Sample of the HTML code:

<div class="bottomSpace">
    <div class="importfromanotherorder">
        <div class="level2Panel" >

           <div class="left">
                <span id="if error" class="error"></span>
             </div>

           <div class="right">
                Enter Submission ID
                <input name="Submission$ID" type="text" id="Submission_ID" class="textbox" />
                <input type="submit" name="SumbitButton" value="Import" id="SubmitButton" />
            </div>
        </div>
    </div>
</div>

<div class="bottomSpace">
    <div class="detailsinfo">
        <div class="level2Panel" >

        <div class="left">
                <h5>Product ID</h5>
                1234567
                <h5>Sub ID</h5>
                Not available
                <h5>Product Type</h5>
                Type 1
        </div>

        <div class="right">
                <h5>Order Number</h5>
                0987654
              <h5>Status</h5>
                Ordered
                <h5>Date Submitted</h5>
                7 17 2012 5 45 09 AM
            </div>
        </div>
    </div>
</div>

Using GetElementsByTagName (searching for “div”) and then GetAttribute(“className”) (searching for “right”) generates some results, but as there are 2 divs with class “right”, it doesn’t single out the one I need.

I’ve tried searching by className = “detailsinfo”, which I can find, but I’m not sure how to get from there down to the “right” div. I tried the sibling and children properties, but they didn’t return what I expected. The next possible problem is that the date appears to be plain text belonging to the “right” div rather than to the “Date Submitted” element.

So basically, I’m curious what the best approach would be to get the data I’m looking for. Would I need to get all of the text in the “right” div and then try to extract the date string?

Apologies if there is too much info or not enough of the required info :) Thanks in advance!

EDIT: Added how GetElementsByTagName is called using C# - per Icarus's comment.

HtmlDocument doc = webBrowser1.Document;
HtmlElementCollection elemColl = doc.GetElementsByTagName("div");

Upvotes: 2

Views: 1022

Answers (1)

earth_tom

Reputation: 831

This will do it if the 'right' instance you want is the 2nd. Two approaches are given:

The commented-out approach indexes an XmlNodeList, which is zero-based, so it uses index 1. The second approach uses an XPath position predicate, which is one-based, so it uses position 2.

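private string ReadHTML(string html)
{
  // LoadXml needs well-formed XML with a single root element. The sample
  // fragment has two top-level <div>s, so it is wrapped in a dummy root here.
  // (This assumes the markup is otherwise well-formed, e.g. no unclosed tags.)
  System.Xml.XmlDocument doc = new System.Xml.XmlDocument();
  doc.LoadXml("<root>" + html + "</root>");
  System.Xml.XmlElement element = doc.DocumentElement;

  // This commented-out approach works and might be preferred if you want to
  // iterate over a node set instead of choosing just one node:
  //string key = "//div[@class='right']";
  //System.Xml.XmlNodeList setting = element.SelectNodes(key);
  //return setting[1].LastChild.InnerText.Trim();

  // This XPath approach selects exactly one node: the last text node
  // inside the second div whose class is 'right'.
  string key = "((//div[@class='right'])[2])/child::text()[last()]";
  System.Xml.XmlNode setting = element.SelectSingleNode(key);

  // SelectSingleNode returns null when nothing matches; Trim drops the
  // surrounding line breaks and indentation from the text node.
  return setting == null ? null : setting.InnerText.Trim();
}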
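
If the page can't be loaded as XML, the same idea can be sketched directly against the WebBrowser DOM the question already uses: find the "detailsinfo" div first, then search for "right" only inside it, so the other "right" div is never considered. This is a rough sketch, not tested against the real page. FormScraper, ReadLabelledValue and ValueAfterLabel are hypothetical names, and it assumes InnerText puts each heading and each value on its own line.

```csharp
using System;
using System.Windows.Forms;

static class FormScraper
{
    // Hypothetical helper: returns the value shown under the given heading
    // inside the "right" div nested in the "detailsinfo" div, or null.
    public static string ReadLabelledValue(HtmlDocument doc, string label)
    {
        foreach (HtmlElement outer in doc.GetElementsByTagName("div"))
        {
            if (outer.GetAttribute("className") != "detailsinfo") continue;

            // Search only the descendants of this container.
            foreach (HtmlElement inner in outer.GetElementsByTagName("div"))
            {
                if (inner.GetAttribute("className") != "right") continue;
                return ValueAfterLabel(inner.InnerText, label);
            }
        }
        return null;
    }

    // Returns the first non-blank line that follows the line equal to label.
    // Assumption: headings and values are rendered on separate lines.
    public static string ValueAfterLabel(string innerText, string label)
    {
        if (innerText == null) return null;

        bool labelSeen = false;
        foreach (string raw in innerText.Split('\r', '\n'))
        {
            string line = raw.Trim();
            if (line.Length == 0) continue;   // skip blank lines
            if (labelSeen) return line;       // the line right after the heading
            if (line == label) labelSeen = true;
        }
        return null;
    }
}
```

Called as FormScraper.ReadLabelledValue(webBrowser1.Document, "Date Submitted"), this should return the date text if the assumptions hold. A string in the sample's format could then be converted with DateTime.ParseExact using the format "M d yyyy h mm ss tt" and CultureInfo.InvariantCulture.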

Upvotes: 1
