Carlos Sanchez

Reputation: 1016

How to/Should I retrieve data from particularly formatted HTML without regex

I have a whole pile of HTML which is just a bunch of this:

<li id="entry-c7" data-user="ThisIsSomeonesUsername">
  <img width="28" height="28" class="avatar" src="http://very_long_url.png">
  <span class="time">6:07</span>
  <span class="username">ThisIsSomeonesUsername</span>
  <span class="message">This is my message. It is nice, no?</span>
</li>

This is repeated about a hundred thousand times (with different content, of course). It is all taken from an HtmlDocument by retrieving the element that holds it. The document comes from a WebBrowser control in a Windows Forms application. The retrieval looks like this:

HtmlDocument document = webBrowser1.Document;
HtmlElement element = document.GetElementById(chatElementId);

Assume "chatElementId" is just some known ID. What I would like to do is retrieve the content of "time" (6:07 in this example), "username" (ThisIsSomeonesUsername), and "message" (This is my message... etc.). The message portion can contain almost anything, including further HTML (such as links, images, etc.), and I want to keep all of that intact. I was going to use a regular expression to parse the InnerHtml of the element retrieved using the method above, but apparently this will bring about the destruction of the universe. How, then, should I go about doing this?

Edit: People keep suggesting Html Agility Pack, so is there an easy way to do this in Html Agility Pack without using the full HTML source? I'm not sure the rest of the HTML outside of this element is all that great... but should I just pass in the whole HTML anyway?

Upvotes: 0

Views: 99

Answers (2)

Noctis

Reputation: 11763

Read the link in Nico's answer... I was about to post the same one (it's hilarious).

Having said that, from your comments it seems like you're intent on regex. So, regex it away.
It shouldn't be hard to do.

Go to http://regexpal.com/, paste your data into the bottom part, play with the regex in the top part until you're happy with the result, and then just loop over your data and extract what you need to your heart's content.

(I'm not sure I'd do it myself, but sometimes a quick fix is better than a longer, more "correct" answer.)
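
For what it's worth, the loop described above might look roughly like this in C#. This is only a sketch: the class name ChatRegexSketch, the Extract method and the pattern itself are assumptions made for illustration, and the pattern assumes every entry has the same span order and double-quoted class attributes as the sample in the question. The markup that InnerHtml returns may differ in case or quoting from the page source, so the pattern may need adjusting.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ChatRegexSketch
{
    // Assumes each entry looks like the sample in the question:
    // time, username and message spans in that order, double-quoted classes.
    static readonly Regex EntryPattern = new Regex(
        @"<li\b[^>]*>.*?" +
        @"<span class=""time"">(?<time>.*?)</span>\s*" +
        @"<span class=""username"">(?<user>.*?)</span>\s*" +
        @"<span class=""message"">(?<message>.*?)</span>\s*</li>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns one (time, username, message) tuple per matched <li> entry.
    public static List<Tuple<string, string, string>> Extract(string html)
    {
        var results = new List<Tuple<string, string, string>>();
        foreach (Match m in EntryPattern.Matches(html))
        {
            results.Add(Tuple.Create(
                m.Groups["time"].Value,
                m.Groups["user"].Value,
                m.Groups["message"].Value));
        }
        return results;
    }
}
```

Calling ChatRegexSketch.Extract(element.InnerHtml) would give one tuple per entry. Nested tags inside the message are kept, because the message group only ends at the </span> that is directly followed by </li>. An entry that is missing one of the three spans would throw the matching off, which is the usual risk with this approach.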

Upvotes: 1

Nico

Reputation: 12683

Just an FYI: regex can't parse HTML in any usable fashion. See "RegEx match open tags except XHTML self-contained tags", for those who stumble across this post.

Now, for your requirement, have you tried using XmlDocument or XDocument?

Just try the following. Note that the img tag in your sample is missing its closing />; if that is also the case in your actual HTML, this won't work, because it is not valid XML.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// parse the XML; html must be well-formed with a single root element
// (the element that contains the li entries)
var xDoc = XDocument.Parse(html);

// create our list of results (basic tuple here, could be your own class)
var attributes = new List<Tuple<string, string, string>>();

// iterate over all li elements directly under the root
foreach (var element in xDoc.Root.Elements("li"))
{
    // find the time, username and message spans by their class attribute;
    // the (string) cast yields null when the attribute is missing
    XElement tElem = element.Elements("span").FirstOrDefault(x => (string)x.Attribute("class") == "time");
    XElement uElem = element.Elements("span").FirstOrDefault(x => (string)x.Attribute("class") == "username");
    XElement mElem = element.Elements("span").FirstOrDefault(x => (string)x.Attribute("class") == "message");

    // fall back to an empty string when a span is missing;
    // note that Value returns the text only, without any nested tags
    string time = tElem != null ? tElem.Value : "";
    string username = uElem != null ? uElem.Value : "";
    string message = mElem != null ? mElem.Value : "";

    // add to our list
    attributes.Add(Tuple.Create(time, username, message));
}
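
The question asks for any markup inside the message to be kept intact, whereas XElement.Value returns only the text. A minimal sketch of one way to keep the nested tags, assuming the li has already been parsed as well-formed XML; the class name MessageMarkupSketch and the helper InnerMarkup are hypothetical names used for illustration:

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class MessageMarkupSketch
{
    // Hypothetical helper: returns the inner markup of the first
    // <span class="message"> directly under the given <li>, or "" if none.
    public static string InnerMarkup(XElement li)
    {
        XElement span = li.Elements("span")
            .FirstOrDefault(x => (string)x.Attribute("class") == "message");
        if (span == null)
            return "";

        // Serialize every child node (text and nested elements) without
        // added indentation, then join the pieces back together.
        return string.Concat(
            span.Nodes().Select(n => n.ToString(SaveOptions.DisableFormatting)));
    }
}
```

Inside the loop, message = MessageMarkupSketch.InnerMarkup(element) could then stand in for mElem.Value. The result is re-serialized XML, so details such as attribute quoting or entity escaping may not match the original source character for character.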

Upvotes: 1
