How to/Should I retrieve data from particularly formatted HTML without regex

Question

I have a whole pile of HTML which is just a bunch of this:


  
  6:07
  ThisIsSomeonesUsername
  This is my message. It is nice, no?

This is repeated over and over again, about a hundred thousand times (with different content, of course). It is all taken from an HtmlDocument by retrieving the element which holds it. The document is retrieved from a WebBrowser control in a Windows Forms application. This looks like:

HtmlDocument document = webBrowser1.Document;
HtmlElement element = document.GetElementById(chatElementId);

Assume "chatElementId" is just some known ID. What I would like to do is retrieve the content in "time" (6:07 in this example), "username" (ThisIsSomeonesUsername), and "message" (This is my message... etc.). The message portion can contain almost anything, including further HTML (such as links, images, etc.), and I want to keep all of that intact. I was going to use a regular expression to parse the InnerHtml of the element retrieved above, but apparently this will bring about the destruction of the universe. How, then, should I go about doing this?

Edit: People keep suggesting Html Agility Pack, so is there an easy way to do this in Html Agility Pack without using the full HTML source? I'm not sure the rest of the HTML outside of this class is all that well formed... but should I just pass the whole HTML anyway?

Noctis · Accepted Answer

Read the link in Nico's answer ... I was about to post the same one (it's hilarious).

Having said that, from your comments it seems like you're intent on regex. So, regex it away.
It shouldn't be hard to do.

Go to http://regexpal.com/, paste your data into the bottom part, play with the regex in the top part until you're happy with the result, and then just loop over your data and extract what you need to your heart's content.

(I'm not sure I'd do it myself, but sometimes a quick fix is better than a longer, more "correct" answer.)
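For what it's worth, the "loop over your data" part could look roughly like the sketch below. The markup in the question did not survive, so the span tags, the wrapping div, and the class names "time", "username" and "message" are all assumptions here, as are the ChatEntry and ChatRegexSketch names; adjust the pattern to whatever InnerHtml actually returns. The WebBrowser control may also hand back tags in upper case or attributes without quotes, which is why the pattern ignores case and treats the quotes as optional.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical container for one chat line (not from the question).
public class ChatEntry
{
    public string Time;
    public string Username;
    public string Message; // raw inner HTML of the message, kept intact
}

public static class ChatRegexSketch
{
    // ASSUMED markup: three spans with classes time/username/message,
    // followed by the closing tag of a wrapping div.
    private static readonly Regex EntryPattern = new Regex(
        @"<span class=""?time""?>(?<time>.*?)</span>\s*" +
        @"<span class=""?username""?>(?<username>.*?)</span>\s*" +
        @"<span class=""?message""?>(?<message>.*?)</span>\s*</div>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns one ChatEntry per match found in the given HTML.
    public static List<ChatEntry> Extract(string innerHtml)
    {
        var entries = new List<ChatEntry>();
        foreach (Match m in EntryPattern.Matches(innerHtml))
        {
            entries.Add(new ChatEntry
            {
                Time = m.Groups["time"].Value,
                Username = m.Groups["username"].Value,
                Message = m.Groups["message"].Value
            });
        }
        return entries;
    }

    public static void Main()
    {
        // Stand-in for element.InnerHtml from the question.
        string sample = @"<div class=""entry"">
  <span class=""time"">6:07</span>
  <span class=""username"">ThisIsSomeonesUsername</span>
  <span class=""message"">This is my message. It is <a href=""#"">nice</a>, no?</span>
</div>";

        foreach (ChatEntry entry in Extract(sample))
        {
            Console.WriteLine(entry.Time + " | " + entry.Username + " | " + entry.Message);
        }
    }
}
```

The message group is lazy but only stops at a closing span that is immediately followed by the closing div, so links, images and even nested spans inside a message stay in the captured text. That only holds while every entry really has this exact shape; if the structure varies, a DOM-based approach is the safer route.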
