Reputation: 53

Problem extracting text out of html file using python regex

I'm working on a project that requires me to write some Python code to pull some text out of an HTML file.

<tr>
<td>Target binary file name:</td>
<td class="right">Doc1.docx</td>
</tr>

^Small portion of the html file that I'm interested in.

#! /usr/bin/python
import os
import re    

if __name__ == '__main__':
    f = open('./results/sample_result.html')
    soup = f.read()
    p = re.compile("binary")
    for line in soup:
        m = p.search(line)
        if m:
            print "finally"
            break

^Sample code I wrote to test whether I could extract the data. I've written several almost identical programs to extract text from txt files, and they have worked just fine. Is there something I'm missing out with regards to regex and html?

Upvotes: 0

Answers (3)

PaulMcG

Reputation: 63782

HTML as understood by browsers is waaaay too flexible for regular expressions. Attributes can pop up in any tag, in any order, in upper or lower case, and with or without quotation marks around the value. Special emphasis tags can show up anywhere. Whitespace is significant in a regex but not so much in HTML, so your regex has to be littered with \s*'s everywhere. There is no requirement that opening tags be matched with closing tags. Some opening tags include a trailing '/', meaning that they are empty tags (no body, no closing tag). Lastly, HTML is often nested, which is pretty much off the chart as far as regex is concerned.
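To make that concrete, here is a minimal sketch of pulling the value out with a parser instead of a regex. It assumes Python 3 and only the standard library's html.parser module; the names CellTextParser and find_value are made up for illustration, and the sample markup is the snippet from the question. The parser lowercases tag names and copes with unquoted attributes, so the caveats listed above do not need to be handled by hand.

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # list of rows, each a list of cell strings
        self._in_cell = False
        self._cell_parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, whatever case the file used.
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._in_cell = True
            self._cell_parts = []

    def handle_endtag(self, tag):
        if tag == "td" and self._in_cell:
            self._in_cell = False
            if not self.rows:   # tolerate a <td> with no enclosing <tr>
                self.rows.append([])
            self.rows[-1].append("".join(self._cell_parts).strip())

    def handle_data(self, data):
        # Text split by inline tags such as <i> is joined back together.
        if self._in_cell:
            self._cell_parts.append(data)


def find_value(html, label):
    """Return the cell that follows the cell equal to `label`, or None."""
    parser = CellTextParser()
    parser.feed(html)
    parser.close()
    for row in parser.rows:
        if len(row) >= 2 and row[0] == label:
            return row[1]
    return None


sample = """
<tr>
<td>Target binary file name:</td>
<td class="right">Doc1.docx</td>
</tr>
"""

print(find_value(sample, "Target binary file name:"))  # Doc1.docx
```

For a real report file you would read the whole file into a string and pass it to find_value. A third-party parser would likely be shorter, but this version needs nothing beyond the standard library.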

Upvotes: 0

Katriel

Reputation: 123772

Is this actually what you're trying to do, or just a simple example for a more complicated regex later? If the latter, listen to everyone else. If the former:

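with open('./results/sample_result.html') as f:
    for line in f:  # iterating over the file object yields whole lines
        if "binary" in line:
            pass  # do stuff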

If that doesn't work, are you sure "binary" is in the file? Not, I don't know, "<i>b</i>inary"?
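One more thing worth checking, judging from the code in the question: f.read() returns the whole file as a single string, and looping over a string yields one character at a time, so a search for "binary" in each piece can never match. Looping over the file object itself yields lines. A small sketch of the difference, assuming Python 3 and using io.StringIO as a stand-in for the real file (the html string is just the snippet from the question):

```python
import io
import re

html = (
    '<tr>\n'
    '<td>Target binary file name:</td>\n'
    '<td class="right">Doc1.docx</td>\n'
    '</tr>\n'
)
pattern = re.compile("binary")

# Iterating over the string returned by read() yields single characters,
# so the six-letter pattern never matches any individual piece.
text = io.StringIO(html).read()
char_hits = [piece for piece in text if pattern.search(piece)]

# Iterating over the file-like object itself yields whole lines.
line_hits = [line for line in io.StringIO(html) if pattern.search(line)]

print(len(char_hits))   # 0
print(line_hits)        # ['<td>Target binary file name:</td>\n']
```

Searching the full string once, with pattern.search(text), would also find the match. The line-by-line loop is only needed if you want to know which line it is on.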

Upvotes: 0

S.Lott

Reputation: 392010

Is there something I'm missing out with regards to regex and html?

Yes. You're missing the fact that some HTML cannot be parsed with a simple regex.

Upvotes: 4
