BenMyr

Reputation: 15

Custom HTMLParser with regex not returning correctly

I'm working on a program that scrapes information from an HTML file using several regular expressions. I've run into a problem with the following code.

My HTMLParser subclass:

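import re
from HTMLParser import HTMLParser  # Python 2.7 module name

class MyHtmlParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)  # calls self.reset() internally
        self.title = []
    def handle_data(self, d):
        # Look for the heading text in each text node
        result = re.search(r'ANMELDELSE .*(?=</b>)', d)
        if result:
            self.title.append(result.group(0))
    def return_data(self):
        return self.title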

Running the code:

with open(r'....', "r") as f:  # correct path to local test.html
    page = f.read()
parser = MyHtmlParser()
parser.feed(page)
print(parser.return_data())

The HTML file is really messy and in Norwegian, but here is a subset that should trigger the problem:

<p style="margin: 0cm 0cm 0pt;"><span style="text-decoration: underline;">Sak 428/18-123, 03.09.2018 </span></p>
<p style="margin: 0cm 0cm 0pt;"><b>&nbsp;</b></p>
<p style="margin: 0cm 0cm 0pt;"><b>ANMELDELSE FOR TRAKASSERING</b></p>

This should select "ANMELDELSE FOR TRAKASSERING", and it does on both https://regex101.com/ and https://regexr.com/, but when I run the code, all that gets printed is an empty list. The code has worked with my previous regex patterns, so I'm a bit lost.

Hope someone can help!

Upvotes: 1

Views: 35

Answers (1)

Wiktor Stribiżew

Reputation: 626691

Provided ANMELDELSE only appears inside a text node, you can grab it with

r'ANMELDELSE[^<>]*'

Your original pattern contains a literal regular space (\x20). In place of that space, HTML like this often uses a non-breaking space (\xa0) to keep the next word on the same line in text editors/viewers, and a literal space in the pattern will not match it.
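To illustrate the difference, here is a small sketch. Both sample strings are made up for the demonstration (the second one assumes a U+00A0 after the first word); your actual file may or may not contain one at that position.

```python
import re

# Pattern with a literal regular space, as in the question (lookahead left out).
pattern = r'ANMELDELSE .*'

# Hypothetical samples: identical on screen, different second character.
regular = u'ANMELDELSE FOR TRAKASSERING'        # U+0020 after the first word
non_breaking = u'ANMELDELSE\u00a0FOR TRAKASSERING'  # U+00A0 after the first word

match_regular = re.search(pattern, regular)
match_non_breaking = re.search(pattern, non_breaking)

print(match_regular is not None)       # the regular space matches
print(match_non_breaking is not None)  # the non-breaking space does not
```

Printing repr() of the text node is a quick way to check which kind of space you actually have.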

To match it, you could use \s and pass the re.U flag to re.search (the flag is required because you are using Python 2.7). However, since you want to match up to the end of the tag, it is simpler to use a negated character class, [^<>]*, which matches zero or more characters other than < and >.
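Putting it together, a minimal sketch of the parser with the negated character class might look like this. The class name TitleParser and the SAMPLE string (copied from the snippet in the question) are only for illustration, and the try/except import is an assumption so that the same code runs on Python 2.7 and Python 3. Keep in mind that handle_data is only passed the text between tags, so the pattern has no need to look for </b>.

```python
import re

try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2.7

# The keyword plus everything up to the next tag boundary.
TITLE_RE = re.compile(r'ANMELDELSE[^<>]*')


class TitleParser(HTMLParser):  # hypothetical name, for illustration only
    def __init__(self):
        HTMLParser.__init__(self)
        self.titles = []

    def handle_data(self, data):
        # data is the text of one text node; it never contains tags.
        match = TITLE_RE.search(data)
        if match:
            self.titles.append(match.group(0))


# Two lines taken from the HTML subset in the question.
SAMPLE = (
    '<p style="margin: 0cm 0cm 0pt;"><b>&nbsp;</b></p>\n'
    '<p style="margin: 0cm 0cm 0pt;"><b>ANMELDELSE FOR TRAKASSERING</b></p>\n'
)

parser = TitleParser()
parser.feed(SAMPLE)
parser.close()
titles = parser.titles
print(titles)  # ['ANMELDELSE FOR TRAKASSERING']
```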

Upvotes: 1
