Custom HTMLParser with regex not returning correctly

Question

I'm working on a program that scrapes information from an HTML file using several regular expressions. I've run into a problem with the following code.

My HTMLParser subclass:

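import re
from HTMLParser import HTMLParser  # Python 2.7 module name

class MyHtmlParser(HTMLParser):
    def __init__(self):
        self.reset()
        self.title = []  # collected matches
    def handle_data(self, d):
        # called once for each text node in the document
        result = re.search(r'ANMELDELSE .*(?=)', d)
        if result:
            self.title.append(result.group(0))
    def return_data(self):
        return self.title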

Running the code:

with open(r'....', "r") as f: #correct path to local test.html
    page = f.read()
parser = MyHtmlParser()
parser.feed(page)
print(parser.return_data())

The HTML file is really messy and in Norwegian, but here is an excerpt that should trigger the match:

Sak 428/18-123, 03.09.2018 
 
ANMELDELSE FOR TRAKASSERING

This should select "ANMELDELSE FOR TRAKASSERING", and it does on both https://regex101.com/ and https://regexr.com/, but when I run the code, all that gets printed is an empty list. The code has worked with my previous regexes, so I'm a bit lost.

Hope someone can help!

Wiktor Stribiżew · Accepted Answer

Provided that ANMELDELSE only appears inside a text node, you can grab it using

r'ANMELDELSE[^<>]*'

Your original pattern contains a literal regular space (\x20). In the HTML, the character after ANMELDELSE is probably a non-breaking space (\xa0) instead, which is often used to keep the next word on the same line in text editors and viewers. A literal space in the pattern does not match it.
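A minimal sketch of the mismatch, assuming the character after ANMELDELSE in the file really is U+00A0 (the question does not show the raw bytes, so both sample strings below are made up for illustration):

```python
# -*- coding: utf-8 -*-
import re

# Assumption: the gap after ANMELDELSE is a non-breaking space (U+00A0).
with_nbsp = u"ANMELDELSE\u00a0FOR TRAKASSERING"
# Same text with an ordinary space (U+0020), as typed into an online tester.
with_space = u"ANMELDELSE FOR TRAKASSERING"

# Pattern from the question: a literal \x20 follows ANMELDELSE.
pattern = u"ANMELDELSE .*(?=)"

match_nbsp = re.search(pattern, with_nbsp)
match_space = re.search(pattern, with_space)

print(match_nbsp)                           # None
print(match_space.group(0) == with_space)   # True
print(ord(with_nbsp[10]))                   # 160
```

This would also explain why the online testers matched: text pasted into them may well have had the non-breaking space converted to a regular one.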

To match it, you could use \s and pass the re.U modifier to re.search (the flag is required because you are using Python 2.7). However, since you want to match up to the next tag, just use a negated character class, [^<>]*, which matches zero or more characters other than < and >.
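Both options can be sketched as follows. The class name TitleParser and the sample markup are hypothetical (the real file's tags are not shown in the question), and the import fallback is only there so the sketch runs on Python 2.7 as well as Python 3:

```python
# -*- coding: utf-8 -*-
import re

try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2.7


class TitleParser(HTMLParser):  # hypothetical name, mirrors MyHtmlParser
    def __init__(self):
        HTMLParser.__init__(self)  # the base initializer calls reset()
        self.title = []

    def handle_data(self, d):
        # [^<>]* matches any characters except angle brackets,
        # so a non-breaking space after ANMELDELSE is matched too.
        result = re.search(u"ANMELDELSE[^<>]*", d)
        if result:
            self.title.append(result.group(0))


# Made-up markup; U+00A0 after ANMELDELSE is an assumption.
page = (u"<p>Sak 428/18-123, 03.09.2018</p>"
        u"<p>ANMELDELSE\u00a0FOR TRAKASSERING</p>")

parser = TitleParser()
parser.feed(page)
titles = parser.title
print(titles == [u"ANMELDELSE\u00a0FOR TRAKASSERING"])  # True

# Alternative: \s with re.U also matches the non-breaking space.
alt = re.search(u"ANMELDELSE\\s[^<>]*", u"ANMELDELSE\u00a0FOR TRAKASSERING", re.U)
print(alt is not None)  # True
```

The negated character class is the simpler choice here because it does not depend on which kind of whitespace follows ANMELDELSE.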
