Ryflex
Ryflex

Reputation: 5779

How to fix this non working regex pattern matching?

I have a pattern I am trying to match using re.compile. However, I cannot get the script to yield the desired result. Below is an example of some HTML code I am hoping to scrape, from the below HTML I hope to produce two list items.

Also below is my attempt at selecting the two list items:

import re

def getData():  

    trans_array = "" ##HTML data here
    pattern2 = re.compile('<table width="100%" border="0" class="tbl t3 mobile-collapse">(.*)</table>')

    print re.findall(pattern2, trans_array)

getData()

My feeling is that the code I used should work, but it has not. Any advice or comments would be appreciated.

Upvotes: 1

Views: 111

Answers (2)

Oleg Eterevsky
Oleg Eterevsky

Reputation: 1614

By default . in regular expression does not match new line characters. Add flags=re.S parameter to re.compile, and your regexp will work.

Upvotes: 3

user2555451
user2555451

Reputation:

Unless you tell it otherwise, the . in Regex will not match newlines. However, instead of using flags=re.S to fix this, I think a cleaner solution would be to just use the Regex syntax itself:

re.compile('(?s)<table width="100%" border="0" class="tbl t3 mobile-collapse">(.*?)</table>')

(?s) does the same thing as flags=re.S.

Also, I think you want to make your matches nongreedy to maximize matching. That is done by using (.*?) instead of (.*)

Upvotes: 1

Related Questions