Reputation: 5779
I have a pattern I am trying to match using re.compile. However, I cannot get the script to yield the desired result. Below is an example of some HTML code I am hoping to scrape; from this HTML I hope to produce two list items.
Also below is my attempt at selecting the two list items:
import re

def getData():
    trans_array = ""  # HTML data here
    pattern2 = re.compile('<table width="100%" border="0" class="tbl t3 mobile-collapse">(.*)</table>')
    print(re.findall(pattern2, trans_array))

getData()
My feeling is that the code I used should work, but it does not. Any advice or comments would be appreciated.
Upvotes: 1
Views: 111
Reputation: 1614
By default, . in a regular expression does not match newline characters. Add the flags=re.S parameter to re.compile, and your regexp will work.
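To illustrate the point about . and newlines, here is a minimal sketch. The HTML string and the shortened class name are made up for the example and are not the asker's real data.

```python
import re

# Hypothetical sample: one table whose contents span several lines.
html = '<table class="t3">\n<tr><td>row</td></tr>\n</table>'

pattern = r'<table class="t3">(.*)</table>'

# Without re.S, "." stops at the first newline, so nothing matches.
no_flag = re.findall(pattern, html)

# With re.S (alias re.DOTALL), "." also matches "\n".
with_flag = re.findall(pattern, html, flags=re.S)

print(no_flag)
print(with_flag)
```

The first call should come back empty, while the second returns the table body including its newlines.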
Upvotes: 3
Reputation:
Unless you tell it otherwise, the . in a regex will not match newlines. However, instead of using flags=re.S to fix this, I think a cleaner solution would be to just use the regex syntax itself:
re.compile('(?s)<table width="100%" border="0" class="tbl t3 mobile-collapse">(.*?)</table>')
(?s) does the same thing as flags=re.S.
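A quick way to convince yourself the two spellings are equivalent, sketched with a toy string rather than the real HTML:

```python
import re

# Toy input (an assumption for illustration): "a" and "b" separated by a newline.
text = 'a\nb'

inline = re.compile(r'(?s)a(.*)b')          # flag set inside the pattern
flagged = re.compile(r'a(.*)b', flags=re.S)  # flag passed as an argument

inline_result = inline.findall(text)
flagged_result = flagged.findall(text)

# Both compiled patterns report the DOTALL bit as set.
inline_has_dotall = bool(inline.flags & re.S)
flagged_has_dotall = bool(flagged.flags & re.S)

print(inline_result, flagged_result)
```

Which form to use is mostly a matter of taste; the inline form keeps the behavior attached to the pattern string itself.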
Also, I think you want to make your match non-greedy, so that each table is matched separately instead of one match running from the first opening tag to the last </table>. That is done by using (.*?) instead of (.*).
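The difference between the two quantifiers shows up as soon as the input has more than one table. This is a small sketch with invented HTML and a bare <table> tag standing in for the full opening tag from the question:

```python
import re

# Hypothetical input with two tables, each spanning several lines.
html = '<table>\nA\n</table>\n<p>x</p>\n<table>\nB\n</table>'

# Greedy: (.*) runs to the LAST </table>, giving a single oversized match.
greedy = re.findall(r'(?s)<table>(.*)</table>', html)

# Non-greedy: (.*?) stops at the FIRST </table> after each opening tag.
lazy = re.findall(r'(?s)<table>(.*?)</table>', html)

print(len(greedy))
print(lazy)
```

With the greedy version the text between the tables ends up inside the one match; the non-greedy version yields one item per table.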
Upvotes: 1