classyhobo

Reputation: 245

Python 2.7 re.MULTILINE troubles

I am new to Python and have been trying to port my PHP regex to Python, but I have run into problems matching across multiple lines. I have searched the internet for the past couple of days and can't make sense of it, so any help would be great. Here is the regex I have written:

mlsTagRegex = re.compile("<td\swidth=\"13%\"\sclass=\"TopHeader\">(.*?)</td>", re.MULTILINE)
tdTags = mlsTagRegex.findall(output.getvalue())
print tdTags

Here is the HTML I would like it to find:

<td width="13%" class="TopHeader">

   <span class="red">I WANT THIS PART</span>

</td>

but it just gives me an empty list. I'm pretty sure what I'm missing is fairly simple, but as I said, I am new to Python. Can anyone help? Thanks!

P.S. The string passed to findall is what pycurl returned, and that part of the HTML is in there.

Upvotes: 3

Views: 335

Answers (2)

Zach Kelling

Reputation: 53879

You need to use re.DOTALL to make . match newline characters:

mlsTagRegex = re.compile(r'<td width="13%" class="TopHeader">(.*?)</td>', re.DOTALL)
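As a quick check, here is a minimal sketch that runs the same pattern against the HTML from the question, with and without the flag. The `html` string is a stand-in for whatever `output.getvalue()` returns, and the variable names are only illustrative.

```python
import re

# Sample HTML copied from the question; in the real script this
# would be the string returned by output.getvalue().
html = '''<td width="13%" class="TopHeader">

   <span class="red">I WANT THIS PART</span>

</td>'''

pattern = r'<td width="13%" class="TopHeader">(.*?)</td>'

# Without DOTALL, '.' cannot match the newline right after the
# opening tag, so nothing is found.
no_flag = re.findall(pattern, html)

# With DOTALL, '.' also matches '\n', so the group can span lines.
with_flag = re.findall(pattern, html, re.DOTALL)

print(no_flag)
print(with_flag)
```

The captured group still includes the surrounding whitespace and the inner `<span>` tag, so it will probably need `.strip()` or further processing.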

But really, you should avoid using regex to parse HTML; use BeautifulSoup or lxml instead.
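To give a rough idea of what the parser approach looks like, here is a sketch that uses the standard library's HTMLParser as a stand-in for BeautifulSoup or lxml, since it needs no extra install. The class name `TopHeaderText` is made up for this example, and it assumes the target cells do not contain nested tables.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2.7


class TopHeaderText(HTMLParser):
    """Collect the text inside every <td class="TopHeader"> cell.

    Hypothetical helper for illustration; assumes no nested <td> inside
    the matching cell.
    """

    def __init__(self):
        HTMLParser.__init__(self)
        self.inside = False  # True while inside a matching <td>
        self.cells = []      # one text string per matching cell

    def handle_starttag(self, tag, attrs):
        if tag == 'td' and dict(attrs).get('class') == 'TopHeader':
            self.inside = True
            self.cells.append('')

    def handle_endtag(self, tag):
        if tag == 'td':
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.cells[-1] += data


html = '''<td width="13%" class="TopHeader">

   <span class="red">I WANT THIS PART</span>

</td>'''

parser = TopHeaderText()
parser.feed(html)
parser.close()

# Strip the whitespace that surrounds the text inside the cell.
texts = [cell.strip() for cell in parser.cells]
print(texts)
```

Unlike the regex, this does not depend on attribute order or on the exact spacing inside the tag. BeautifulSoup and lxml give the same robustness with much less code.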

Upvotes: 2

Ceramic Pot

Reputation: 280

Use re.DOTALL so that the '.' character matches any character, including a newline.
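For contrast, re.MULTILINE only changes where '^' and '$' match; it does not affect '.'. A small sketch of the difference, using a made-up two-line string:

```python
import re

text = "first line\nsecond line"

# MULTILINE changes the anchors: '^' matches at the start of every line.
line_starts = re.findall(r'^\w+', text, re.MULTILINE)

# Under MULTILINE alone, '.' still refuses to cross the newline.
multiline_only = re.search(r'first.*second', text, re.MULTILINE)

# DOTALL changes '.', so the same pattern now spans both lines.
dotall = re.search(r'first.*second', text, re.DOTALL)

print(line_starts)
print(multiline_only)
print(dotall.group(0))
```

The two flags can be combined with `re.MULTILINE | re.DOTALL` when a pattern needs both behaviors.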

Upvotes: 1
