maurizio de ruggiero

Reputation: 63

Python regular expression in html

I am trying to extract the Size value from an HTML page.

The HTML is:

<tr>
<td style="padding-left: 5px;" class="subheader" 
valign="top" width="147" align="right">Size</td>
<td valign="top" style="padding-left: 5px;">1.64 GB in 2 
file(s)</td>
</tr>

I tried this

size = re.search (r"""<tr>
<td style="padding-left: 5px;" class="subheader" 
valign="top" width="147" align="right">Size</td>
<td valign="top" style="padding-left: 5px;">.+ in \d
file(s)</td>
</tr>""", Text) 

But I get None. I only need it to give the 1.64 GB part. What is wrong with it?

Upvotes: 0

Views: 62

Answers (3)

Sumit Jha

Reputation: 1691

BeautifulSoup is a better option for HTML parsing. However, if you want to use a regular expression, here is what you can do:

import re
regex = r"<td.*>\s*(\d+[.]\d+\s+\w+).*<\/td>"
test_str = ("<tr> \n"
    "<td style=\"padding-left: 5px;\" class=\"subheader\"  \n"
    "valign=\"top\" width=\"147\" align=\"right\">Size</td> \n"
    "<td valign=\"top\" style=\"padding-left: 5px;\">1.64 GB in 2  \n"
    "file(s)</td> \n"
    "</tr>")

matches = re.search(regex, test_str, re.DOTALL)
# re.search returns None when nothing matches, so check before calling group().
if matches:
    print(matches.group(1))
else:
    print("no match")

Output

1.64 GB
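As for why the pattern in the question returns None, one likely cause is the unescaped parentheses in `file(s)`, which a regex treats as a capture group rather than literal characters. The sketch below illustrates this on a shortened snippet; the `snippet` string and the exact whitespace in it are assumptions based on the HTML shown in the question, and the real page may differ.

```python
import re

# Shortened stand-in for the second <td> from the question (an assumption).
snippet = "1.64 GB in 2 \nfile(s)</td>"

# Unescaped parentheses form a group, so this pattern looks for the
# text "files", which does not occur in the snippet.
unescaped = re.search(r"file(s)", snippet)

# Escaping the parentheses matches them literally; \s* tolerates the
# trailing space and newline between "2" and "file(s)".
escaped = re.search(r"(\d+\.\d+ \w+) in \d+\s*file\(s\)", snippet)

print(unescaped)
print(escaped.group(1))
```

Literal spaces and newlines inside a pattern must also match the page exactly, which is why `\s*` is used here instead of a hard-coded line break.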

Upvotes: 1

ryanmrubin

Reputation: 728

In general, I would avoid using regexes to parse HTML. It is likely easier to use BeautifulSoup or a similar library. Using BeautifulSoup in Python:

In [1]: from bs4 import BeautifulSoup

In [2]: soup = BeautifulSoup(html, 'html.parser')

In [3]: soup
Out[3]: 
<tr>
<td align="right" class="subheader" style="padding-left: 5px;" valign="top" width="147">Size</td>
<td style="padding-left: 5px;" valign="top">1.64 GB in 2 
file(s)</td>
</tr>

In [4]: soup.tr
Out[4]: 
<tr>
<td align="right" class="subheader" style="padding-left: 5px;" valign="top" width="147">Size</td>
<td style="padding-left: 5px;" valign="top">1.64 GB in 2 
file(s)</td>
</tr>

In [5]: soup.tr.find_all('td')
Out[5]: 
[<td align="right" class="subheader" style="padding-left: 5px;" valign="top" width="147">Size</td>,
 <td style="padding-left: 5px;" valign="top">1.64 GB in 2 
 file(s)</td>]

In [6]: soup.tr.find_all('td')[1]
Out[6]: 
<td style="padding-left: 5px;" valign="top">1.64 GB in 2 
file(s)</td>

In [7]: soup.tr.find_all('td')[1].text
Out[7]: '1.64 GB in 2 \nfile(s)'

If you need a more targeted way of searching the HTML, BeautifulSoup provides several, such as find_all with attribute filters and CSS selectors via select.
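For instance, a minimal sketch of a more targeted lookup, assuming the label cell always has class "subheader" and the exact text "Size", and that the value sits in the next td of the same row. The variable names here are illustrative, not part of the session above.

```python
from bs4 import BeautifulSoup

html = """<tr>
<td style="padding-left: 5px;" class="subheader"
valign="top" width="147" align="right">Size</td>
<td valign="top" style="padding-left: 5px;">1.64 GB in 2
file(s)</td>
</tr>"""

soup = BeautifulSoup(html, "html.parser")

# Locate the label cell by class and exact text (assumed to be stable).
label = soup.find("td", class_="subheader", string="Size")

# Step to the next <td> sibling, skipping whitespace-only text nodes.
value_cell = label.find_next_sibling("td") if label else None
size_text = value_cell.get_text() if value_cell else None

print(repr(size_text))
```

This avoids relying on the position of the row in the document, which helps if the page contains other tr elements before this one.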

Once you have the text in question, you can parse it with a regex, string methods, or however else you like. Without knowing your whole HTML document or what the other td elements like this look like, I can't give you the exact regex or the exact BeautifulSoup calls to use. But this should get you close.
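As a rough sketch of that last step, here are both routes applied to the cell text from Out[7]. It assumes the text always has the form "<number> <unit> in <n> file(s)"; other rows on the real page may not follow that shape.

```python
import re

# Cell text as returned by .text in the session above.
cell_text = "1.64 GB in 2 \nfile(s)"

# Collapse runs of whitespace (including the newline) to single spaces.
normalized = " ".join(cell_text.split())

# Regex route: a number with optional decimals, then a unit made of letters.
match = re.match(r"(\d+(?:\.\d+)?)\s*([A-Za-z]+)", normalized)
size = f"{match.group(1)} {match.group(2)}" if match else None

# String-method route: keep everything before the first " in ".
size_via_partition = normalized.partition(" in ")[0]

print(size)
print(size_via_partition)
```

Both routes print 1.64 GB for this input. The regex route is stricter, since it yields None when the text does not start with a number and a unit.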

Upvotes: 1

Rakesh

Reputation: 82795

It is a better idea to parse HTML using an HTML parser.

Ex: Using BeautifulSoup

import re
from bs4 import BeautifulSoup
s = """<tr>
<td style="padding-left: 5px;" class="subheader" 
valign="top" width="147" align="right">Size</td>
<td valign="top" style="padding-left: 5px;">1.64 GB in 2 
file(s)</td>
</tr>"""
soup = BeautifulSoup(s, "html.parser")
cell_text = soup.tr.td.find_next('td').text
print(cell_text)
# Use a regex to keep only the size; the dot is escaped so it matches a literal ".".
print(re.findall(r"\d+\.\d+ [A-Z]+", cell_text.strip()))

Output:

1.64 GB in 2 
file(s)
['1.64 GB']

Upvotes: 1
