Reputation: 63
I am trying to extract the Size value from an HTML page.
The HTML is:
<tr>
<td style="padding-left: 5px;" class="subheader"
valign="top" width="147" align="right">Size</td>
<td valign="top" style="padding-left: 5px;">1.64 GB in 2
file(s)</td>
</tr>
I tried this:
size = re.search (r"""<tr>
<td style="padding-left: 5px;" class="subheader"
valign="top" width="147" align="right">Size</td>
<td valign="top" style="padding-left: 5px;">.+ in \d
file(s)</td>
</tr>""", Text)
But I get None. I only need it to return the 1.64 GB part. What is wrong with it?
Upvotes: 0
Views: 62
Reputation: 1691
BeautifulSoup is a better option for HTML parsing. However, if you want to use a regular expression, here is what you can do.
import re
regex = r"<td.*>\s*(\d+[.]\d+\s+\w+).*<\/td>"
test_str = ("<tr> \n"
"<td style=\"padding-left: 5px;\" class=\"subheader\" \n"
"valign=\"top\" width=\"147\" align=\"right\">Size</td> \n"
"<td valign=\"top\" style=\"padding-left: 5px;\">1.64 GB in 2 \n"
"file(s)</td> \n"
"</tr>")
matches = re.search(regex, test_str, re.DOTALL)
try:
    print(matches.group(1))
except Exception as e:
    print(e)
Output
1.64 GB
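As for why the original pattern returns None: in a regex, unescaped parentheses form a group, so `file(s)` looks for the literal text `files` rather than `file(s)`. The pattern also has no capturing group around the size. Below is a minimal sketch of both points, assuming the HTML is exactly the snippet from the question (the variable names `html`, `broken` and `fixed` are mine, not from the original post):

```python
import re

# Assumption: the page contains exactly the snippet from the question.
html = """<tr>
<td style="padding-left: 5px;" class="subheader"
valign="top" width="147" align="right">Size</td>
<td valign="top" style="padding-left: 5px;">1.64 GB in 2
file(s)</td>
</tr>"""

# Unescaped "(s)" is a group that matches a bare "s",
# so this searches for "files</td>" and finds nothing.
broken = re.search(r"in \d\s+file(s)</td>", html)

# Escaped parentheses match literally; the capturing group
# isolates the size, and \s+ tolerates the line break.
fixed = re.search(r">\s*(\d+(?:\.\d+)?\s+\w+) in \d+\s+file\(s\)</td>", html)

print(broken)
print(fixed.group(1))
```

Matching whitespace with `\s+` rather than literal newlines also makes the pattern less sensitive to trailing spaces or different line endings in the real page.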
Upvotes: 1
Reputation: 728
In general, I would avoid using regexes to parse HTML. It is likely easier to use BeautifulSoup or a similar library. Using BeautifulSoup in Python (here `html` is a string holding the snippet from the question):
In [1]: from bs4 import BeautifulSoup
In [2]: soup = BeautifulSoup(html, 'html.parser')
In [3]: soup
Out[3]:
<tr>
<td align="right" class="subheader" style="padding-left: 5px;" valign="top" width="147">Size</td>
<td style="padding-left: 5px;" valign="top">1.64 GB in 2
file(s)</td>
</tr>
In [4]: soup.tr
Out[4]:
<tr>
<td align="right" class="subheader" style="padding-left: 5px;" valign="top" width="147">Size</td>
<td style="padding-left: 5px;" valign="top">1.64 GB in 2
file(s)</td>
</tr>
In [5]: soup.tr.find_all('td')
Out[5]:
[<td align="right" class="subheader" style="padding-left: 5px;" valign="top" width="147">Size</td>,
<td style="padding-left: 5px;" valign="top">1.64 GB in 2
file(s)</td>]
In [6]: soup.tr.find_all('td')[1]
Out[6]:
<td style="padding-left: 5px;" valign="top">1.64 GB in 2
file(s)</td>
In [7]: soup.tr.find_all('td')[1].text
Out[7]: '1.64 GB in 2 \nfile(s)'
If you need a more targeted way of searching the HTML, BeautifulSoup provides a number of those.
Once you have the text in question, you can parse it with a regex, string methods, or however else you'd like. Without knowing your whole HTML document or what the other td elements look like, I can't tell you the exact regex or the exact way to use BeautifulSoup. But this should get you close.
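To illustrate that last step, here is a rough sketch of both approaches applied to the string from Out[7]. It assumes the size always comes first and uses a unit such as KB, MB, GB or TB; the variable names are just for illustration.

```python
import re

# The .text value shown in Out[7] above.
text = '1.64 GB in 2 \nfile(s)'

# String methods: split() with no arguments splits on any run of
# whitespace (including the newline), so the first two tokens are
# the number and the unit.
parts = text.split()
size_from_split = ' '.join(parts[:2])

# Regex: a number with an optional decimal part, followed by a unit.
# The unit list is an assumption about what the page can contain.
match = re.search(r'\d+(?:\.\d+)?\s*[KMGT]?B', text)
size_from_regex = match.group(0) if match else None

print(size_from_split)
print(size_from_regex)
```

The regex version is a little more robust if the cell text ever has something in front of the size, while the split version is simpler when the layout is fixed.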
Upvotes: 1
Reputation: 82795
It is a better idea to parse HTML using an HTML parser.
Example using BeautifulSoup:
import re
from bs4 import BeautifulSoup
s = """<tr>
<td style="padding-left: 5px;" class="subheader"
valign="top" width="147" align="right">Size</td>
<td valign="top" style="padding-left: 5px;">1.64 GB in 2
file(s)</td>
</tr>"""
soup = BeautifulSoup(s, "html.parser")
print(soup.tr.td.find_next('td').text)
print(re.findall(r"\d+\.\d+ [A-Z]+", soup.tr.td.find_next('td').text.strip()))  # Use a regex to get only the required data.
Output:
1.64 GB in 2
file(s)
['1.64 GB']
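If installing BeautifulSoup is not an option, the same idea can be sketched with the standard library's `html.parser` module. This is only a rough illustration: the `CellCollector` class is a hypothetical helper written for this example, and it assumes the value cell directly follows the cell labelled "Size".

```python
import re
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Hypothetical helper: collects the text of every <td> element."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current cell.
        if self._in_td:
            self.cells[-1] += data


s = """<tr>
<td style="padding-left: 5px;" class="subheader"
valign="top" width="147" align="right">Size</td>
<td valign="top" style="padding-left: 5px;">1.64 GB in 2
file(s)</td>
</tr>"""

parser = CellCollector()
parser.feed(s)
parser.close()

# Assumption: the value cell comes right after the "Size" label cell.
label_index = [cell.strip() for cell in parser.cells].index("Size")
value = parser.cells[label_index + 1]
size = re.search(r"\d+(?:\.\d+)?\s+[A-Z]+", value).group(0)
print(size)
```

This is more code than the BeautifulSoup version, which is a large part of why BeautifulSoup is usually recommended for this kind of task.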
Upvotes: 1