Reputation: 3860
I have gone through the post on why not to use regular expressions for HTML. As part of the task given to me, I have no choice but to use a regular expression for HTML.
I have the following HTML, which I first tried on its own:
<td class="a-nowrap">
<span class="a-letter-space"></span><span>13</span>
</td>
I was able to get the 13 using the following regular expression:
<td class="a-nowrap">\s*<span class="a-letter-space"></span><span>(\d*)</span>\s*</td>
and similarly from
<td class="a-nowrap">
<a class="a-link-normal" title="69% of reviews have 5 stars" href="">5 star</a><span class="a-letter-space"></span>
</td>
I got 5 star using the regular expression
<td class="a-nowrap">\s*<a class="a-link-normal" [^>]*>\s*(.*)</a>\s*</td>
But when both pieces of HTML are combined like this,
<table id="histogramTable" class="a-normal a-align-middle a-spacing-base">
<tr class="a-histogram-row">
<td class="a-nowrap">
<a class="a-link-normal" title="69% of reviews have 5 stars" href="">5 star</a><span class="a-letter-space"></span>
</td>
<td class="a-span10">
<a class="a-link-normal" title="69% of reviews have 5 stars" href=""><div class="a-meter"><div class="a-meter-bar" style="width: 69.1358024691358%;"></div></div></a>
</td>
<td class="a-nowrap">
<span class="a-letter-space"></span><span>13</span>
</td>
</tr>
<td class="a-nowrap">
<a class="a-link-normal" title="2% of reviews have 1 stars" href="">1 star</a><span class="a-letter-space"></span>
</td>
<td class="a-span10">
<a class="a-link-normal" title="2% of reviews have 1 stars" href=""><div class="a-meter"><div class="a-meter-bar" style="width: 2.46913580246914%;"></div></div></a>
</td>
<td class="a-nowrap">
<span class="a-letter-space"></span><span>2</span>
</td>
</table>
How can I extract 5 star and 13 using a regular expression?
Upvotes: 0
Views: 6335
Reputation: 3059
If you don't want to use an HTML parser, either apply one regex after the other or join the two patterns with .* between them. I have modified your star regex a bit, as it didn't work properly.
First enable the dotall flag (s), then use this:
<td class="a-nowrap">\s*<a class="a-link-normal" [^>]*>\s*(\d star).*<td class="a-nowrap">\s*<span class="a-letter-space"></span><span>(\d*)</span>\s*</td>
Output:
Group 1: 5 star
Group 2: 13
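A minimal Python sketch of the joined pattern, run against the combined table from the question. The names LABEL, COUNT, greedy and lazy are only illustrative. One thing to watch for, assuming the table has more than one row: with the dotall flag a greedy .* runs to the last count cell, so the first label gets paired with the last count. A lazy .*? stops at the nearest count cell instead.

```python
import re

# Combined table from the question, pasted as-is.
html = """<table id="histogramTable" class="a-normal a-align-middle a-spacing-base">
<tr class="a-histogram-row">
<td class="a-nowrap">
<a class="a-link-normal" title="69% of reviews have 5 stars" href="">5 star</a><span class="a-letter-space"></span>
</td>
<td class="a-span10">
<a class="a-link-normal" title="69% of reviews have 5 stars" href=""><div class="a-meter"><div class="a-meter-bar" style="width: 69.1358024691358%;"></div></div></a>
</td>
<td class="a-nowrap">
<span class="a-letter-space"></span><span>13</span>
</td>
</tr>
<td class="a-nowrap">
<a class="a-link-normal" title="2% of reviews have 1 stars" href="">1 star</a><span class="a-letter-space"></span>
</td>
<td class="a-span10">
<a class="a-link-normal" title="2% of reviews have 1 stars" href=""><div class="a-meter"><div class="a-meter-bar" style="width: 2.46913580246914%;"></div></div></a>
</td>
<td class="a-nowrap">
<span class="a-letter-space"></span><span>2</span>
</td>
</table>
"""

# The two sub-patterns (illustrative names): the star label cell and the count cell.
LABEL = r'<td class="a-nowrap">\s*<a class="a-link-normal" [^>]*>\s*(\d star)'
COUNT = r'<td class="a-nowrap">\s*<span class="a-letter-space"></span><span>(\d*)</span>\s*</td>'

# re.DOTALL is the "s" flag: it lets "." match newlines too.
greedy = re.compile(LABEL + r'.*' + COUNT, re.DOTALL)
lazy = re.compile(LABEL + r'.*?' + COUNT, re.DOTALL)

greedy_match = greedy.search(html).groups()
lazy_pairs = lazy.findall(html)

print(greedy_match)  # ('5 star', '2')  first label paired with the LAST count
print(lazy_pairs)    # [('5 star', '13'), ('1 star', '2')]
```

If the table only ever contains one row, the greedy and lazy forms give the same result.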
EDIT:
I have made a shorter regex:
REGEX:
>(\d star)<.+?>(\d+?)<
Used on pythonregex.com with the edited input you provided, it gives:
OUTPUT:
>>> regex.findall(string)
[(u'5 star', u'13'), (u'1 star', u'2')]
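For reference, here is a rough way to reproduce that result locally instead of on the website. It assumes Python 3 (so the strings print without the u prefix) and assumes the dotall flag is still enabled, since .+? has to cross line breaks between the label and the count. The names short, pairs and counts are only illustrative.

```python
import re

# Same combined table as in the question.
html = """<table id="histogramTable" class="a-normal a-align-middle a-spacing-base">
<tr class="a-histogram-row">
<td class="a-nowrap">
<a class="a-link-normal" title="69% of reviews have 5 stars" href="">5 star</a><span class="a-letter-space"></span>
</td>
<td class="a-span10">
<a class="a-link-normal" title="69% of reviews have 5 stars" href=""><div class="a-meter"><div class="a-meter-bar" style="width: 69.1358024691358%;"></div></div></a>
</td>
<td class="a-nowrap">
<span class="a-letter-space"></span><span>13</span>
</td>
</tr>
<td class="a-nowrap">
<a class="a-link-normal" title="2% of reviews have 1 stars" href="">1 star</a><span class="a-letter-space"></span>
</td>
<td class="a-span10">
<a class="a-link-normal" title="2% of reviews have 1 stars" href=""><div class="a-meter"><div class="a-meter-bar" style="width: 2.46913580246914%;"></div></div></a>
</td>
<td class="a-nowrap">
<span class="a-letter-space"></span><span>2</span>
</td>
</table>
"""

# Short regex from the answer; re.DOTALL lets ".+?" span multiple lines.
short = re.compile(r'>(\d star)<.+?>(\d+?)<', re.DOTALL)

# findall returns one (label, count) tuple per row.
pairs = short.findall(html)
print(pairs)  # [('5 star', '13'), ('1 star', '2')]

# Optional: turn the pairs into a label -> integer count mapping.
counts = {label: int(n) for label, n in pairs}
print(counts)  # {'5 star': 13, '1 star': 2}
```

This short pattern relies on the counts being the only places where digits sit directly between > and <, which holds for this table but may not hold for other markup.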
Upvotes: 1