Reputation: 1478
I am trying to extract the names "Harrisburg" and "Gujranwala" from the two pieces of code below:
<td><a href="/worldclock/city.html?n=97">Harrisburg</a><span id=p217s class=wds> *</span></td>
<td><a href="/worldclock/city.html?n=3551">Gujranwala</a><span id=p204s class=wds></span></td>
My current regex doesn't work. How can I fix it?
My regex:
(?<=<td><a href="\/worldclock\/city\.html\?n=[0-9]{0, 5}">).*(?=<\/a><span id=p[0-9]{0, 4}s class=wds>( \*)</span><\/td>)
The regex is for Python. Thank you.
Upvotes: 0
Views: 229
Reputation: 3454
You can't use a lookbehind unless its subexpression has a fixed length, because the regex engine needs to know how far back to start looking for a match. In this case, the [0-9]{0, 5} part means the lookbehind could match strings of different lengths. (At least this is how it works in Perl; Python's re module has the same fixed-width restriction on lookbehinds.)
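One way around the restriction is to drop the lookbehind and use a capturing group instead. The snippet below is only a sketch under a few assumptions: the quantifier is written without a space ({0,5}), the sample rows are the two from the question, and the names html, lookbehind_ok, pattern and names are made up for illustration.

```python
import re

# The two sample rows from the question.
html = (
    '<td><a href="/worldclock/city.html?n=97">Harrisburg</a>'
    '<span id=p217s class=wds> *</span></td>\n'
    '<td><a href="/worldclock/city.html?n=3551">Gujranwala</a>'
    '<span id=p204s class=wds></span></td>'
)

# A lookbehind whose width can vary is rejected when the pattern is compiled.
try:
    re.compile(r'(?<=n=[0-9]{0,5}">)[^<]+')
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

# Match the prefix normally and capture only the link text.
pattern = re.compile(r'<td><a href="/worldclock/city\.html\?n=[0-9]{1,5}">([^<]+)</a>')
names = pattern.findall(html)
print(lookbehind_ok, names)
```

With a single group in the pattern, findall returns just the captured text, so the surrounding markup never ends up in the result.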
Upvotes: 0
Reputation: 12316
Depending on how much your original data varies, you don't need to match the entire line, just the part around what you want to capture. The "active ingredient" is this part, which captures all non-< characters after the opening tag: >([^<]+)<
import re
InLines = """<td><a href="/worldclock/city.html?n=97">Harrisburg</a><span id=p217s class=wds> *</span></td>\n<td><a href="/worldclock/city.html?n=3551">Gujranwala</a><span id=p204s class=wds></span></td>"""
Pattern = r'city\.html\?n=\d+">([^<]+)</a><span'
M = re.findall(Pattern, InLines)
print(M)
['Harrisburg', 'Gujranwala']
Upvotes: 1
Reputation: 50177
import re
city_html = """<td><a href="/worldclock/city.html?n=97">Harrisburg</a><span id=p217s class=wds> *</span></td>
<td><a href="/worldclock/city.html?n=3551">Gujranwala</a><span id=p204s class=wds></span></td>"""
cities = re.findall(r'(?:city\.html.*?>)(.*?)(?:<)', city_html)
# cities == ['Harrisburg', 'Gujranwala']
What this regex does is look for city.html ... > and grab everything after it until the next <.
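The non-greedy .*? is what makes this stop at the first > and the first <. As a rough illustration (the variable names row, lazy and greedy are mine, and the row is the first sample from the question), compare it with a greedy version of the same pattern:

```python
import re

# First sample row from the question.
row = ('<td><a href="/worldclock/city.html?n=97">Harrisburg</a>'
       '<span id=p217s class=wds> *</span></td>')

# Non-greedy: skip to the first ">" after city.html, then capture up to the next "<".
lazy = re.findall(r'city\.html.*?>(.*?)<', row)

# Greedy: ".*" runs as far right as it can, so the capture is no longer the city name.
greedy = re.findall(r'city\.html.*>(.*)<', row)

print(lazy)    # ['Harrisburg']
print(greedy)  # not the city name
```

The greedy version still matches, just in the wrong place, which is why it can be hard to spot without printing the result.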
Upvotes: 1