Reputation: 1478

Fix regex to extract city names from HTML

I am trying to extract the names "Harrisburg" and "Gujranwala" from the two pieces of HTML below:

<td><a href="/worldclock/city.html?n=97">Harrisburg</a><span id=p217s class=wds> *</span></td>
<td><a href="/worldclock/city.html?n=3551">Gujranwala</a><span id=p204s class=wds></span></td>

The regex I have so far doesn't work. How can I fix it?

My regex:

(?<=<td><a href="\/worldclock\/city\.html\?n=[0-9]{0, 5}">).*(?=<\/a><span id=p[0-9]{0, 4}s class=wds>( \*)</span><\/td>)

The regex is for Python. Thank you.

Upvotes: 0

Answers (4)

David Knipe

Reputation: 3454

You can't use a lookbehind unless its subexpression has a fixed length. This is because the regex engine needs to know how far back to start looking for a match. In this case, the [0-9]{0, 5} part is meant to match between zero and five digits, so the lookbehind could match strings of different lengths. (This is how it works in Perl; Python's re module has the same fixed-width restriction.)
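As a rough illustration of the point above, the sketch below shows Python's re module rejecting a variable-width lookbehind, and one possible workaround that swaps the lookarounds for a capture group. The variable names are hypothetical, and the quantifiers were changed to {1,5} and {1,4} and the " *" made optional on the assumption that this is what the question intended. It also checks a side detail of Python's re: a quantifier written with a space, such as {0, 5}, is not read as a quantifier but matched as literal text.

```python
import re

html = (
    '<td><a href="/worldclock/city.html?n=97">Harrisburg</a>'
    '<span id=p217s class=wds> *</span></td>\n'
    '<td><a href="/worldclock/city.html?n=3551">Gujranwala</a>'
    '<span id=p204s class=wds></span></td>'
)

# A lookbehind whose width can vary is rejected when the pattern is compiled.
try:
    re.compile(r'(?<=city\.html\?n=[0-9]{0,5}">).*')
    lookbehind_error = None
except re.error as exc:
    lookbehind_error = str(exc)

# With a space inside the braces, "{0, 5}" is treated as literal characters.
literal_braces = re.fullmatch(r'[0-9]{0, 5}', '7{0, 5}') is not None

# Workaround (an assumption about intent, not the asker's pattern):
# match the surrounding markup normally and capture only the link text.
pattern = re.compile(
    r'<td><a href="/worldclock/city\.html\?n=[0-9]{1,5}">'
    r'([^<]*)'
    r'</a><span id=p[0-9]{1,4}s class=wds>(?: \*)?</span></td>'
)
cities = pattern.findall(html)

print(lookbehind_error)
print(literal_braces)
print(cities)
```

Because the pattern has exactly one capturing group, re.findall returns just the captured city names rather than the whole matched cell.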

Upvotes: 0

beroe

Reputation: 12316

Depending on how much your original data varies, you don't need to match the entire line, just the part around what you want to capture. The "active ingredient" is this part, which captures all non-< characters after the opening tag: >([^<]+)<

import re

# Two sample table cells, one per line
InLines = """<td><a href="/worldclock/city.html?n=97">Harrisburg</a><span id=p217s class=wds> *</span></td>\n<td><a href="/worldclock/city.html?n=3551">Gujranwala</a><span id=p204s class=wds></span></td>"""
# Capture the link text: every non-< character between the opening tag and </a>
Pattern = r'city\.html\?n=\d+">([^<]+)</a><span'
M = re.findall(Pattern, InLines)
print(M)
# Output: ['Harrisburg', 'Gujranwala']

Upvotes: 1

mVChr

Reputation: 50177

import re

city_html = """<td><a href="/worldclock/city.html?n=97">Harrisburg</a><span id=p217s class=wds> *</span></td>
               <td><a href="/worldclock/city.html?n=3551">Gujranwala</a><span id=p204s class=wds></span></td>"""

cities = re.findall(r'(?:city\.html.*?>)(.*?)(?:<)', city_html)
# cities == ['Harrisburg', 'Gujranwala']

What this RegEx is doing is looking for city.html ... > and grabbing everything after it until the next <.

Upvotes: 1

Emil Davtyan

Reputation: 14089

Try this regex:

([^>]*)<\s*/a\s*>
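A minimal sketch of how this pattern might be applied to the sample rows from the question, assuming re.findall is the intended way to use it (the variable names here are made up for illustration). It captures whatever text sits directly before a closing </a> tag, tolerating optional whitespace inside that tag.

```python
import re

city_html = (
    '<td><a href="/worldclock/city.html?n=97">Harrisburg</a>'
    '<span id=p217s class=wds> *</span></td>\n'
    '<td><a href="/worldclock/city.html?n=3551">Gujranwala</a>'
    '<span id=p204s class=wds></span></td>'
)

# Group 1 is the run of non-'>' characters immediately before a closing </a>.
anchor_text = re.compile(r'([^>]*)<\s*/a\s*>')

cities = anchor_text.findall(city_html)
print(cities)

# The pattern is not specific to city links: any anchor's text is captured.
other = anchor_text.findall('<a href="/x">Home</a>')
print(other)
```

Since the pattern does not look at the href at all, it would also pick up the text of unrelated links on the same page, so it fits best when the input has already been narrowed down to the city rows.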

Upvotes: 0
