Reputation: 625
The pattern is as follows:
page_pattern = 'manual-data-link" href="(.*?)"'
The matching function is as follows, where pattern is one of the predefined patterns, like page_pattern above:
def get_pattern(pattern, string, group_num=1):
    escaped_pattern = re.escape(pattern)
    match = re.match(re.compile(escaped_pattern), string)
    if match:
        return match.group(group_num)
    else:
        return None
The problem is that match is always None, even though I made sure it works correctly with http://pythex.org/. I suspect I'm not compiling/escaping the pattern correctly.
Test string:
<a class="rarity-5 set-102 manual-data-link" href="/data/123421" data-id="20886" data-type-id="295636317" >Data</a>
Upvotes: 0
Views: 97
Reputation: 168626
You have three problems.
1) You shouldn't call re.escape in this case. re.escape prevents special characters (like ., *, or ?) from having their special meanings. You want them to have special meanings here.
2) You should use re.search, not re.match. re.match matches only from the beginning of the string; you want to find a match anywhere inside the string.
3) You shouldn't parse HTML with regular expressions. Use a tool designed for the job, like BeautifulSoup.
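Points 1 and 2 can be sketched as follows. This is a minimal rewrite of the question's get_pattern, not a definitive implementation; the name test_string is an assumption introduced here for the question's sample input.

```python
import re

page_pattern = 'manual-data-link" href="(.*?)"'

def get_pattern(pattern, string, group_num=1):
    """Return the requested group of the first match anywhere in string, or None."""
    # No re.escape: the pattern's metacharacters must keep their special meaning.
    # re.search scans the whole string; re.match only tries the very start.
    match = re.search(pattern, string)
    return match.group(group_num) if match else None

# test_string is a name assumed here for the sample input from the question.
test_string = ('<a class="rarity-5 set-102 manual-data-link" href="/data/123421" '
               'data-id="20886" data-type-id="295636317" >Data</a>')

# Escaping makes "(.*?)" literal text, which never occurs in the input.
print(re.search(re.escape(page_pattern), test_string))  # None
# The input starts with "<a", so matching at position 0 fails.
print(re.match(page_pattern, test_string))              # None
print(get_pattern(page_pattern, test_string))           # /data/123421
```

Passing the pattern string straight to re.search is enough here, since the re module caches compiled patterns internally.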
Upvotes: 4
Reputation: 174706
re.match tries to match from the beginning of the string. Since the text you're trying to match is in the middle of the string, you need to use re.search instead of re.match:
>>> import re
>>> s = '<a class="rarity-5 set-102 manual-data-link" href="/data/123421" data-id="20886" data-type-id="295636317" >Data</a>'
>>> re.search(r'manual-data-link" href="(.*?)"', s).group(1)
'/data/123421'
Use an HTML parser like BeautifulSoup to parse HTML files:
>>> from bs4 import BeautifulSoup
>>> s = '<a class="rarity-5 set-102 manual-data-link" href="/data/123421" data-id="20886" data-type-id="295636317" >Data</a>'
>>> soup = BeautifulSoup(s, 'html.parser')
>>> for i in soup.find_all('a', class_=re.compile('.*manual-data-link')):
...     print(i['href'])
...
/data/123421
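If installing BeautifulSoup is not an option, roughly the same extraction can be sketched with the standard library's html.parser module. This swaps in a different parser from the one used above, and the LinkCollector class is a hypothetical helper written for this example, not part of any library.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Hypothetical helper: collect href values of <a> tags that carry a given class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attr_map = dict(attrs)
        # The class attribute is a space-separated list of class names.
        classes = (attr_map.get('class') or '').split()
        if self.wanted_class in classes and attr_map.get('href'):
            self.hrefs.append(attr_map['href'])

s = ('<a class="rarity-5 set-102 manual-data-link" href="/data/123421" '
     'data-id="20886" data-type-id="295636317" >Data</a>')
collector = LinkCollector('manual-data-link')
collector.feed(s)
print(collector.hrefs)  # ['/data/123421']
```

Unlike the regular expression, this does not depend on the class attribute coming immediately before href or on the exact spacing between attributes.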
Upvotes: 3