geft

Reputation: 625

Regex matching specific HTML string with Python

The pattern is as follows:

page_pattern = 'manual-data-link" href="(.*?)"'

The matching function is as follows, where pattern is one of the predefined patterns, such as page_pattern above:

def get_pattern(pattern, string, group_num=1):
    escaped_pattern = re.escape(pattern)
    match = re.match(re.compile(escaped_pattern), string)

    if match:
        return match.group(group_num)
    else:
        return None

The problem is that match is always None, even though I made sure it works correctly with http://pythex.org/. I suspect I'm not compiling/escaping the pattern correctly.

Test string

<a class="rarity-5 set-102 manual-data-link" href="/data/123421" data-id="20886" data-type-id="295636317" >Data</a>

Upvotes: 0

Views: 97

Answers (2)

Robᵩ

Reputation: 168626

You have three problems.

1) You shouldn't call re.escape in this case. re.escape prevents special characters (like ., *, or ?) from having their special meanings. You want them to have special meanings here.

2) You should use re.search, not re.match. re.match only matches at the beginning of the string; you want to find a match anywhere inside the string.

3) You shouldn't parse HTML with regular expressions. Use a tool designed for the job, like BeautifulSoup.
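Points 1 and 2 can be sketched as a corrected version of the question's get_pattern. This is a minimal sketch rather than a definitive fix: it keeps the question's function name, signature, and page_pattern, and the name test_string is only a label chosen here for the question's sample HTML.

```python
import re

# Pattern from the question, written as a raw string.
page_pattern = r'manual-data-link" href="(.*?)"'


def get_pattern(pattern, string, group_num=1):
    """Return the requested capture group, or None if nothing matches."""
    # No re.escape here: the metacharacters in the pattern must keep
    # their special meaning. re.search scans the whole string, whereas
    # re.match only tries to match at position 0.
    match = re.search(pattern, string)
    return match.group(group_num) if match else None


# Sample HTML from the question (the variable name is an assumption).
test_string = (
    '<a class="rarity-5 set-102 manual-data-link" href="/data/123421" '
    'data-id="20886" data-type-id="295636317" >Data</a>'
)

print(get_pattern(page_pattern, test_string))  # /data/123421

# With re.escape, "(.*?)" is treated as literal text, so even re.search
# finds nothing in the sample string.
print(re.search(re.escape(page_pattern), test_string))  # None
```

Passing the pattern string straight to re.search is enough; the re module caches recently compiled patterns, so the explicit re.compile call is optional here.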

Upvotes: 4

Avinash Raj

Reputation: 174706

re.match tries to match from the beginning of the string. Since the text you're trying to match is in the middle of the string, you need to use re.search instead of re.match.

>>> import re
>>> s = '<a class="rarity-5 set-102 manual-data-link" href="/data/123421" data-id="20886" data-type-id="295636317" >Data</a>'
>>> re.search(r'manual-data-link" href="(.*?)"', s).group(1)
'/data/123421'

Better still, use an HTML parser like BeautifulSoup to parse HTML.

>>> from bs4 import BeautifulSoup
>>> s = '<a class="rarity-5 set-102 manual-data-link" href="/data/123421" data-id="20886" data-type-id="295636317" >Data</a>'
>>> soup = BeautifulSoup(s, 'html.parser')
>>> for i in soup.find_all('a', class_='manual-data-link'):
...     print(i['href'])
...
/data/123421
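If installing BeautifulSoup is not an option, roughly the same lookup can be sketched with the standard library's html.parser module. This swaps in a different tool from the one shown above, and the class name LinkCollector and its wanted_class parameter are hypothetical names chosen for this illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that carry a given CSS class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # The class attribute is a space-separated list of class names.
        classes = (attr_map.get("class") or "").split()
        if self.wanted_class in classes and attr_map.get("href"):
            self.hrefs.append(attr_map["href"])


s = (
    '<a class="rarity-5 set-102 manual-data-link" href="/data/123421" '
    'data-id="20886" data-type-id="295636317" >Data</a>'
)

collector = LinkCollector("manual-data-link")
collector.feed(s)
print(collector.hrefs)  # ['/data/123421']
```

Unlike the regular expression, this does not depend on the class attribute ending with manual-data-link or on href coming directly after it.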

Upvotes: 3
