goldisfine

Reputation: 4850

re.match does not restrict URLs

I would like to get only those school URLs in the table on this wiki page that lead to a page with information. The bad URLs are colored red and contain the phrase 'page does not exist' inside the 'title' attribute. I am trying to use re.match() to filter the URLs so that I return only those that do not contain that string. Why isn't re.match() working?

URL:

districts_page = 'https://en.wikipedia.org/wiki/List_of_school_districts_in_Alabama'

FUNCTION:

def url_check(url):

    all_urls = []

    r = requests.get(url, proxies = proxies)
    html_source = r.text
    soup = BeautifulSoup(html_source)

    for link in soup.find_all('a'):
        if type(link.get('title')) == str:
            if re.match(link.get('title'), '(page does not exist)') == None: 
                all_urls.append(link.get('href'))
            else: pass

    return 

Upvotes: 0

Views: 115

Answers (3)

caffreyd

Reputation: 1203

This does not address fixing the problem with re.match, but may be a valid approach for you without using regex:

    for link in soup.find_all('a'):
        title = link.get('title')
        # Skip links with no title, and links whose title marks a missing page.
        if title and 'page does not exist' not in title:
            all_urls.append(link.get('href'))

Upvotes: 2

unutbu

Reputation: 879759

The order of the arguments to re.match should be the pattern then the string. So try:

    if not re.search(r'(page does not exist)', link.get('title')): 

(I've also changed re.match to re.search since -- as @goldisfine observed -- the pattern does not occur at the beginning of the string.)
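To see why the argument order matters, here is a minimal sketch using two made-up title strings (both are assumptions for illustration, not values taken from the wiki page). With the arguments swapped, the title is treated as the pattern, so the check returns None for both titles and nothing gets filtered out:

```python
import re

# Hypothetical titles, for illustration only.
missing_title = "Foo City Schools (page does not exist)"
existing_title = "Bar County Schools"

# Swapped order: the title is used as the pattern and the phrase as the text.
swapped_missing = re.search(missing_title, '(page does not exist)')
swapped_existing = re.search(existing_title, '(page does not exist)')

# Correct order: pattern first, then the string to search in.
fixed_missing = re.search(r'page does not exist', missing_title)
fixed_existing = re.search(r'page does not exist', existing_title)

print(swapped_missing, swapped_existing)   # both None, so no link is excluded
print(fixed_missing is not None)           # the missing page is detected
print(fixed_existing is None)              # the existing page passes the filter
```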


Using @kindall's observation, your code could also be simplified to

for link in soup.find_all('a', 
        title=lambda x: x is not None and 'page does not exist' not in x):
    all_urls.append(link.get('href'))

This eliminates the two if-statements. It can all be incorporated into the call to soup.find_all.

Upvotes: 0

goldisfine

Reputation: 4850

unutbu's answer fixes the argument order. But simply using re.match() is not enough: re.match() only matches at the beginning of the string, whereas re.search() scans the entire string and returns the first place where the pattern matches.

The following code works:

for link in soup.find_all('a'):
    if isinstance(link.get('title'), str):
        if re.search('page does not exist', link.get('title')) is None:
            all_urls.append(link.get('href'))
return all_urls
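The difference between re.match() and re.search() can be checked in isolation. This is a small sketch with a made-up title string (an assumption, not taken from the page). It also shows that unescaped parentheses in a pattern form a group rather than matching literal parentheses, so re.escape() is one way to match them literally:

```python
import re

# Hypothetical title, for illustration only.
title = "Foo City Schools (page does not exist)"

# re.match only tries to match at position 0, where the title starts with "Foo".
at_start = re.match(r'page does not exist', title)

# re.search scans the whole string for the first match.
anywhere = re.search(r'page does not exist', title)

# re.escape makes the parentheses literal instead of a capturing group.
literal = re.search(re.escape('(page does not exist)'), title)

print(at_start is None)        # True
print(anywhere is not None)    # True
print(literal is not None)     # True
```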

Upvotes: 0
