Reputation: 4850
I would like to get only those school URLs in the table on this wiki page that lead to a page with information. The bad URLs are colored red and contain the phrase 'page does not exist' inside the 'title' attribute. I am trying to use re.match() to filter the URLs so that I only return those which do not contain the aforementioned string. Why isn't re.match() working?
URL:
districts_page = 'https://en.wikipedia.org/wiki/List_of_school_districts_in_Alabama'
FUNCTION:
def url_check(url):
    all_urls = []
    r = requests.get(url, proxies = proxies)
    html_source = r.text
    soup = BeautifulSoup(html_source)
    for link in soup.find_all('a'):
        if type(link.get('title')) == str:
            if re.match(link.get('title'), '(page does not exist)') == None:
                all_urls.append(link.get('href'))
            else: pass
    return
Upvotes: 0
Views: 115
Reputation: 1203
This does not address the problem with re.match, but may be a valid approach for you without using regex:
for link in soup.find_all('a'):
    title = link.get('title')
    if title:
        if 'page does not exist' not in title:
            all_urls.append(link.get('href'))
Upvotes: 2
Reputation: 879759
The order of the arguments to re.match should be the pattern, then the string. So try:
if not re.search(r'(page does not exist)', link.get('title')):
(I've also changed re.match to re.search since -- as @goldisfine observed -- the pattern does not occur at the beginning of the string.)
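To see why the argument order matters, here is a minimal sketch. The `title` value is a made-up string shaped like the ones on red links, not one taken from the actual page:

```python
import re

# Hypothetical title text, shaped like the titles on red links.
title = "Example City Schools (page does not exist)"

# Arguments swapped: the title is used as the pattern and the literal
# phrase as the string to examine, so nothing matches.
swapped = re.match(title, '(page does not exist)')

# Pattern first, string second; re.search scans the whole string.
found = re.search(r'page does not exist', title)

print(swapped)
print(found.group(0))
```

With the swapped call returning None for such titles, the `== None` test in the question passes for every link, which is why nothing was filtered out.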
Using @kindall's observation, your code could also be simplified to
for link in soup.find_all('a',
        title=lambda x: x is not None and 'page does not exist' not in x):
    all_urls.append(link.get('href'))
This eliminates the two if-statements. It can all be incorporated into the call to soup.find_all.
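As a rough, self-contained sketch of that filter, the snippet below runs it against a small made-up HTML fragment instead of the live Wikipedia page. The markup, the hrefs and the helper name `has_live_title` are all assumptions for illustration, and it assumes the `bs4` package is installed:

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the table on the wiki page.
html = '''
<a href="/wiki/Real_District" title="Real District">Real</a>
<a href="/w/index.php?title=Missing" title="Missing (page does not exist)">Missing</a>
<a href="/wiki/No_title">No title</a>
'''

soup = BeautifulSoup(html, 'html.parser')

def has_live_title(title):
    # title is None when the <a> tag has no title attribute at all.
    return title is not None and 'page does not exist' not in title

# Keep only links whose title passes the filter.
all_urls = [link.get('href')
            for link in soup.find_all('a', title=has_live_title)]
print(all_urls)
```

A named function is used here instead of the lambda only to leave room for the comment; the behavior is intended to be the same.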
Upvotes: 0
Reputation: 4850
Unutbu's answer addresses the argument-order error. But simply using re.match() is not enough: re.match() only matches at the beginning of the string, whereas re.search() scans through the entire string until it finds a section that matches the pattern.
The following code works:
for link in soup.find_all('a'):
    if type(link.get('title')) == str:
        if re.search('page does not exist', link.get('title')) is None:
            all_urls.append(link.get('href'))
return all_urls
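The difference between re.match and re.search can be checked on a sample string. This is only a sketch, and the `title` value is invented for the example:

```python
import re

# Hypothetical title; the phrase sits at the end, not the start.
title = "Example City Schools (page does not exist)"

# re.match only tries to match at position 0 of the string.
at_start = re.match('page does not exist', title)

# re.search tries every position until the pattern matches.
anywhere = re.search('page does not exist', title)

# re.match does succeed when the pattern begins the string.
prefix = re.match('Example', title)

print(at_start is None, anywhere is not None, prefix is not None)
```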
Upvotes: 0