Reputation: 37
I am trying to find a URL containing '.ics' in an href. I tested this code the other day and it was working perfectly, but now when I loop with 'for link in links', 'print link' only outputs:
<a class="element-invisible element-focusable" href="#main-content"
tabindex="1">Skip to main content</a>
<a class="element-invisible element-focusable" href="#main-content">Skip to
main content</a>
Because of this, the 'if link.get('href')' condition is never satisfied and the URL is never returned. What is causing this, and is there another way to return the URL containing '.ics'?
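import bs4
import requests

def get_ics_url():  # enclosing function implied by the return below; the name is assumed
    page = requests.get('https://registrar.fas.harvard.edu/calendar').content
    soup = bs4.BeautifulSoup(page, 'lxml')
    links = soup.find_all('a')
    #print links
    for link in links:
        print link
        # keep only anchors whose href contains '.ics'
        if link.get('href') is not None and '.ics' in link.get('href'):
            endout = link.get('href')
            # rewrite a webcal:// URL as https://
            if endout[:6] == 'webcal':
                endout = 'https' + endout[6:]
            print
            print 'URL: ' + endout
            print
            return endout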
Upvotes: 0
Views: 83
Reputation: 402263
I would recommend streamlining your search by passing find_all an href
attribute filter with a regex pattern:
import re
links = soup.find_all('a', {'href': re.compile(r'\.ics')})
Output:
[<a class="subscribe" href="https://registrar.fas.harvard.edu/calendar/upcoming/all/export.ics">subscribe</a>,
<a class="ical" href="https://registrar.fas.harvard.edu/calendar/upcoming/all/export.ics">iCal</a>]
You won't have to jump through hoops to validate your anchor tags now.
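For reference, here is a minimal sketch of how the regex filter could be combined with the webcal-to-https rewrite from the question. The function name find_ics_url and the inline SAMPLE_HTML are made up for illustration (including the webcal href, which is an assumption so the rewrite has something to act on), and 'html.parser' is used only so the sketch does not depend on lxml; swap in the real page content and parser as needed.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, so the sketch runs offline.
SAMPLE_HTML = """
<a class="element-invisible element-focusable" href="#main-content">Skip to main content</a>
<a class="subscribe" href="webcal://registrar.fas.harvard.edu/calendar/upcoming/all/export.ics">subscribe</a>
<a class="ical" href="https://registrar.fas.harvard.edu/calendar/upcoming/all/export.ics">iCal</a>
"""


def find_ics_url(html):
    """Return the first href containing '.ics' (webcal rewritten to https), or None."""
    soup = BeautifulSoup(html, 'html.parser')
    # The regex is applied with search(), so anchors without an href
    # or without '.ics' in it are skipped automatically.
    link = soup.find('a', href=re.compile(r'\.ics'))
    if link is None:
        return None
    url = link['href']
    if url.startswith('webcal'):
        url = 'https' + url[len('webcal'):]
    return url


print(find_ics_url(SAMPLE_HTML))
```

Returning None when nothing matches lets the caller tell "no calendar link on the page" apart from a real URL, instead of silently falling out of a loop.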
Upvotes: 3