Reputation: 1795
Here's a snippet of code which I am trying to use to retrieve all the links from a website, given the URL of its homepage.
import requests
from BeautifulSoup import BeautifulSoup

url = "https://www.udacity.com"
response = requests.get(url)
page = str(BeautifulSoup(response.content))

def getURL(page):
    start_link = page.find("a href")
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1: end_quote]
    return url, end_quote

while True:
    url, n = getURL(page)
    page = page[n:]
    if url:
        print url
    else:
        break
The result is:
/uconnect
#
/
/
/
/nanodegree
/courses/all
#
/legal/tos
/nanodegree
/courses/all
/nanodegree
uconnect
/
/course/machine-learning-engineer-nanodegree--nd009
/course/data-analyst-nanodegree--nd002
/course/ios-developer-nanodegree--nd003
/course/full-stack-web-developer-nanodegree--nd004
/course/senior-web-developer-nanodegree--nd802
/course/front-end-web-developer-nanodegree--nd001
/course/tech-entrepreneur-nanodegree--nd007
http://blog.udacity.com
http://support.udacity.com
/courses/all
/veterans
https://play.google.com/store/apps/details?id=com.udacity.android
https://itunes.apple.com/us/app/id819700933?mt=8
/us
/press
/jobs
/georgia-tech
/business
/employers
/success
#
/contact
/catalog-api
/legal
http://status.udacity.com
/sitemap/guides
/sitemap
https://twitter.com/udacity
https://www.facebook.com/Udacity
https://plus.google.com/+Udacity/posts
https://www.linkedin.com/company/udacity
Process finished with exit code 0
I want to get the URL of only the "about us" page of a website, which differs from site to site, e.g.:
for Udacity it is https://www.udacity.com/us
for artscape-inc it is https://www.artscape-inc.com/about-decorative-window-film/
I mean, I could try searching for keywords like "about" in the URLs, but as noted, that approach would have missed Udacity. Could anyone suggest a good approach?
Upvotes: 0
Views: 320
Reputation: 474141
It would not be easy to cover every possible variation of an "About us" page link, but here is an initial idea that would work in both cases you've shown - check for "about" inside both the href attribute and the text of the a elements:
def about_links(elm):
    # use elm.get("href", "") so <a> tags without an href don't raise KeyError
    return elm.name == "a" and ("about" in elm.get("href", "").lower() or
                                "about" in elm.get_text().lower())
Usage:
soup.find_all(about_links) # or soup.find(about_links)
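Here is a minimal, self-contained sketch of that filter in action; the HTML snippet is invented for illustration, but the two links mirror your Udacity and artscape-inc cases (text match vs. href match):

```python
from bs4 import BeautifulSoup

def about_links(elm):
    # Match <a> tags whose href or link text mentions "about"
    return elm.name == "a" and ("about" in elm.get("href", "").lower() or
                                "about" in elm.get_text().lower())

html = """
<html><body>
  <a href="/courses/all">Courses</a>
  <a href="/us">About Us</a>
  <a href="/about-decorative-window-film/">Our story</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all(about_links):
    print(link["href"])
```

This prints /us (matched by the link text) and /about-decorative-window-film/ (matched by the href).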
What you can also do to decrease the number of false positives is to check the "footer" part of the page only - e.g. find a footer element, or an element with id="footer" or a footer class.
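A sketch of that footer-restricted search, assuming BeautifulSoup 4 (the HTML is again invented for illustration):

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <a href="/random-about-page">About</a>
  <div id="footer">
    <a href="/contact">Contact</a>
    <a href="/about-us">About Us</a>
  </div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Try a <footer> element first, then fall back to id="footer" or class="footer"
footer = (soup.find("footer")
          or soup.find(id="footer")
          or soup.find(class_="footer"))

if footer is not None:
    for link in footer.find_all("a"):
        if ("about" in link.get("href", "").lower() or
                "about" in link.get_text().lower()):
            print(link["href"])
```

Here only /about-us is printed - the "about" link outside the footer is ignored, which is the point of restricting the search.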
Another idea, to sort of "outsource" the "about us" page definition, would be to google (from your script, of course) "about" + the webpage URL and grab the first search result.
As a side note, I've noticed you are still using BeautifulSoup version 3 - it is no longer developed or maintained, and you should switch to BeautifulSoup 4 as soon as possible. Install it via:

pip install --upgrade beautifulsoup4

And change your import to:

from bs4 import BeautifulSoup
Upvotes: 1