Reputation: 1795
Here's a snippet of code which I am trying to use to retrieve all the links from a website, given the URL of its homepage.
import requests
from BeautifulSoup import BeautifulSoup

url = "https://www.udacity.com"
response = requests.get(url)
page = str(BeautifulSoup(response.content))

def getURL(page):
    start_link = page.find("a href")
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1: end_quote]
    return url, end_quote

while True:
    url, n = getURL(page)
    page = page[n:]
    if url:
        print url
    else:
        break
The result is:
/uconnect
#
/
/
/
/nanodegree
/courses/all
#
/legal/tos
/nanodegree
/courses/all
/nanodegree
uconnect
/
/course/machine-learning-engineer-nanodegree--nd009
/course/data-analyst-nanodegree--nd002
/course/ios-developer-nanodegree--nd003
/course/full-stack-web-developer-nanodegree--nd004
/course/senior-web-developer-nanodegree--nd802
/course/front-end-web-developer-nanodegree--nd001
/course/tech-entrepreneur-nanodegree--nd007
http://blog.udacity.com
http://support.udacity.com
/courses/all
/veterans
https://play.google.com/store/apps/details?id=com.udacity.android
https://itunes.apple.com/us/app/id819700933?mt=8
/us
/press
/jobs
/georgia-tech
/business
/employers
/success
#
/contact
/catalog-api
/legal
http://status.udacity.com
/sitemap/guides
/sitemap
https://twitter.com/udacity
https://www.facebook.com/Udacity
https://plus.google.com/+Udacity/posts
https://www.linkedin.com/company/udacity
Process finished with exit code 0
I want to get the URL of only the "about us" page of a website, which differs from site to site, e.g.:
for Udacity it is https://www.udacity.com/us
for artscape-inc it is https://www.artscape-inc.com/about-decorative-window-film/
I mean, I could try searching for keywords like "about" in the URLs, but as noted, that approach would have missed Udacity. Could anyone suggest a good approach?
Upvotes: 0
Views: 320
Reputation: 474141
It would not be easy to cover every possible variation of an "About us" page link, but here is an initial idea that would work in both cases you've shown - check for "about" inside both the href attribute and the text of the a elements:
def about_links(elm):
    # use elm.get("href", "") so <a> tags without an href don't raise KeyError
    return elm.name == "a" and ("about" in elm.get("href", "").lower() or
                                "about" in elm.get_text().lower())
Usage:
soup.find_all(about_links) # or soup.find(about_links)
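Here is a minimal, self-contained sketch of that filter in action; the HTML snippet is invented for illustration, but the two links mirror your Udacity and artscape-inc cases (text match vs. href match):

```python
from bs4 import BeautifulSoup

def about_links(elm):
    # Match <a> tags whose href or link text mentions "about"
    return elm.name == "a" and ("about" in elm.get("href", "").lower() or
                                "about" in elm.get_text().lower())

html = """
<html><body>
  <a href="/courses/all">Courses</a>
  <a href="/us">About Us</a>
  <a href="/about-decorative-window-film/">Our story</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all(about_links):
    print(link["href"])
```

This prints /us (matched by the link text) and /about-decorative-window-film/ (matched by the href).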
What you can also do to decrease the number of false positives is to check the "footer" part of the page only - e.g. find a footer element, or an element with id="footer" or a footer class.
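A sketch of that footer-restricted search, assuming BeautifulSoup 4 (the HTML is again invented for illustration):

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <a href="/random-about-page">About</a>
  <div id="footer">
    <a href="/contact">Contact</a>
    <a href="/about-us">About Us</a>
  </div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Try a <footer> element first, then fall back to id="footer" or class="footer"
footer = (soup.find("footer")
          or soup.find(id="footer")
          or soup.find(class_="footer"))

if footer is not None:
    for link in footer.find_all("a"):
        if ("about" in link.get("href", "").lower() or
                "about" in link.get_text().lower()):
            print(link["href"])
```

Here only /about-us is printed - the "about" link outside the footer is ignored, which is the point of restricting the search.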
Another idea, to sort of "outsource" the "about us" page definition, would be to google (from your script, of course) "about" + the webpage URL and grab the first search result.
As a side note, I've noticed you are still using BeautifulSoup version 3 - it is no longer developed or maintained, and you should switch to BeautifulSoup 4 as soon as possible. Install it via:

pip install --upgrade beautifulsoup4

And change your import to:

from bs4 import BeautifulSoup
Upvotes: 1