Finding unique web links using Python

Question

I am writing a program to extract unique web links from www.stevens.edu (it is an assignment), but there is one problem. My program works and extracts links for every site except www.stevens.edu, for which I get 'none' as output. I am very frustrated with this and need help. I am using this URL for testing: http://www.stevens.edu/

import urllib
from bs4 import BeautifulSoup as bs

url = raw_input('enter - ')

html = urllib.urlopen(url).read()

soup = bs(html)

tags = soup('a')

for tag in tags:
    print tag.get('href',None)

Please guide me here and let me know why it is not working with www.stevens.edu.

falsetru · Accepted Answer

The site checks the User-Agent header and returns different HTML based on it.
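To see why the header matters, it can help to look at what Python sends when no User-Agent is set. The following is a minimal sketch for Python 3 (the code in this thread is Python 2, so the module names differ); the variable names are only illustrative.

```python
import urllib.request

# A freshly built opener carries the headers urllib adds to every request.
opener = urllib.request.build_opener()

# addheaders is a list of (name, value) tuples; turn it into a dict for lookup.
default_headers = dict(opener.addheaders)

# The value identifies the client as Python's urllib rather than a browser.
print(default_headers['User-agent'])
```

A site that inspects this header can tell the request did not come from a browser and may answer with different markup.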

You need to set the User-Agent header to get the proper HTML:

import urllib2
from bs4 import BeautifulSoup as bs

url = raw_input('enter - ')
# Send a browser-like User-Agent so the site returns its regular HTML.
req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urllib2.urlopen(req).read()
soup = bs(html)
tags = soup('a')
for tag in tags:
    print tag.get('href', None)
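Since the assignment asks for unique links, here is one possible Python 3 sketch that sets the same User-Agent header and also drops duplicates. It is not the code from this answer: it swaps BeautifulSoup for the standard library's html.parser so it runs without third-party packages, and the names LinkCollector, unique_links and fetch are hypothetical helpers made up for this example. Resolving relative links against the page URL is an added assumption about what "unique" should mean.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, skipping duplicates."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []     # unique links, in the order first seen
        self._seen = set()  # fast membership test for duplicates

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        if not href:
            return  # anchors without an href would otherwise show up as None
        absolute = urljoin(self.base_url, href)  # resolve relative links
        if absolute not in self._seen:
            self._seen.add(absolute)
            self.links.append(absolute)


def unique_links(html, base_url):
    """Return the unique absolute links found in an HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


def fetch(url):
    """Download a page while sending a browser-like User-Agent header."""
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    with urlopen(req) as resp:
        charset = resp.headers.get_content_charset() or 'utf-8'
        return resp.read().decode(charset, errors='replace')
```

Calling unique_links(fetch(url), url) with the URL from the question would then return each link once, and a plain loop can print them.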
