Reputation: 81
I am writing a program to extract unique web links from www.stevens.edu (it is an assignment), but there is one problem. My program works and extracts links for every site except www.stevens.edu, for which the output is 'None'. I am very frustrated with this and need help. I am using this URL for testing: http://www.stevens.edu/
import urllib
from bs4 import BeautifulSoup as bs
url = raw_input('enter - ')
html = urllib.urlopen(url).read()
soup = bs(html)
tags = soup('a')
for tag in tags:
    print tag.get('href', None)
Please guide me here and let me know why it is not working with www.stevens.edu.
Upvotes: 1
Views: 115
Reputation: 369324
The site checks the User-Agent header and returns different HTML based on it.
You need to set the User-Agent header to get the proper HTML:
import urllib
import urllib2
from bs4 import BeautifulSoup as bs
url = raw_input('enter - ')
req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'}) # <--
html = urllib2.urlopen(req).read()
soup = bs(html)
tags = soup('a')
for tag in tags:
    print tag.get('href', None)
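For anyone on Python 3, where urllib2 no longer exists, the same idea can be sketched with only the standard library. This is a rough equivalent, not the code above: the names LinkCollector, build_request, extract_links and fetch_links are made up for illustration, html.parser stands in for BeautifulSoup, and the UTF-8 decoding is an assumption about the page encoding. It also drops duplicate links, since the question asks for unique ones.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags, keeping first-seen order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        # Skip anchors without an href and links already seen.
        if href is not None and href not in self.links:
            self.links.append(href)


def build_request(url, user_agent='Mozilla/5.0'):
    # Send a browser-like User-Agent instead of urllib's default one.
    return Request(url, headers={'User-Agent': user_agent})


def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


def fetch_links(url):
    # Assumes the page can be decoded as UTF-8; undecodable bytes are replaced.
    with urlopen(build_request(url)) as response:
        html = response.read().decode('utf-8', errors='replace')
    return extract_links(html)
```

Calling fetch_links('http://www.stevens.edu/') and printing each item should then list the links, provided the site still serves the full page to that User-Agent.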
Upvotes: 3