Reputation: 215
I am new to Python and I am trying to write a website scraper that collects links from subreddits, which I can then pass to another class later on for automatic download of images from Imgur.
In this code snippet, I am just trying to read the subreddit and scrape the Imgur URLs out of the post hrefs, but I get the following error:
AttributeError: 'list' object has no attribute 'timeout'
Any idea as to why this might be happening? Here is the code:
from bs4 import BeautifulSoup
from urllib2 import urlopen
import sys
from urlparse import urljoin

def get_category_links(base_url):
    url = base_url
    html = urlopen(url)
    soup = BeautifulSoup(html)
    # get the links with the class "title may-blank",
    # which is how reddit defines posts
    posts = soup('a', {'class': 'title may-blank loggedin outbound'})
    for post in posts:
        # print the post's title
        print post.contents[0]
        # print the url;
        # if the url is a relative url,
        # print the absolute url
        if post['href'][:4] == 'http':
            print post['href']
        else:
            print urljoin(url, post['href'])

get_category_links(sys.argv)
Upvotes: 1
Views: 3702
Reputation: 474191
Look at how you call the function:

get_category_links(sys.argv)

sys.argv here is a list of script arguments, where the first item is the script name itself. This means that your base_url argument value is a list, which makes urlopen fail:
>>> from urllib2 import urlopen
>>> urlopen(["I am", "a list"])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 154, in urlopen
return opener.open(url, data, timeout)
│ │ │ └ <object object at 0x105e2c120>
│ │ └ None
│ └ ['I am', 'a list']
└ <urllib2.OpenerDirector instance at 0x105edc638>
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 422, in open
req.timeout = timeout
│ └ <object object at 0x105e2c120>
└ ['I am', 'a list']
AttributeError: 'list' object has no attribute 'timeout'
You meant to get the second item from sys.argv and pass it to get_category_links:
get_category_links(sys.argv[1])
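As a side note, a small guard makes the failure friendlier when no URL is supplied at all. This is just a sketch of my own, not part of the original script, and the usage message is made up:

if __name__ == '__main__':
    # sys.argv[0] is the script name itself; the URL is sys.argv[1]
    if len(sys.argv) < 2:
        sys.exit('usage: python scraper.py <subreddit-url>')
    get_category_links(sys.argv[1])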
It is interesting, though, how cryptic and hard to understand the error is in this case. It comes from the way the "url opener" works in Python 2.7: if the url value (the first argument) is not a string, it assumes it is a Request instance and tries to set a timeout value on it:
def open(self, fullurl, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT):
    # accept a URL or a Request object
    if isinstance(fullurl, basestring):
        req = Request(fullurl, data)
    else:
        req = fullurl
        if data is not None:
            req.add_data(data)

    req.timeout = timeout  # <-- FAILS HERE
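This branch also explains why passing a Request instance works fine: it skips the string case and the timeout attribute is simply set on the object. A quick illustration (the subreddit URL here is just a placeholder):

from urllib2 import Request, urlopen

# a Request object takes the `else` branch above,
# so `req.timeout = timeout` succeeds
req = Request('http://www.reddit.com/r/pics/')
response = urlopen(req)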
Note that this behavior has not actually changed in the latest stable 3.6 either.
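You can check this yourself on Python 3, where the function lives in urllib.request; the sketch below assumes the same non-string argument as before (intermediate frames elided):

>>> from urllib.request import urlopen
>>> urlopen(["I am", "a list"])
Traceback (most recent call last):
  ...
AttributeError: 'list' object has no attribute 'timeout'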
Upvotes: 4