Reputation: 25
I am trying to obtain the data-pid and price of each posting from Craigslist using BeautifulSoup. A separate script produces the file CLallsites.txt. In this script I am trying to read each site from that text file and get the PIDs of all entries in the first 10 pages. My code is:
    from bs4 import BeautifulSoup
    from urllib2 import urlopen
    readfile = open("CLallsites.txt")
    product = "mcy"
    while 1:
        u = ""
        count = 0
        line = readfile.readline()
        commaposition = line.find(',')
        site = line[0:commaposition]
        location = line[commaposition+1:]
        site_filename = location + '.txt'
        f = open(site_filename, "a")
        while (count < 10):
            sitenow = site + "\\" + product + "\\" + str(u)
            html = urlopen(str(sitenow))
            soup = BeautifulSoup(html)
            postings = soup('p',{"class":"row"})
            for post in postings:
                y = post['data-pid']
                print y
            count = count +1
            index = count*100
            u = "index" + str(index) + ".html"
        if not line:
            break
        pass
My CLallsites.txt looks like this:
craigslist site, location (Stack Overflow does not allow posts containing Craigslist links, so I cannot show the actual text; I could try to attach the text file if that helps.)
When I run the code I get the following error:
    Traceback (most recent call last):
      File "reading.py", line 16, in <module>
        html = urlopen(str(sitenow))
      File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
        return _opener.open(url, data, timeout)
      File "/usr/lib/python2.7/urllib2.py", line 400, in open
        response = self._open(req, data)
      File "/usr/lib/python2.7/urllib2.py", line 418, in _open
        '_open', req)
      File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
        result = func(*args)
      File "/usr/lib/python2.7/urllib2.py", line 1207, in http_open
        return self.do_open(httplib.HTTPConnection, req)
      File "/usr/lib/python2.7/urllib2.py", line 1177, in do_open
        raise URLError(err)
    urllib2.URLError:
Any ideas about what I am doing wrong?
Upvotes: 0
Views: 533
Reputation: 20689
I don't know what sitenow contains, but it looks like an invalid URL. Note that URLs use forward slashes, not backslashes, so the statement should be something like sitenow = site + "/" + product + "/" + str(u).
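Here is a minimal sketch of that fix. It assumes each line of CLallsites.txt has the form site,location and that listing pages follow the index100.html pattern from the question. The helper name build_page_url and the example.org host are hypothetical, used only for illustration.

```python
def build_page_url(site, product, page_index):
    """Join site, product and page name with forward slashes.

    page_index 0 stands for the first listing page (no file name),
    mirroring the empty `u` on the first pass of the question's loop.
    """
    # Drop surrounding whitespace and any trailing slash so the
    # joined URL never contains "//" after the host.
    base = site.strip().rstrip("/")
    page = "" if page_index == 0 else "index%d.html" % (page_index * 100)
    return base + "/" + product + "/" + page


# Hypothetical input line in the assumed "site,location" format.
line = "http://example.org,somewhere\n"
# strip() also removes the trailing newline that readline() keeps,
# which would otherwise end up in the location and the file name.
site, location = line.strip().split(",", 1)

urls = [build_page_url(site, "mcy", i) for i in range(3)]
for url in urls:
    print(url)
```

Printing the URL before passing it to urlopen is a cheap way to check that it is well formed.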
Upvotes: 0