Reputation: 1813
I'm learning Python, and today's exercise is to download text from a webpage.
This code works fine:
import urllib2
from bs4 import BeautifulSoup

base_url = "http://www.pracuj.pl"
url = urllib2.urlopen(base_url+"/praca/big%20data;kw").read()
soup = BeautifulSoup(url,"html.parser")
for k in soup.find_all('a'):
    if "offer__list_item_link_name" in k['class']:
        link = base_url+k['href']
        print link
So it prints all links like this:
http://www.pracuj.pl/praca/inzynier-big-data-cloud-computing-knowledge-discovery-warszawa,oferta,4212875
http://www.pracuj.pl/praca/data-systems-administrator-krakow,oferta,4204109
http://www.pracuj.pl/praca/programista-java-sql-python-w-zespole-bigdata-krakow,oferta,4204341
http://www.pracuj.pl/praca/program-challenging-projektowanie-i-tworzenie-oprogramowania-katowice,oferta,4186995
http://www.pracuj.pl/praca/program-challenging-analizy-predyktywne-warszawa,oferta,4187512
http://www.pracuj.pl/praca/software-engineer-r-language-krakow,oferta,4239818
When I add one line to fetch the content of each link:
url2 = urllib2.urlopen(link).read()
I get an error:
Traceback (most recent call last):
File "download_page.py", line 10, in <module>
url2 = urllib2.urlopen(link).read()
NameError: name 'link' is not defined
What puzzles me is that it fails only when the line is inside the for loop. When I add the same line outside the loop, it works.
Can you point out what I'm doing wrong?
Pawel
Upvotes: 0
Views: 160
Reputation: 38
That actually does work for me. How were you formatting your code?
Mine looks vaguely like this:
for k in soup.find_all('a'):
    if "offer__list_item_link_name" in k['class']:
        link = base_url+k['href']
        #print link
        url2 = urllib2.urlopen(link).read()
        print url2
and works just fine.
Upvotes: 0
Reputation: 20571
I assume your line url2 = urllib2.urlopen(link).read() runs at a point where link has not been assigned yet. The link variable is only assigned inside the if block within the for loop, so the call will work if you move it there, right after the assignment.
for k in soup.find_all('a'):
    if "offer__list_item_link_name" in k['class']:
        link = base_url+k['href']
        url2 = urllib2.urlopen(link).read()
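To make the placement issue concrete, here is a minimal offline sketch (written for Python 3, unlike the Python 2 code in this thread). The anchors list and both function names are hypothetical stand-ins for the parsed a tags, not part of the original script. It assumes the failing line sat in the loop body but outside the if, which would explain the NameError when the first anchor does not match. It also shows that a for loop on its own does not create a new scope.

```python
# Hypothetical (css_class, href) pairs standing in for the parsed <a> tags.
anchors = [("nav", "/home"), ("offer", "/praca/a"), ("offer", "/praca/b")]


def use_inside_if(pairs):
    """Use `link` only where it has just been assigned."""
    used = []
    for css_class, href in pairs:
        if css_class == "offer":
            link = "http://www.pracuj.pl" + href
            used.append(link)
    return used


def use_outside_if(pairs):
    """Use `link` in the loop body but outside the `if`."""
    used = []
    for css_class, href in pairs:
        if css_class == "offer":
            link = "http://www.pracuj.pl" + href
        used.append(link)  # raises if no earlier pair has matched yet
    return used


print(use_inside_if(anchors))

try:
    use_outside_if(anchors)
except NameError as exc:  # UnboundLocalError is a subclass of NameError
    print("failed:", exc)

# A for loop does not create a new scope: the loop variable outlives the loop.
for n in range(3):
    pass
print(n)
```

Inside a function the failure surfaces as UnboundLocalError, and at module level as a plain NameError, but the cause is the same: the name is read before any iteration has assigned it.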
If you want to process the url outside of the for loop, save your links in a list:
links = []
for k in soup.find_all('a'):
    if "offer__list_item_link_name" in k['class']:
        link = base_url+k['href']
        links.append(link)
for link in links:
    # do stuff with link, e.g. fetch its content
    url2 = urllib2.urlopen(link).read()
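The collect-then-process pattern can also be tried without network access. The sketch below is an assumption-laden variant, not the code from this thread: it targets Python 3, swaps BeautifulSoup for the standard library's html.parser so it runs with no extra packages, and feeds it a made-up SAMPLE_HTML string in place of the downloaded page. OfferLinkCollector and the sample hrefs are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "http://www.pracuj.pl"
TARGET_CLASS = "offer__list_item_link_name"

# Hypothetical stand-in for the downloaded page, so the sketch runs offline.
SAMPLE_HTML = """
<a href="/home">Home</a>
<a class="offer__list_item_link_name" href="/praca/example-one,oferta,1">One</a>
<a class="other" href="/about">About</a>
<a class="offer__list_item_link_name extra" href="/praca/example-two,oferta,2">Two</a>
"""


class OfferLinkCollector(HTMLParser):
    """Collect absolute URLs of <a> tags that carry TARGET_CLASS."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # Anchors without a class attribute produce an empty list here.
        classes = (attr_map.get("class") or "").split()
        href = attr_map.get("href")
        if TARGET_CLASS in classes and href:
            self.links.append(urljoin(BASE_URL, href))


# Step 1: collect every matching link into a list.
collector = OfferLinkCollector()
collector.feed(SAMPLE_HTML)
links = collector.links

# Step 2: process the links after the collection loop has finished.
for link in links:
    print(link)
```

Because the second loop only iterates over links that were actually collected, it never touches an unassigned name, even if the page contains no matching anchors at all.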
Upvotes: 1