psmith

Reputation: 1813

Cannot get url using urllib2


I'm learning Python, and today's exercise is to download text from a webpage. This code works fine:

import urllib2
from bs4 import BeautifulSoup
base_url = "http://www.pracuj.pl"
url = urllib2.urlopen(base_url+"/praca/big%20data;kw").read()
soup = BeautifulSoup(url,"html.parser")

for k in soup.find_all('a'):
    if "offer__list_item_link_name" in k['class']: 
        link = base_url+k['href']
        print link

So it prints all links like this:

http://www.pracuj.pl/praca/inzynier-big-data-cloud-computing-knowledge-discovery-warszawa,oferta,4212875
http://www.pracuj.pl/praca/data-systems-administrator-krakow,oferta,4204109
http://www.pracuj.pl/praca/programista-java-sql-python-w-zespole-bigdata-krakow,oferta,4204341
http://www.pracuj.pl/praca/program-challenging-projektowanie-i-tworzenie-oprogramowania-katowice,oferta,4186995
http://www.pracuj.pl/praca/program-challenging-analizy-predyktywne-warszawa,oferta,4187512
http://www.pracuj.pl/praca/software-engineer-r-language-krakow,oferta,4239818

When I add one line to fetch the content of each link:

url2 = urllib2.urlopen(link).read()

I get an error:

Traceback (most recent call last):
  File "download_page.py", line 10, in <module>
    url2 = urllib2.urlopen(link).read()
NameError: name 'link' is not defined

What puzzles me is that it fails only inside the for loop. When I add the same line outside the loop, it works.

Can you point out what I'm doing wrong?

Pawel

Upvotes: 0

Views: 160

Answers (2)

hlmtre

Reputation: 38

That actually does work for me. How were you formatting your code?

Mine looks vaguely like this:

for k in soup.find_all('a'):
    if "offer__list_item_link_name" in k['class']: 
        link = base_url+k['href']
        #print link
        url2 = urllib2.urlopen(link).read()
        print url2

and works just fine.

Upvotes: 0

TheoretiCAL

Reputation: 20571

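I assume your line url2 = urllib2.urlopen(link).read() runs at a point where link has not been assigned yet. A for loop in Python does not create its own scope, but link is only assigned when the if condition is true, so using it before the first matching tag raises NameError. It will work if you move the call inside the if block, right after the assignment: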

for k in soup.find_all('a'):
    if "offer__list_item_link_name" in k['class']: 
        link = base_url+k['href']
        url2 = urllib2.urlopen(link).read()
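
To see when link is actually defined, here is a minimal offline sketch. It is only an illustration: the anchors list of (class, href) pairs and the two function names are hypothetical stand-ins for the tags returned by soup.find_all('a'), and nothing is downloaded. It should run unchanged on Python 2.7 and Python 3.

```python
base_url = "http://www.pracuj.pl"

# Hypothetical stand-in data: (css_class, href) pairs instead of real <a> tags.
anchors = [
    ("nav", "/about"),
    ("offer__list_item_link_name", "/praca/a,oferta,1"),
    ("offer__list_item_link_name", "/praca/b,oferta,2"),
]


def read_outside_if(items):
    """Reads link after the if block, so the first non-matching item fails."""
    seen = []
    for css_class, href in items:
        if css_class == "offer__list_item_link_name":
            link = base_url + href
        seen.append(link)  # link is not assigned yet on the first item
    return seen


def read_inside_if(items):
    """Reads link only in the branch that assigns it."""
    seen = []
    for css_class, href in items:
        if css_class == "offer__list_item_link_name":
            link = base_url + href
            seen.append(link)
    return seen


try:
    read_outside_if(anchors)
    failed = False
except NameError:
    # Inside a function this surfaces as UnboundLocalError,
    # which is a subclass of NameError.
    failed = True

print(failed)
print(read_inside_if(anchors))
```

The first function fails because the first pair does not match, so link is read before any assignment. The second one never touches link unless the same branch has just set it.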

If you want to process the url outside of the for loop, save your links in a list:

links = []
for k in soup.find_all('a'):
    if "offer__list_item_link_name" in k['class']: 
        link = base_url+k['href']
        links.append(link)

for link in links:
    # do stuff with link, for example fetch each page
    url2 = urllib2.urlopen(link).read()
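
The collect-then-process idea can also be split into two small functions, which makes it testable without hitting the site. This is a sketch under assumptions, not the original script: collect_links, fetch_all, fake_fetch and the sample_anchors dictionaries are hypothetical names and data standing in for the BeautifulSoup tags. In the real script the fetch argument could be something like lambda u: urllib2.urlopen(u).read().

```python
def collect_links(anchors, base_url, wanted_class):
    """Return absolute URLs for anchors that carry wanted_class."""
    return [
        base_url + anchor["href"]
        for anchor in anchors
        if wanted_class in anchor.get("class", [])
    ]


def fetch_all(links, fetch):
    """Call fetch(link) for each link and return (link, result) pairs in order."""
    return [(link, fetch(link)) for link in links]


# Hypothetical stand-in for the tags returned by soup.find_all('a').
sample_anchors = [
    {"class": ["nav"], "href": "/about"},
    {"class": ["offer__list_item_link_name"], "href": "/praca/a,oferta,1"},
    {"href": "/no-class"},
]


def fake_fetch(url):
    # Placeholder so the sketch runs offline; a real script would download here.
    return "<html>%s</html>" % url


links = collect_links(
    sample_anchors, "http://www.pracuj.pl", "offer__list_item_link_name"
)
pages = fetch_all(links, fake_fetch)

for link, body in pages:
    print(link)
```

Because the list is built before anything is fetched, every link in the second loop is guaranteed to exist, and anchors without a class attribute are skipped instead of raising an error.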

Upvotes: 1
