Rhys Edwards

Reputation: 791

Looping through elements with Beautifulsoup

I'm trying to store some data that's scraped from a website. The data I need is the text from the element, to then store in a csv for querying later on.

In the code below, I'm finding all references to the class 'vip'. Then I want to loop through those to strip away the unnecessary HTML to get the text data only. Finally I encode it with utf-8, ready to be inserted into a csv.

# parse the page and store in var soup
soup = BeautifulSoup(page, 'html.parser')

# find the title
title_box = soup.findAll('a', attrs={'class': 'vip'})

print title_box

# loop through each iteration
for each in title_box:
    if each.find('title_box'):
        title = title_box.text.strip().encode('utf-8')

# print the result
print title

However whenever I print the result of 'title' I get the following error:

Traceback (most recent call last):
  File "/Users/XXXX/Projects/project-kitchenaid/scaper.py", line 28, in <module>
    print title
NameError: name 'title' is not defined

From what I understand, title is out of scope. How do I retrieve the data from the loop and print it?

For context, this is just one result of print title_box:

<a class="vip" href="http://www.ebay.co.uk/itm/KITCHENAID-CLASSIC-MIXER-5K45SS-ATTACHMENTS-AND-INSTRUCTIONS-/302468759209?hash=item466c8afea9:g:2PIAAOSwCi9Zvk2D" title="Click this link to access KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS">KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS</a>]

Upvotes: 5

Views: 13898

Answers (4)

RajkumarG

Reputation: 146

Here are the steps:

  1. title_box = soup.findAll('a', attrs={'class': 'vip'}) This line finds every "a" tag in the HTML and keeps only those with the class vip.

  2. You cannot do if each.find('title_box'): because there is no html tag called title_box

  3. You can get the text using

    for each in title_box: print(each.text.strip().encode('utf-8'))

No further conditional statements are needed for the extract above.

Upvotes: 2

Aphonisis

Reputation: 1

Just need to be careful with the variable "title" scope.

Try this:

from bs4 import BeautifulSoup

# parse the page and store in var soup
soup = BeautifulSoup(page, 'html.parser')

# find the title
title_box = soup.findAll('a', attrs={'class': 'vip'})

print(title_box)

# loop through each iteration

for each in title_box:
    # every item is already an <a class="vip"> element, so no extra check is needed
    title = each.text.strip().encode('utf-8')
    # print the result
    print(title)

or, in case you want to store all the results

from bs4 import BeautifulSoup

# parse the page and store in var soup
soup = BeautifulSoup(page, 'html.parser')

# find the title
title_box = soup.findAll('a', attrs={'class': 'vip'})

print(title_box)

# loop through each iteration
title_list = []
for each in title_box:
    # take the text from the current element, not from the whole result list
    title_list.append(each.text.strip().encode('utf-8'))

# print the results
for title in title_list:
    print(title)

Upvotes: 0

user4066647

Reputation:

As I said in the comment, using each.find('title_box') won't fetch you anything, because there's no title_box tag.

Since you need the a elements with a class attribute of vip, this is what you should be checking for:

if 'vip' in each['class']:

Also, when this line of your code runs:

title_box = soup.findAll('a', attrs={'class': 'vip'})

the title_box list is already populated with a elements that have a class attribute of vip. So, you don't have to check the same condition again in the for loop.

This is the code you should try:

for each in title_box:
    title = each.text.strip().encode('utf-8')
    print title

Of course, you can do away with assigning the text to a variable altogether and print it directly:

print each.text.strip().encode('utf-8')
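
For what it's worth, the NameError in the question is less about scope than about an assignment that never runs. A Python for loop or if block does not create a new scope, so a name assigned inside the loop is still visible after it, provided the assignment executed at least once. The sketch below illustrates this in Python 3 with plain strings standing in for the a elements. The names texts, title_never and error_message are made up for the illustration, and `if None:` stands in for `if each.find('title_box'):`, which returns None when no such tag exists.

```python
# Plain strings stand in for the <a class="vip"> elements (illustrative data).
texts = ["  MIXER A  ", "  MIXER B  "]

# Case 1: the condition is never true, so the assignment never runs.
error_message = None
try:
    for text in texts:
        if None:  # stands in for each.find('title_box'), which returns None
            title_never = text.strip()
    print(title_never)  # the name was never bound, so this raises NameError
except NameError as exc:
    error_message = str(exc)

# Case 2: no condition, so the name is bound on every pass
# and is still available after the loop ends.
for text in texts:
    title = text.strip()

print(error_message)
print(title)  # MIXER B (the value from the last pass)
```

Note that in the second case only the last title survives the loop, which is why printing inside the loop or appending to a list is usually what you want.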

Upvotes: 0

Bill Bell

Reputation: 21643

I made an HTML file consisting of five copies of your a element and called it 'temp.htm':

<a class="vip" href="http://www.ebay.co.uk/itm/KITCHENAID-CLASSIC-MIXER-5K45SS-ATTACHMENTS-AND-INSTRUCTIONS-/302468759209?hash=item466c8afea9:g:2PIAAOSwCi9Zvk2D" title="Click this link to access KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS">KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS</a>
<a class="vip" href="http://www.ebay.co.uk/itm/KITCHENAID-CLASSIC-MIXER-5K45SS-ATTACHMENTS-AND-INSTRUCTIONS-/302468759209?hash=item466c8afea9:g:2PIAAOSwCi9Zvk2D" title="Click this link to access KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS">KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS</a>
<a class="vip" href="http://www.ebay.co.uk/itm/KITCHENAID-CLASSIC-MIXER-5K45SS-ATTACHMENTS-AND-INSTRUCTIONS-/302468759209?hash=item466c8afea9:g:2PIAAOSwCi9Zvk2D" title="Click this link to access KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS">KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS</a>
<a class="vip" href="http://www.ebay.co.uk/itm/KITCHENAID-CLASSIC-MIXER-5K45SS-ATTACHMENTS-AND-INSTRUCTIONS-/302468759209?hash=item466c8afea9:g:2PIAAOSwCi9Zvk2D" title="Click this link to access KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS">KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS</a>
<a class="vip" href="http://www.ebay.co.uk/itm/KITCHENAID-CLASSIC-MIXER-5K45SS-ATTACHMENTS-AND-INSTRUCTIONS-/302468759209?hash=item466c8afea9:g:2PIAAOSwCi9Zvk2D" title="Click this link to access KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS">KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS</a>

Then I ran this code to get the texts in those links:

>>> page = open('temp.htm').read()
>>> import bs4
>>> soup = bs4.BeautifulSoup(page, 'lxml')
>>> for link in soup.select('.vip'):
...     link.text
... 
'KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS'
'KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS'
'KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS'
'KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS'
'KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS'

You might still need to encode these texts for deposit in your csv file.
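
As a rough sketch of that last step, assuming Python 3: the csv module works with text there, so the strings can be written as they are and the encoding given when the file is opened, rather than calling .encode('utf-8') on each one. The function name write_titles and the file name titles.csv are made up for illustration, and an in-memory buffer stands in for the file.

```python
import csv
import io


def write_titles(titles, fileobj):
    """Write one title per row to an open text file object (hypothetical helper)."""
    writer = csv.writer(fileobj)
    writer.writerow(["title"])  # header row
    for title in titles:
        writer.writerow([title.strip()])


# The same text that the five links in temp.htm produce.
titles = ["KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS"] * 5

# An in-memory buffer stands in for a real file in this sketch.
buffer = io.StringIO()
write_titles(titles, buffer)
print(buffer.getvalue())
```

With a real file, something like open('titles.csv', 'w', newline='', encoding='utf-8') can be passed in place of the buffer.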

Upvotes: 1
