Reputation: 791
I'm trying to store some data that's scraped from a website. The data I need is the text from the element, to then store in a csv for querying later on.
In the code below, I'm finding all references to the class 'vip'. Then I want to loop through those to strip away the unnecessary HTML to get the text data only. Finally I encode it with utf-8, ready to be inserted into a csv.
# parse the page and store in var soup
soup = BeautifulSoup(page, 'html.parser')

# find the title
title_box = soup.findAll('a', attrs={'class': 'vip'})
print title_box

# loop through each iteration
for each in title_box:
    if each.find('title_box'):
        title = title_box.text.strip().encode('utf-8')

# print the result
print title
However whenever I print the result of 'title' I get the following error:
Traceback (most recent call last):
File "/Users/XXXX/Projects/project-kitchenaid/scaper.py", line 28, in <module>
print title
NameError: name 'title' is not defined
From what I understand, title is out of scope. How do I retrieve the data from the loop and write it to a print call?
For context, this is just one result of print title_box:
<a class="vip" href="http://www.ebay.co.uk/itm/KITCHENAID-CLASSIC-MIXER-5K45SS-ATTACHMENTS-AND-INSTRUCTIONS-/302468759209?hash=item466c8afea9:g:2PIAAOSwCi9Zvk2D" title="Click this link to access KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS">KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS</a>]
Upvotes: 5
Views: 13898
Reputation: 146
Here are the steps:
title_box = soup.findAll('a', attrs={'class': 'vip'})
This line finds every "a" tag in the HTML and keeps only those whose class is vip.
You cannot use if each.find('title_box'): because there is no HTML tag called title_box, so find() returns None and the branch never runs.
You can get the text using
for each in title_box:
    print(each.text.strip().encode('utf-8'))
Given the extract above, no further conditional statements are needed.
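The steps above can be sketched end to end as follows. This is only a minimal sketch for Python 3, assuming the bs4 package is installed; the html string is a shortened, made-up stand-in for the scraped page, and find_all is the current spelling of findAll.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the scraped page (hypothetical sample data).
html = """
<a class="vip" href="#1" title="Click this link to access MIXER A">MIXER A</a>
<a class="other" href="#2">NOT A LISTING</a>
<a class="vip" href="#3" title="Click this link to access MIXER B"> MIXER B </a>
"""

soup = BeautifulSoup(html, "html.parser")
title_box = soup.find_all("a", attrs={"class": "vip"})

# There is no <title_box> tag inside the links, so find() returns None
# and an `if each.find('title_box'):` branch would never run.
missing = [each.find("title_box") for each in title_box]

# Read the text from each element rather than from the result list.
titles = [each.text.strip() for each in title_box]
print(titles)
```

On Python 3 the stripped strings can be printed directly; the .encode('utf-8') call matters mainly for the Python 2 code in the question.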
Upvotes: 2
Reputation: 1
You just need to be careful with the scope of the variable "title": read the text from the current element and print it inside the loop.
Try this:
# parse the page and store in var soup
soup = BeautifulSoup(page, 'html.parser')
# find the title
title_box = soup.findAll('a', attrs={'class': 'vip'})
print(title_box)
# loop through each iteration
for each in title_box:
    # read the text from the current element, not from the whole result list
    title = each.text.strip().encode('utf-8')
    # print the result while title is defined
    print(title)
Or, in case you want to store all the results:
# parse the page and store in var soup
soup = BeautifulSoup(page, 'html.parser')
# find the title
title_box = soup.findAll('a', attrs={'class': 'vip'})
print(title_box)
# loop through each iteration and collect the text of every element
title_list = []
for each in title_box:
    title_list.append(each.text.strip().encode('utf-8'))
# print the results
for title in title_list:
    print(title)
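One point worth adding, as a hedged aside: in Python, for and if blocks do not create a new scope, so the NameError in the question most likely arises because the assignment never ran, not because title is hidden inside the loop. The sketch below uses made-up strings in place of scraped elements to show both the failure and the list-based fix.

```python
# Made-up stand-ins for the scraped elements (hypothetical data).
items = ["  first title ", " second title  "]

# The condition is never true for this data, so `title` is never assigned.
for each in items:
    if each.startswith("<title_box>"):
        title = each.strip()

try:
    print(title)
    failed = False
except NameError:
    # Reached because no assignment to `title` has executed yet.
    failed = True

# Collecting every result in a list avoids relying on a name that
# may never have been bound.
title_list = []
for each in items:
    title_list.append(each.strip())

for title in title_list:
    print(title)
```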
Upvotes: 0
Reputation:
As I said in the comment, using each.find('title_box') won't fetch you anything, because there's no title_box tag.
Since you need the a elements with a class attribute of vip, this is what you should be checking for:
if 'vip' in each['class']:
Also, when this line of your code runs:
title_box = soup.findAll('a', attrs={'class': 'vip'})
the title_box list is already populated with a elements that have a class attribute of vip. So, you don't have to check the same condition again in the for loop.
This is the code you should try:
for each in title_box:
    title = each.text.strip().encode('utf-8')
    print title
Of course, you can do away with assigning the text to a variable altogether and print it directly:
print each.text.strip().encode('utf-8')
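To illustrate why the extra check is redundant, here is a small sketch for Python 3, assuming bs4 is installed and using a made-up HTML string. Beautiful Soup treats class as a multi-valued attribute, so each['class'] is a list of class names, and every element returned by the filtered search already contains vip.

```python
from bs4 import BeautifulSoup

# Made-up markup: one link with two classes, one link without "vip".
html = '<a class="vip big" href="#1">ONE</a><a class="plain" href="#2">TWO</a>'
soup = BeautifulSoup(html, "html.parser")
title_box = soup.find_all("a", attrs={"class": "vip"})

# Each element's class attribute comes back as a list of names.
classes = [each["class"] for each in title_box]

# Every result already carries the class that was searched for.
already_filtered = all("vip" in each["class"] for each in title_box)

texts = [each.text.strip() for each in title_box]
print(classes, already_filtered, texts)
```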
Upvotes: 0
Reputation: 21643
I made an HTML file consisting of five copies of your a element and called it 'temp.htm':
<a class="vip" href="http://www.ebay.co.uk/itm/KITCHENAID-CLASSIC-MIXER-5K45SS-ATTACHMENTS-AND-INSTRUCTIONS-/302468759209?hash=item466c8afea9:g:2PIAAOSwCi9Zvk2D" title="Click this link to access KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS">KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS</a>
<a class="vip" href="http://www.ebay.co.uk/itm/KITCHENAID-CLASSIC-MIXER-5K45SS-ATTACHMENTS-AND-INSTRUCTIONS-/302468759209?hash=item466c8afea9:g:2PIAAOSwCi9Zvk2D" title="Click this link to access KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS">KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS</a>
<a class="vip" href="http://www.ebay.co.uk/itm/KITCHENAID-CLASSIC-MIXER-5K45SS-ATTACHMENTS-AND-INSTRUCTIONS-/302468759209?hash=item466c8afea9:g:2PIAAOSwCi9Zvk2D" title="Click this link to access KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS">KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS</a>
<a class="vip" href="http://www.ebay.co.uk/itm/KITCHENAID-CLASSIC-MIXER-5K45SS-ATTACHMENTS-AND-INSTRUCTIONS-/302468759209?hash=item466c8afea9:g:2PIAAOSwCi9Zvk2D" title="Click this link to access KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS">KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS</a>
<a class="vip" href="http://www.ebay.co.uk/itm/KITCHENAID-CLASSIC-MIXER-5K45SS-ATTACHMENTS-AND-INSTRUCTIONS-/302468759209?hash=item466c8afea9:g:2PIAAOSwCi9Zvk2D" title="Click this link to access KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS">KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS</a>
Then I ran this code to get the texts in those links:
>>> page = open('temp.htm').read()
>>> import bs4
>>> soup = bs4.BeautifulSoup(page, 'lxml')
>>> for link in soup.select('.vip'):
...     link.text
...
'KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS'
'KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS'
'KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS'
'KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS'
'KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS'
You might still need to encode these texts for deposit in your csv file.
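For the CSV step, a minimal sketch with Python 3's standard csv module might look like this. On Python 3 the strings can be written as-is, with the encoding chosen when the file is opened, so a manual .encode('utf-8') is not needed there. The function name write_titles, the column header title and the file name titles.csv are illustrative assumptions, not part of the original code.

```python
import csv
import io


def write_titles(titles, fileobj):
    """Write a header row followed by one title per row."""
    writer = csv.writer(fileobj)
    writer.writerow(["title"])
    for title in titles:
        writer.writerow([title])


titles = ["KITCHENAID CLASSIC MIXER 5K45SS - ATTACHMENTS AND INSTRUCTIONS"]

# In-memory buffer for demonstration; for a real file use
# open("titles.csv", "w", newline="", encoding="utf-8") instead.
buffer = io.StringIO()
write_titles(titles, buffer)
print(buffer.getvalue())
```

Passing the file object in, rather than a path, keeps the function easy to try out against an in-memory buffer before pointing it at a real file.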
Upvotes: 1