user1915050

Issue scraping data from an HTML page using Beautiful Soup

I am scraping some data from a website, and I am able to do so using the code below:

import csv
import urllib2
import sys
import time
from bs4 import BeautifulSoup
from itertools import islice
page = urllib2.urlopen('http://shop.o2.co.uk/mobile_phones/Pay_Monthly/smartphone/all_brands').read()
soup = BeautifulSoup(page)
soup.prettify()
with open('O2_2012-12-21.csv', 'wb') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=',')
    spamwriter.writerow(["Date","Month","Day of Week","OEM","Device Name","Price"])
    oems = soup.findAll('span', {"class": "wwFix_h2"},text=True)
    items = soup.findAll('div',{"class":"title"})
    prices = soup.findAll('span', {"class": "handset"})
    for oem, item, price in zip(oems, items, prices):
        textcontent = u' '.join(islice(item.stripped_strings, 1, 2, 1))
        if textcontent:
            spamwriter.writerow([time.strftime("%Y-%m-%d"), time.strftime("%B"), time.strftime("%A"), unicode(oem.string).encode('utf8').strip(), textcontent, unicode(price.string).encode('utf8').strip()])

Now, the issue is that 2 of the price values I am scraping have a different HTML structure than the rest. Because of this, my output CSV shows "None" for those two values. The normal HTML structure for a price on the webpage is <span class="handset"> FREE to £79.99</span>

For those 2 values the structure is <span class="handset"> <span class="delivery_amber">Up to 7 days delivery</span> <br>"FREE on all tariffs"</span>

The output I am getting right now shows None for the second HTML structure instead of "FREE on all tariffs". Also, the price text "FREE on all tariffs" is wrapped in double quotes in the second structure, while the price text in the first structure is not inside any quotes.

Please help me solve this issue. Pardon my ignorance, as I am new to programming.

Upvotes: 1

Views: 135

Answers (1)

Martijn Pieters

Reputation: 1121406

price.string is None whenever the tag contains more than one child, which is the case for those 2 items. Just detect them with an additional if statement:

if price.string is None:
    price_text = u' '.join(price.stripped_strings).replace('"', '').encode('utf8')
else:
    price_text = unicode(price.string).strip().encode('utf8')

then use price_text for your CSV file. Note that I removed the " quotes with a simple replace call.
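For anyone on Python 3, the same fallback can be wrapped in a small helper. This is only a sketch under some assumptions: the name price_text and the last_only flag are hypothetical and not part of Beautiful Soup, and the function only relies on the .string and .stripped_strings attributes that bs4 tags expose. No unicode() or .encode() calls are needed there because str is already Unicode. Note that joining all the stripped strings also keeps the "Up to 7 days delivery" notice, so last_only is one possible way to keep just the final text fragment.

```python
def price_text(tag, last_only=False):
    """Return the text of a price tag as a single clean string.

    `tag` is assumed to behave like a bs4 Tag: `.string` is None when the
    tag has more than one child, and `.stripped_strings` yields each
    non-empty text fragment with surrounding whitespace removed.
    `last_only` is a hypothetical convenience flag for this sketch.
    """
    if tag.string is None:
        # Nested markup: collect every text fragment inside the tag.
        parts = list(tag.stripped_strings)
        if last_only:
            # Keep only the final fragment (the price in the question's HTML).
            parts = parts[-1:]
        # Join the fragments and drop the literal double quotes.
        return ' '.join(parts).replace('"', '')
    # Simple case: the tag holds a single text child.
    return str(tag.string).strip()
```

In the question's loop this would be called as price_text(price), with the result passed to writerow in place of the unicode(price.string)... expression.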

Upvotes: 1
