plzhelpmi

Reputation: 37

Extracting data with BeautifulSoup and output to CSV

As mentioned in my previous questions, I am using Beautiful Soup with Python to retrieve weather data from a website.

Here is what the website's output looks like:

<channel>
<title>2 Hour Forecast</title>
<source>Meteorological Services Singapore</source>
<description>2 Hour Forecast</description>
<item>
<title>Nowcast Table</title>
<category>Singapore Weather Conditions</category>
<forecastIssue date="18-07-2016" time="03:30 PM"/>
<validTime>3.30 pm to 5.30 pm</validTime>
<weatherForecast>
<area forecast="TL" lat="1.37500000" lon="103.83900000" name="Ang Mo Kio"/>
<area forecast="SH" lat="1.32100000" lon="103.92400000" name="Bedok"/>
<area forecast="TL" lat="1.35077200" lon="103.83900000" name="Bishan"/>
<area forecast="CL" lat="1.30400000" lon="103.70100000" name="Boon Lay"/>
<area forecast="CL" lat="1.35300000" lon="103.75400000" name="Bukit Batok"/>
<area forecast="CL" lat="1.27700000" lon="103.81900000" name="Bukit Merah"/>
...
</weatherForecast>
</item>
</channel>

I managed to retrieve the information I need using this code:

import requests
from bs4 import BeautifulSoup
import urllib3

#getting the ValidTime

r = requests.get('http://www.nea.gov.sg/api/WebAPI/?dataset=2hr_nowcast&keyref=781CF461BB6606AD907750DFD1D07667C6E7C5141804F45D')
soup = BeautifulSoup(r.content, "xml")
time = soup.find('validTime').string
print "validTime: " + time

#getting the date

for currentdate in soup.find_all('item'):
    element = currentdate.find('forecastIssue')
    print "date: " + element['date']

#getting the time

for currentdate in soup.find_all('item'):
    element = currentdate.find('forecastIssue')
    print "time: " + element['time'] 

for area in soup.find('weatherForecast').find_all('area'):
    area_attrs_li = [area.attrs for area in soup.find('weatherForecast').find_all('area')]
    print area_attrs_li

Here are my results:

{'lat': u'1.34039000', 'lon': u'103.70500000', 'name': u'Jurong West', 'forecast': u'LR'},
{'lat': u'1.31200000', 'lon': u'103.86200000', 'name': u'Kallang', 'forecast': u'LR'},
  1. How do I remove the u' from the result? I tried a method I found while googling, but it doesn't seem to work.

I'm not strong in Python and have been stuck at this for quite a while.

EDIT: I tried doing this:

f = open("C:\\scripts\\nea.csv" , 'wt')

try:
    for area in area_attrs_li:
        writer = csv.writer(f)
        writer.writerow( (time, element['date'], element['time'], area_attrs_li))

finally:
    f.close()

print open("C:/scripts/nea.csv", 'rt').read()   

It worked; however, I would like to split the areas apart, as the records are duplicated in the CSV (screenshot: records in the CSV).

Thank you.

Upvotes: 0

Views: 613

Answers (1)

user2853437

Reputation: 780

EDIT 1 - Topic:

You're missing escape characters:

C:\scripts>python neaweather.py
File "neaweather.py", line 30
writer.writerow( ('time', 'element['date']', 'element['time']', 'area_attrs_li') )
                                   ^

SyntaxError: invalid syntax

With the inner quotes escaped:

writer.writerow( ('time', 'element[\'date\']', 'element[\'time\']', 'area_attrs_li') )

EDIT 2:

if you want to insert values:

writer.writerow( (time, element['date'], element['time'], area_attrs_li) )

EDIT 3:

To split the result across separate lines, write one row per area:

for area in area_attrs_li:
    writer.writerow( (time, element['date'], element['time'], area) )
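
A minimal sketch of the same idea with each attribute in its own column, written for Python 3 with only the standard library. The sample dictionaries stand in for `area_attrs_li`; `valid_time`, `issue_date`, `issue_time` and `write_rows` are hypothetical names standing in for `time`, `element['date']` and `element['time']`; and an in-memory buffer is used instead of `nea.csv` so it can be run anywhere:

```python
import csv
import io

# Sample data standing in for area_attrs_li (values copied from the question).
area_attrs_li = [
    {'lat': '1.34039000', 'lon': '103.70500000', 'name': 'Jurong West', 'forecast': 'LR'},
    {'lat': '1.31200000', 'lon': '103.86200000', 'name': 'Kallang', 'forecast': 'LR'},
]

# Hypothetical stand-ins for time, element['date'] and element['time'].
valid_time = '3.30 pm to 5.30 pm'
issue_date = '18-07-2016'
issue_time = '03:30 PM'


def write_rows(out, areas):
    """Write a header plus one CSV row per area, one column per attribute."""
    writer = csv.writer(out)  # create the writer once, outside the loop
    writer.writerow(['validTime', 'date', 'time', 'name', 'lat', 'lon', 'forecast'])
    for area in areas:
        writer.writerow([valid_time, issue_date, issue_time,
                         area['name'], area['lat'], area['lon'], area['forecast']])


buffer = io.StringIO()
write_rows(buffer, area_attrs_li)
csv_text = buffer.getvalue()
print(csv_text)
```

Because the values are written one by one, the `u'` prefix never reaches the file: it is only how Python 2 displays a unicode string inside a printed list or dict.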

EDIT 4: The splitting below is not fully correct, but it should give you a better idea of how to parse and split the data so you can adapt it to your needs. To split the area element again, as shown in your image, you can parse it:

for area in area_attrs_li:
    # area is a dict (area.attrs), so take its string form before using replace
    area = str(area)

    # cut off the characters you don't need
    area = area.replace('[','')
    area = area.replace(']','')
    area = area.replace('{','')
    area = area.replace('}','')

    # turn the u' prefixes and single quotes into double quotes
    area = area.replace("u'","\"").replace("'","\"")

    # split the string into a list
    areaList = area.split(",")

    # create your own csv-separator
    ownRowElement = ';'.join(areaList)

    writer.writerow( (time, element['date'], element['time'], ownRowElement) )
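
If `area` is still the dictionary that comes from `area.attrs`, the same `;`-joined cell can probably be built from its values directly, without going through the string form. A rough sketch for Python 3; `join_area` and its `keys` and `sep` parameters are hypothetical names, not part of the original code:

```python
# Sample dict standing in for one entry of area_attrs_li.
area = {'lat': '1.34039000', 'lon': '103.70500000', 'name': 'Jurong West', 'forecast': 'LR'}


def join_area(attrs, keys=('name', 'lat', 'lon', 'forecast'), sep=';'):
    """Join the chosen attributes into one separator-delimited string."""
    return sep.join(str(attrs[key]) for key in keys)


own_row_element = join_area(area)
print(own_row_element)
```

Listing the keys explicitly fixes the column order, which a plain dict in Python 2 does not guarantee.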

Offtopic: This works for me:

import csv
import json

x="""[ 
    {'lat': u'1.34039000', 'lon': u'103.70500000', 'name': u'Jurong West','forecast': u'LR'}
]"""

jsontxt = json.loads(x.replace("u'","\"").replace("'","\""))

f = csv.writer(open("test.csv", "w+"))

# Write CSV Header, If you dont need that, remove this line
f.writerow(['lat', 'lon', 'name', 'forecast'])

for jsontext in jsontxt:
    f.writerow([jsontext["lat"], 
                jsontext["lon"], 
                jsontext["name"], 
                jsontext["forecast"],
                ])
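
One caveat with the `replace` approach is that it would also rewrite an apostrophe inside a value. When the text is a Python literal like the one in `x`, `ast.literal_eval` from the standard library may be a more robust way to parse it. A sketch, assuming Python 3 (which accepts the `u'...'` prefix) and using an in-memory buffer in place of `test.csv`:

```python
import ast
import csv
import io

x = """[
    {'lat': u'1.34039000', 'lon': u'103.70500000', 'name': u'Jurong West','forecast': u'LR'}
]"""

# literal_eval only evaluates Python literals (lists, dicts, strings, numbers).
records = ast.literal_eval(x)

fieldnames = ['lat', 'lon', 'name', 'forecast']
out = io.StringIO()  # in-memory stand-in for open("test.csv", "w", newline="")
writer = csv.DictWriter(out, fieldnames=fieldnames)
writer.writeheader()        # header row taken from fieldnames
writer.writerows(records)   # one row per dict, columns in fieldnames order
result = out.getvalue()
print(result)
```

`csv.DictWriter` picks the values by key, so the column order follows `fieldnames` rather than the order of the keys in each dict.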

Upvotes: 1
