Reputation: 305
I have an XML page with information as follows:
<currency xmlns:xxsi>
    <Observation>
        <Currency_name>U.S. dollar </Currency_name>
        <Observation_ISO4217>USD</Observation_ISO4217>
        <Observation_date>2015-03-09</Observation_date>
        <Observation_data>1.2598</Observation_data>
        <Observation_data_reciprocal>0.7938</Observation_data_reciprocal>
    </Observation>
    <Observation>
        <Currency_name>U.S. dollar </Currency_name>
        <Observation_ISO4217>USD</Observation_ISO4217>
        <Observation_date>2015-03-11</Observation_date>
        <Observation_data>1.2764</Observation_data>
        <Observation_data_reciprocal>0.7835</Observation_data_reciprocal>
    </Observation>
    <Observation>
        <Currency_name>Argentine peso</Currency_name>
        <Observation_ISO4217>ARS</Observation_ISO4217>
        <Observation_date>2015-03-09</Observation_date>
        <Observation_data>0.1438</Observation_data>
        <Observation_data_reciprocal>6.9541</Observation_data_reciprocal>
    </Observation>
    <Observation>
        <Currency_name>Argentine peso</Currency_name>
        <Observation_ISO4217>ARS</Observation_ISO4217>
        <Observation_date>2015-03-10</Observation_date>
        <Observation_data>0.1440</Observation_data>
        <Observation_data_reciprocal>6.9444</Observation_data_reciprocal>
    </Observation>
</currency>
I want to process the data so that I can query it, for example to compare the values of one currency on two dates, or to compare the currencies of two different countries. My problem is getting that information into a dictionary, or some other structure that is convenient to store it in.
I am currently using the following code, but it won't work because there are multiple entries for the same country. The actual page has five (5) entries for every country (total of 57).
class myHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.country = []
        self.data = []
        self.dic = {}
        self.nameFlag = False

    def handle_starttag(self, tag, attrs):
        if tag == 'currency_name':
            self.nameFlag = True
        else:
            self.nameFlag = False

    def handle_endtag(self, tag):
        pass

    def handle_data(self, data):
        if data.strip() != '' and self.nameFlag == True:
            self.dic[data.strip()] = []
Can someone help me find a good way to store the data for multiple countries?
Upvotes: 1
Views: 532
Reputation: 52000
Assuming the fields inside each observation contain no nested elements, you can start from a simple parser like this:
from html.parser import HTMLParser
from pprint import pprint


class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.content = []         # one dict per <Observation>
        self.observation = False  # True while inside an <Observation>
        self.element = None       # name of the field currently being read

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag == 'observation':
            self.content.append({})
            self.observation = True
        elif self.observation:
            self.element = tag
            self.content[-1][self.element] = ""

    def handle_endtag(self, tag):
        if tag == 'observation':
            self.observation = False
        # Any end tag closes the current field, so whitespace between
        # elements is not appended to the previous value.
        self.element = None

    def handle_data(self, data):
        if self.element:
            self.content[-1][self.element] += data


with open("data.someml", "rt") as infile:
    parser = MyHTMLParser()
    parser.feed(infile.read())

pprint(parser.content)
Given your input file, this will produce:
[{'currency_name': 'U.S. dollar ',
  'observation_data': '1.2598',
  'observation_data_reciprocal': '0.7938',
  'observation_date': '2015-03-09',
  'observation_iso4217': 'USD'},
 {'currency_name': 'U.S. dollar ',
  'observation_data': '1.2764',
  'observation_data_reciprocal': '0.7835',
  'observation_date': '2015-03-11',
  'observation_iso4217': 'USD'},
 {'currency_name': 'Argentine peso',
  'observation_data': '0.1438',
  'observation_data_reciprocal': '6.9541',
  'observation_date': '2015-03-09',
  'observation_iso4217': 'ARS'},
 {'currency_name': 'Argentine peso',
  'observation_data': '0.1440',
  'observation_data_reciprocal': '6.9444',
  'observation_date': '2015-03-10',
  'observation_iso4217': 'ARS'}]
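The key idea here is to create a new record (as a dictionary) each time we encounter an observation start tag. Given the assumption stated above, any other start tag inside an observation introduces a data field. Note that HTMLParser lowercases tag names, which is why the keys come out in lowercase.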
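To compare dates or currencies, the flat list of records can be regrouped into a nested dictionary keyed by ISO code and then by date. This is only a minimal sketch, assuming records shaped like the output above; `group_by_currency` and the hand-copied `records` list are illustrative names of mine, not part of the parser.

```python
from collections import defaultdict


def group_by_currency(records):
    """Return {iso_code: {date: rate}} from a list of observation dicts."""
    rates = defaultdict(dict)
    for rec in records:
        code = rec['observation_iso4217'].strip()
        date = rec['observation_date'].strip()
        # Values arrive as strings, so convert before doing arithmetic.
        rates[code][date] = float(rec['observation_data'])
    return dict(rates)


# Sample records copied by hand from the output above (unused fields omitted).
records = [
    {'observation_iso4217': 'USD', 'observation_date': '2015-03-09',
     'observation_data': '1.2598'},
    {'observation_iso4217': 'USD', 'observation_date': '2015-03-11',
     'observation_data': '1.2764'},
    {'observation_iso4217': 'ARS', 'observation_date': '2015-03-09',
     'observation_data': '0.1438'},
    {'observation_iso4217': 'ARS', 'observation_date': '2015-03-10',
     'observation_data': '0.1440'},
]

rates = group_by_currency(records)

# Same currency, two dates.
usd_change = rates['USD']['2015-03-11'] - rates['USD']['2015-03-09']

# Two currencies, same date.
usd_vs_ars = rates['USD']['2015-03-09'] / rates['ARS']['2015-03-09']

print(usd_change, usd_vs_ars)
```

Keying on the ISO code rather than the currency name avoids problems with stray whitespace such as the trailing space in 'U.S. dollar '. ISO-formatted date strings also sort chronologically, so `sorted(rates['USD'])` would list the dates in order.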
Upvotes: 1
Reputation: 30210
If you don't care about how you parse the XML, I'd suggest using Martin Blech's xmltodict module.
Since your file is missing a single document element, you'll need to coax it into cooperating with something like:
import xmltodict

with open('input.txt') as f:
    data = f.read()

d = xmltodict.parse("<root>" + data + "</root>")
d = d['root']
Then you could access the XML structure using things like:
print(d['Observation'][0]['Currency_name']) # U.S. dollar
print(d['Observation'][0]['Observation_date']) # 2015-03-09
Or, to loop over all the observations:
for obs in d['Observation']:
    print(obs['Currency_name'])
    print(obs['Observation_date'])
    print(obs['Observation_data'])
    print('---')
Output:
U.S. dollar
2015-03-09
1.2598
---
U.S. dollar
2015-03-11
1.2764
---
Argentine peso
2015-03-09
0.1438
---
Argentine peso
2015-03-10
0.1440
---
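One caveat worth checking: as far as I know, xmltodict only produces a list when an element is repeated, so a file with a single Observation would give a plain dict and the loop above would iterate over its keys instead. Below is a small sketch of a guard; `as_list` is a hypothetical helper, and the dicts are hand-written stand-ins for xmltodict output so the snippet runs without the module installed.

```python
def as_list(value):
    """Wrap a lone item in a list so callers can always iterate over it."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


# Hand-written stand-ins for what d['Observation'] might contain.
many = [{'Currency_name': 'U.S. dollar'}, {'Currency_name': 'Argentine peso'}]
single = {'Currency_name': 'U.S. dollar'}

names_many = [obs['Currency_name'] for obs in as_list(many)]
names_single = [obs['Currency_name'] for obs in as_list(single)]

print(names_many)
print(names_single)
```

xmltodict's `parse` also takes a `force_list` argument intended for this situation, which may be the tidier option if you control the parsing call.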
Upvotes: 0