Torched90

Reputation: 305

HTML Parser on XML page, getting a dictionary of data

I have an XML page with information as follows:

<currency xmlns:xxsi>
<Observation>
    <Currency_name>U.S. dollar </Currency_name>
    <Observation_ISO4217>USD</Observation_ISO4217>
    <Observation_date>2015-03-09</Observation_date>
    <Observation_data>1.2598</Observation_data>
    <Observation_data_reciprocal>0.7938</Observation_data_reciprocal>
</Observation>
<Observation>
    <Currency_name>U.S. dollar </Currency_name>
    <Observation_ISO4217>USD</Observation_ISO4217>
    <Observation_date>2015-03-11</Observation_date>
    <Observation_data>1.2764</Observation_data>
    <Observation_data_reciprocal>0.7835</Observation_data_reciprocal>
</Observation>
<Observation>
    <Currency_name>Argentine peso</Currency_name>
    <Observation_ISO4217>ARS</Observation_ISO4217>
    <Observation_date>2015-03-09</Observation_date>
    <Observation_data>0.1438</Observation_data>
    <Observation_data_reciprocal>6.9541</Observation_data_reciprocal>
</Observation>
<Observation>
    <Currency_name>Argentine peso</Currency_name>
    <Observation_ISO4217>ARS</Observation_ISO4217>
    <Observation_date>2015-03-10</Observation_date>
    <Observation_data>0.1440</Observation_data>
    <Observation_data_reciprocal>6.9444</Observation_data_reciprocal>
</Observation>
</currency>

I want to process the data so I can get information out of it, for example to compare two dates for the same currency, or to compare the currencies of two different countries. The problem I am having is getting that information into a dictionary, which seems like a good way to store it.

I am currently using the following code, but it won't work because each country appears more than once. The actual page has five observations for every country (57 countries in total).

from html.parser import HTMLParser

class myHTMLParser(HTMLParser):

    def __init__(self):
        HTMLParser.__init__(self)
        self.country = []
        self.data = []
        self.dic = {}
        self.nameFlag = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so compare against lowercase
        if tag == 'currency_name':
            self.nameFlag = True
        else:
            self.nameFlag = False

    def handle_endtag(self, tag):
        pass

    def handle_data(self, data):
        # Repeated currency names overwrite the same key here
        if data.strip() != '' and self.nameFlag == True:
            self.dic[data.strip()] = []

Can someone help me get a good way to store the data for multiple countries?

Upvotes: 1

Views: 532

Answers (2)

Sylvain Leroux

Reputation: 52000

Assuming you don't have nested elements inside each observation, you can start from a simple parser like this:

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.content = []
        self.observation = False
        self.element = None

    def handle_starttag(self, tag, attrs):
        if tag == 'observation':
            self.content.append({})
            self.observation = True
        elif self.observation:
            self.element = tag
            self.content[-1][self.element] = ""

    def handle_endtag(self, tag):
        if tag == 'observation':
            self.observation = False

        self.element = None

    def handle_data(self, data):
        if self.element:
            self.content[-1][self.element] += data

from pprint import pprint

with open("data.someml", "rt") as infile:
    parser = MyHTMLParser()
    parser.feed(infile.read())

    pprint(parser.content)

Given your input file, this will produce:

[{'currency_name': 'U.S. dollar ',
  'observation_data': '1.2598',
  'observation_data_reciprocal': '0.7938',
  'observation_date': '2015-03-09',
  'observation_iso4217': 'USD'},
 {'currency_name': 'U.S. dollar ',
  'observation_data': '1.2764',
  'observation_data_reciprocal': '0.7835',
  'observation_date': '2015-03-11',
  'observation_iso4217': 'USD'},
 {'currency_name': 'Argentine peso',
  'observation_data': '0.1438',
  'observation_data_reciprocal': '6.9541',
  'observation_date': '2015-03-09',
  'observation_iso4217': 'ARS'},
 {'currency_name': 'Argentine peso',
  'observation_data': '0.1440',
  'observation_data_reciprocal': '6.9444',
  'observation_date': '2015-03-10',
  'observation_iso4217': 'ARS'}]

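The key idea here is to create a new record (as a dictionary) each time we encounter an observation start tag. Given the no-nesting assumption above, any other start tag inside an observation introduces a data field. Note that HTMLParser lowercases tag names, which is why the code compares against 'observation' and the keys in the output are lowercase.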
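To compare dates or currencies, the flat list of records can then be regrouped into a nested dictionary. The following is only a rough sketch: the function name group_by_currency and the {code: {date: rate}} layout are my own assumptions, and the sample records are copied from the output above rather than read from your file.

```python
from collections import defaultdict

# Sample records in the shape produced by the parser above
# (lowercase keys, values still strings).
records = [
    {'currency_name': 'U.S. dollar ', 'observation_iso4217': 'USD',
     'observation_date': '2015-03-09', 'observation_data': '1.2598'},
    {'currency_name': 'U.S. dollar ', 'observation_iso4217': 'USD',
     'observation_date': '2015-03-11', 'observation_data': '1.2764'},
    {'currency_name': 'Argentine peso', 'observation_iso4217': 'ARS',
     'observation_date': '2015-03-09', 'observation_data': '0.1438'},
    {'currency_name': 'Argentine peso', 'observation_iso4217': 'ARS',
     'observation_date': '2015-03-10', 'observation_data': '0.1440'},
]

def group_by_currency(records):
    """Hypothetical helper: return {iso_code: {date: rate}} with float rates."""
    rates = defaultdict(dict)
    for rec in records:
        code = rec['observation_iso4217'].strip()
        date = rec['observation_date'].strip()
        rates[code][date] = float(rec['observation_data'])
    return dict(rates)

rates = group_by_currency(records)

# Same currency, two dates
print(rates['USD']['2015-03-11'] - rates['USD']['2015-03-09'])

# Two currencies, same date
print(rates['USD']['2015-03-09'], rates['ARS']['2015-03-09'])
```

Keying on the ISO code rather than the currency name avoids problems with stray whitespace such as the trailing space in 'U.S. dollar '.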

Upvotes: 1

jedwards

Reputation: 30210

If you don't care about how you parse the XML, I'd suggest using Martin Blech's xmltodict module.

Since your file is missing a single document element, you'll need to coax it into cooperating with something like:

import xmltodict

with open('input.txt') as f:
    data = f.read()
    d = xmltodict.parse("<root>" + data + "</root>")

d = d['root']

Then you could access the XML structure using things like:

print(d['Observation'][0]['Currency_name'])     # U.S. dollar
print(d['Observation'][0]['Observation_date'])  # 2015-03-09

Or, to loop over all the observations:

for obs in d['Observation']:
    print(obs['Currency_name'])
    print(obs['Observation_date'])
    print(obs['Observation_data'])
    print('---')

Output:

U.S. dollar
2015-03-09
1.2598
---
U.S. dollar
2015-03-11
1.2764
---
Argentine peso
2015-03-09
0.1438
---
Argentine peso
2015-03-10
0.1440
---
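One thing to watch for: xmltodict gives you a list only when an element is repeated. With a single Observation you get one dictionary instead, and the loop above would then iterate over its keys. A small guard like the one below handles both cases. The helper name as_list and the two sample dictionaries are made up for illustration and only mimic the shape of xmltodict's result.

```python
def as_list(value):
    """Hypothetical helper: wrap a lone item in a list, pass lists through."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]

# Stand-ins for parsed results: one observation vs. several
one = {'Observation': {'Currency_name': 'U.S. dollar',
                       'Observation_data': '1.2598'}}
many = {'Observation': [
    {'Currency_name': 'U.S. dollar', 'Observation_data': '1.2598'},
    {'Currency_name': 'Argentine peso', 'Observation_data': '0.1438'},
]}

for parsed in (one, many):
    for obs in as_list(parsed.get('Observation')):
        # Values arrive as strings, so convert before doing arithmetic
        print(obs['Currency_name'], float(obs['Observation_data']))
```

As far as I know, xmltodict.parse also accepts a force_list argument that serves the same purpose at parse time.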

Upvotes: 0
