HTML Parser on XML page, getting a dictionary of data

Question

I have an xml page with information as follows:



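<observation>
    <currency_name>U.S. dollar </currency_name>
    <observation_iso4217>USD</observation_iso4217>
    <observation_date>2015-03-09</observation_date>
    <observation_data>1.2598</observation_data>
    <observation_data_reciprocal>0.7938</observation_data_reciprocal>
</observation>
<observation>
    <currency_name>U.S. dollar </currency_name>
    <observation_iso4217>USD</observation_iso4217>
    <observation_date>2015-03-11</observation_date>
    <observation_data>1.2764</observation_data>
    <observation_data_reciprocal>0.7835</observation_data_reciprocal>
</observation>
<observation>
    <currency_name>Argentine peso</currency_name>
    <observation_iso4217>ARS</observation_iso4217>
    <observation_date>2015-03-09</observation_date>
    <observation_data>0.1438</observation_data>
    <observation_data_reciprocal>6.9541</observation_data_reciprocal>
</observation>
<observation>
    <currency_name>Argentine peso</currency_name>
    <observation_iso4217>ARS</observation_iso4217>
    <observation_date>2015-03-10</observation_date>
    <observation_data>0.1440</observation_data>
    <observation_data_reciprocal>6.9444</observation_data_reciprocal>
</observation>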

I want to process the data so that I can query it, for example to compare the values of one currency on two different dates, or to compare the currencies of two different countries. My problem is getting that information into a dictionary so that it is stored in a convenient form.

I am currently using the following code, but it won't work because each country appears several times in the data. The actual page has five (5) entries for every country (total of 57).

class myHTMLParser(HTMLParser):

    def __init__(self):
        HTMLParser.__init__(self)
        self.country = []
        self.data = []
        self.dic = {}
        self.nameFlag = False

    def handle_starttag(self, tag, attrs):
        if tag == 'currency_name':
            self.nameFlag = True
        else:
            self.nameFlag = False

    def handle_endtag(self, tag):
        pass

    def handle_data(self, data):
        if data.strip() != '' and self.nameFlag == True:
            self.dic[data.strip()] = []

Can someone help me get a good way to store the data for multiple countries?

Sylvain Leroux · Accepted Answer

Assuming the fields inside each observation contain no nested elements, you can start from a simple parser like this:

from html.parser import HTMLParser
from pprint import pprint

class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.content = []         # one dict per <observation> record
        self.observation = False  # True while inside an <observation>
        self.element = None       # name of the field currently being read

    def handle_starttag(self, tag, attrs):
        if tag == 'observation':
            # Start a new record.
            self.content.append({})
            self.observation = True
        elif self.observation:
            # Any other tag inside an observation introduces a field.
            self.element = tag
            self.content[-1][self.element] = ""

    def handle_endtag(self, tag):
        if tag == 'observation':
            self.observation = False

        # Whatever tag just closed, we are no longer inside a field.
        self.element = None

    def handle_data(self, data):
        if self.element:
            self.content[-1][self.element] += data

with open("data.someml", "rt") as infile:
    parser = MyHTMLParser()
    parser.feed(infile.read())

pprint(parser.content)

Given your input file, this will produce:

[{'currency_name': 'U.S. dollar ',
  'observation_data': '1.2598',
  'observation_data_reciprocal': '0.7938',
  'observation_date': '2015-03-09',
  'observation_iso4217': 'USD'},
 {'currency_name': 'U.S. dollar ',
  'observation_data': '1.2764',
  'observation_data_reciprocal': '0.7835',
  'observation_date': '2015-03-11',
  'observation_iso4217': 'USD'},
 {'currency_name': 'Argentine peso',
  'observation_data': '0.1438',
  'observation_data_reciprocal': '6.9541',
  'observation_date': '2015-03-09',
  'observation_iso4217': 'ARS'},
 {'currency_name': 'Argentine peso',
  'observation_data': '0.1440',
  'observation_data_reciprocal': '6.9444',
  'observation_date': '2015-03-10',
  'observation_iso4217': 'ARS'}]

The key idea here is to create a new record (as a dictionary) each time we encounter an observation start tag. Given the no-nesting assumption stated at the top, any other start tag inside an observation introduces a data field, and the text that follows it is stored under that tag's name.
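
To get from that list of records to something you can query by currency and date, one option is to regroup it into a nested dictionary. The following is only a sketch: the helper name group_by_currency is made up for illustration, the sample records are copied from the output above (trimmed to the fields used), and it assumes every record carries the observation_iso4217, observation_date and observation_data keys with a numeric string as the value.

```python
from collections import defaultdict

# Sample records in the shape produced by the parser above.
# Values are strings, exactly as they come out of the parser.
records = [
    {'currency_name': 'U.S. dollar ', 'observation_iso4217': 'USD',
     'observation_date': '2015-03-09', 'observation_data': '1.2598'},
    {'currency_name': 'U.S. dollar ', 'observation_iso4217': 'USD',
     'observation_date': '2015-03-11', 'observation_data': '1.2764'},
    {'currency_name': 'Argentine peso', 'observation_iso4217': 'ARS',
     'observation_date': '2015-03-09', 'observation_data': '0.1438'},
    {'currency_name': 'Argentine peso', 'observation_iso4217': 'ARS',
     'observation_date': '2015-03-10', 'observation_data': '0.1440'},
]

def group_by_currency(records):
    """Hypothetical helper: map ISO code -> {date: rate as float}."""
    grouped = defaultdict(dict)
    for rec in records:
        code = rec['observation_iso4217'].strip()
        date = rec['observation_date'].strip()
        grouped[code][date] = float(rec['observation_data'])
    return dict(grouped)

rates = group_by_currency(records)

# Compare two dates of the same currency.
usd_change = rates['USD']['2015-03-11'] - rates['USD']['2015-03-09']

# Compare two currencies on the same date.
usd_vs_ars = rates['USD']['2015-03-09'] / rates['ARS']['2015-03-09']
```

Keying on the ISO code rather than the currency name avoids problems with stray whitespace in names such as 'U.S. dollar '. With the real parser you would pass parser.content instead of the sample list.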
