Reputation: 1457
I have a file that contains the following type and structure of data:
<data>
<from>A</from>
<to>B</to>
<data>
<name>EXAMPLE ONE</name>
<info>
<some_data>1</some_data>
<more_data>2</more_data>
</info>
<random>
<some_tag>
</foobar>
<foo>
<bar />
</foo>
</random>
</data>
<data>
<name>EXAMPLE TWO</name>
<info>
<some_data>3</some_data>
<more_data>4</more_data>
</info>
<random>
<some_tag>
</foobar>
<foo>
<bar />
</foo>
</random>
</data>
</data>
<data>
<from>C</from>
<to>D</to>
<data>
<name>EXAMPLE</name>
<info>
<some_data>1</some_data>
<more_data>2</more_data>
</info>
<random>
<some_tag>
</foobar>
<foo>
<bar />
</foo>
</random>
</data>
</data>
The file continues in exactly this structure, except that the innermost <data>...</data> block can be repeated n times. Each record always starts with a <data> tag, followed by the <from>...</from> and <to>...</to> tags.
What I want to do is extract all the data between the outermost <data> tags, using the <to> and <from> values as a description of each block. I also want to separate the innermost <data> blocks from each other and save them in a way that makes it clear which outer block each inner block belongs to.
I don't have an exact idea of how I want to save the data, so any examples are appreciated!
I'm testing this with the Python module BeautifulSoup and have searched and read a lot of examples here, but haven't found anything that points me in the right direction.
Thanks!
Upvotes: 0
Views: 2182
Reputation: 10923
Using the same tag name, <data>, both for the container of your records and for an element inside it creates problems. BeautifulSoup is forgiving of such issues, and here is an approach you may want to use if you cannot go back and change the XML structure.
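To illustrate why the repeated tag name is awkward, here is a small sketch using a naive regular expression rather than BeautifulSoup. The `sample` string is a shortened, hypothetical stand-in for your file, not your real data. A non-greedy match stops at the first closing tag, so the "outer" block gets cut off in the middle of the inner one:

```python
import re

# Hypothetical, shortened stand-in for the real file contents.
sample = "<data><from>A</from><data><name>X</name></data></data>"

# Non-greedy match: runs from the first <data> to the FIRST </data>,
# which is the closing tag of the inner block, not the outer one.
match = re.search(r"<data>(.*?)</data>", sample, re.S)
print(match.group(1))
```

The captured text still contains an opening <data> with no matching close, which is the kind of ambiguity a forgiving parser has to resolve for you.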
Assign the data to a variable. It could of course be read in from a text file instead:
data = '''<data>
<from>A</from>
<to>B</to>
<data>
<name>EXAMPLE ONE</name>
<info>
<some_data>1</some_data>
<more_data>2</more_data>
</info>
<random>
<some_tag>
</foobar>
<foo>
<bar />
</foo>
</random>
</data>
<data>
<name>EXAMPLE TWO</name>
<info>
<some_data>3</some_data>
<more_data>4</more_data>
</info>
<random>
<some_tag>
</foobar>
<foo>
<bar />
</foo>
</random>
</data>
</data>
<data>
<from>C</from>
<to>D</to>
<data>
<name>EXAMPLE</name>
<info>
<some_data>1</some_data>
<more_data>2</more_data>
</info>
<random>
<some_tag>
</foobar>
<foo>
<bar />
</foo>
</random>
</data>
</data>'''
Process the data:
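# BeautifulSoup 3 import (Python 2); the newer bs4 package
# uses `from bs4 import BeautifulSoup` instead.
from BeautifulSoup import BeautifulSoup
from pprint import pprint

store = {}   # maps a (from, to) tuple to a list of item dicts
key = ()     # the (from, to) pair of the most recent outer block

soup = BeautifulSoup(data)
recs = soup.findAll('data')

for rec in recs:
    if rec.find('from'):
        # Outer block: remember its (from, to) pair as the current key.
        key = (rec.find('from').text,
               rec.find('to').text)
    else:
        # Inner block: collect its fields and file it under the current key.
        item = {}
        item['name'] = rec.find('name').text
        item['some_data'] = rec.find('info').find('some_data').text
        item['more_data'] = rec.find('info').find('more_data').text
        store.setdefault(key, []).append(item)

pprint(store)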
And the result with this dummy data:
{(u'A', u'B'): [{'more_data': u'2',
'name': u'EXAMPLE ONE',
'some_data': u'1'},
{'more_data': u'4',
'name': u'EXAMPLE TWO',
'some_data': u'3'}],
(u'C', u'D'): [{'more_data': u'2', 'name': u'EXAMPLE', 'some_data': u'1'}]}
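Since you asked about saving the data: one option is JSON. JSON object keys must be strings, so the tuple keys above cannot be dumped directly. A minimal sketch, assuming a dictionary shaped like `store` (the sample values are copied from the dummy data, and the helper name `to_records` and the field name `items` are my own choices, not anything required by a library):

```python
import json

# Sample shaped like the `store` dictionary built above.
store = {
    ("A", "B"): [
        {"name": "EXAMPLE ONE", "some_data": "1", "more_data": "2"},
        {"name": "EXAMPLE TWO", "some_data": "3", "more_data": "4"},
    ],
    ("C", "D"): [
        {"name": "EXAMPLE", "some_data": "1", "more_data": "2"},
    ],
}


def to_records(store):
    """Turn {(from, to): [items]} into a list of JSON-friendly dicts."""
    records = []
    for (source, target), items in sorted(store.items()):
        # Each record keeps the outer from/to pair next to its inner items.
        records.append({"from": source, "to": target, "items": items})
    return records


records = to_records(store)
text = json.dumps(records, indent=2, sort_keys=True)
print(text)

# Reading the text back gives the same nested structure.
restored = json.loads(text)
```

To write to disk instead, `json.dump(records, f, indent=2)` with an open file object `f` should work the same way. Each record keeps the parent's from/to values together with its children, so the relationship stays explicit.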
Upvotes: 1