mikejoh
mikejoh

Reputation: 1457

Extract and group elements/tags with BeautifulSoup

I have a file that contains the following type and structure of data:

<data>
    <from>A</from>
    <to>B</to>
    <data>
        <name>EXAMPLE ONE</name>
        <info>
            <some_data>1</some_data>
            <more_data>2</more_data>
        </info>
        <random>
            <some_tag>
            </foobar>
            <foo>
                <bar />
           </foo>
        </random>
    </data>
    <data>
        <name>EXAMPLE TWO</name>
        <info>
            <some_data>3</some_data>
            <more_data>4</more_data>
        </info>
        <random>
            <some_tag>
            </foobar>
            <foo>
                <bar />
           </foo>
        </random>
   </data>
</data>
<data>
    <from>C</from>
    <to>D</to>
    <data>
        <name>EXAMPLE</name>
        <info>
            <some_data>1</some_data>
            <more_data>2</more_data>
        </info>
        <random>
            <some_tag>
            </foobar>
            <foo>
                <bar />
           </foo>
        </random>
    </data>
 </data>

The data continues in this exact structure in the file with the exception of the inner most <data>...</data> tags that can and is repeated n times, the data structure always starts with a <data> tag and then continues with the <from>...</from> and <to>...</to> tags.

What i want to do is to extract all the data between the outer most <data> tags with the <to> and <from> as a description of the data blocks. I of course also want to seperate the inner most <data> tags from each other and save this data in a way so that it's clear that the outer most data is related to the parent data.

I don't have a exact idea of how i want to save the data so any examples is appreciated!

I'm testing this with the Python module BeautifulSoup and have searched and read a lot of examples here but haven't found anything that can point me into the correct direction.

Thanks!

Upvotes: 0

Views: 2182

Answers (1)

daedalus
daedalus

Reputation: 10923

The fact that you are doubling the tag name <data> as the container of your records as well as an element inside creates problems. BeautifulSoup is forgiving of such issues and here is a way you may want to use in case you cannot go back and change the XML structure.

Assign the data to a variable. This may be read in from text file, of course:

data = '''<data>
    <from>A</from>
    <to>B</to>
    <data>
        <name>EXAMPLE ONE</name>
        <info>
            <some_data>1</some_data>
            <more_data>2</more_data>
        </info>
        <random>
            <some_tag>
            </foobar>
            <foo>
                <bar />
           </foo>
        </random>
    </data>
    <data>
        <name>EXAMPLE TWO</name>
        <info>
            <some_data>3</some_data>
            <more_data>4</more_data>
        </info>
        <random>
            <some_tag>
            </foobar>
            <foo>
                <bar />
           </foo>
        </random>
   </data>
</data>
<data>
    <from>C</from>
    <to>D</to>
    <data>
        <name>EXAMPLE</name>
        <info>
            <some_data>1</some_data>
            <more_data>2</more_data>
        </info>
        <random>
            <some_tag>
            </foobar>
            <foo>
                <bar />
           </foo>
        </random>
    </data>
 </data>'''

Process the data:

from BeautifulSoup import BeautifulSoup
from pprint import pprint

store = {}
key = ()

soup = BeautifulSoup(data)

recs = soup.findAll('data')

for rec in recs:
    if rec.find('from'):
        key = (rec.find('from').text, 
               rec.find('to').text)
    else:
        item = {}
        item['name'] = rec.find('name').text
        item['some_data'] = rec.find('info').find('some_data').text
        item['more_data'] = rec.find('info').find('more_data').text
        if store.has_key(key):
            store[key].append(item)
        else:
            store[key] = [ item ]

pprint(store)

And the result with this dummy data:

{(u'A', u'B'): [{'more_data': u'2',
                 'name': u'EXAMPLE ONE',
                 'some_data': u'1'},
                {'more_data': u'4',
                 'name': u'EXAMPLE TWO',
                 'some_data': u'3'}],
 (u'C', u'D'): [{'more_data': u'2', 'name': u'EXAMPLE', 'some_data': u'1'}]}

Upvotes: 1

Related Questions