Extract and group elements/tags with BeautifulSoup

Question

I have a file that contains the following type and structure of data:


    A
    B
    
        EXAMPLE ONE
        
            1
            2
        
        
            
            
            
                
           
        
    
    
        EXAMPLE TWO
        
            3
            4
        
        
            
            
            
                
           
        
   


    C
    D
    
        EXAMPLE
        
            1
            2

The data continues in this exact structure throughout the file, except that the innermost <data>...</data> blocks can be repeated n times. Each structure always starts with a <data> tag, followed by the <from>...</from> and <to>...</to> tags.

What I want to do is extract all the data between the outermost tags, using the <from> and <to> values as a description of each data block. I also want to separate the innermost blocks from each other and save the data in a way that makes it clear which parent block each inner block belongs to.

I don't have an exact idea of how I want to save the data, so any examples are appreciated!

I'm testing this with the Python module BeautifulSoup and have searched and read a lot of examples here, but I haven't found anything that points me in the right direction.

Thanks!

daedalus · Accepted Answer

The fact that you are reusing the same tag name for the container of your records as well as for an element inside it creates problems. BeautifulSoup is forgiving of such issues, and here is an approach you can use if you cannot go back and change the XML structure.

Assign the data to a variable. This may be read in from a text file, of course:

data = '''
    A
    B
    
        EXAMPLE ONE
        
            1
            2
        
        
            
            
            
                
           
        
    
    
        EXAMPLE TWO
        
            3
            4
        
        
            
            
            
                
           
        
   


    C
    D
    
        EXAMPLE
        
            1
            2
        
        
            
            
            
                
           
        
    
 '''

Process the data:

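# BeautifulSoup 3 import (Python 2); with the bs4 package, use
# "from bs4 import BeautifulSoup" instead.
from BeautifulSoup import BeautifulSoup
from pprint import pprint

store = {}   # maps a (from, to) tuple to the list of records under it
key = ()     # (from, to) of the most recent outer block

soup = BeautifulSoup(data)

recs = soup.findAll('data')

for rec in recs:
    if rec.find('from'):
        # Outer block: remember its (from, to) pair as the current key.
        key = (rec.find('from').text,
               rec.find('to').text)
    else:
        # Inner record: collect its fields and file it under the current key.
        info = rec.find('info')
        item = {}
        item['name'] = rec.find('name').text
        item['some_data'] = info.find('some_data').text
        item['more_data'] = info.find('more_data').text
        store.setdefault(key, []).append(item)

pprint(store)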

And the result with this dummy data:

{(u'A', u'B'): [{'more_data': u'2',
                 'name': u'EXAMPLE ONE',
                 'some_data': u'1'},
                {'more_data': u'4',
                 'name': u'EXAMPLE TWO',
                 'some_data': u'3'}],
 (u'C', u'D'): [{'more_data': u'2', 'name': u'EXAMPLE', 'some_data': u'1'}]}
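Since the question also asks how the grouped data could be saved, here is a minimal Python 3 sketch of one option. It is not the method above: it swaps BeautifulSoup for the standard library's xml.etree.ElementTree so that it runs without third-party packages, and it assumes the input is well-formed apart from having several top-level blocks. The tag names (data, from, to, name, info, some_data, more_data) are taken from the code above; the SAMPLE string, the group_records helper, and the "items" key are hypothetical names used only for illustration.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sample input; tag names follow the answer's code and the
# real file may contain additional elements that are ignored here.
SAMPLE = """
<data>
    <from>A</from>
    <to>B</to>
    <data>
        <name>EXAMPLE ONE</name>
        <info>
            <some_data>1</some_data>
            <more_data>2</more_data>
        </info>
    </data>
    <data>
        <name>EXAMPLE TWO</name>
        <info>
            <some_data>3</some_data>
            <more_data>4</more_data>
        </info>
    </data>
</data>
<data>
    <from>C</from>
    <to>D</to>
    <data>
        <name>EXAMPLE</name>
        <info>
            <some_data>1</some_data>
            <more_data>2</more_data>
        </info>
    </data>
</data>
"""


def group_records(xml_text):
    """Return one dict per outer <data> block, with its inner records nested."""
    # Several top-level <data> blocks are not a single XML document, so wrap
    # them in a synthetic root element before parsing. This assumes the text
    # has no XML declaration of its own.
    root = ET.fromstring("<root>" + xml_text + "</root>")
    groups = []
    for outer in root.findall("data"):
        items = []
        # findall("data") matches direct children only, i.e. the inner records.
        for inner in outer.findall("data"):
            items.append({
                "name": inner.findtext("name"),
                "some_data": inner.findtext("info/some_data"),
                "more_data": inner.findtext("info/more_data"),
            })
        groups.append({
            "from": outer.findtext("from"),
            "to": outer.findtext("to"),
            "items": items,
        })
    return groups


groups = group_records(SAMPLE)

# Lists and dicts serialize directly to JSON; tuple keys such as
# ('A', 'B') would not, which is why from/to are stored as fields.
print(json.dumps(groups, indent=2))
```

Storing from and to as ordinary fields next to a list of items keeps the parent/child relationship explicit and lets the result be written to a file with json.dump if that suits your use case.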
