parse many xml objects and write them into a text file (csv, tab-del, ...) using python

Question

I have many XML objects of the following format:


  2012-10-25T03:09:50.817Z
  Success
  E795_CORE_BUNDLED_15430047_R1
  795
  
     ...
     330810813385
     2012-10-25T04:32:37.000Z
     Paypal Prefered
     ...
     ...
     ...
     177
     
     Computers/Tablets & Networking:Laptops & Netbooks:PC Laptops & Netbooks
     
     2
     294.99
     Active
     PT1H22M47S
     
     HP Compaq ZD8000 3800Mhz Full Loaded Ready to go, nice unit & super fast Laptop
     
     
     23.99
     Flat
     23.99
     
     
        
           Operating System
           Windows XP Professional
        
        
           Screen Size
           17.0
        
        
           Processor Type
           Intel Pentium 4 HT
        
     
     US
     false
     2500
     Seller refurbished

For each xml object I'd like to obtain all the item tag tags, such as itemid, endtime, etc.. and also all the item specifics tag tags, such as Operating System, Screen Size, etc. I'd like to save this into memory for each xml object into an appropriate data structure (object). Finally, I'd like to write all the information for all the xml objects into a csv file.

The difficulty is that a priori I do NOT know know what the columns (header) for the csv file will be. For the first xml object I'd create as many columns as the number of subtags the item and item specifics combined have.

Then I'd add more and more columns as new columns appear for new items, adding NAs for the columns that haven't appeared before.

I am looking for advice on how to process the xml objects, which data structures to convert (save) the xml objects, and also how to write all the finally processed xml data into a csv file.

Thanks.

Pykler · Accepted Answer

For each row in your csv you should create a dictionary. As you parse the xml, you should populate this dictionary for each from your snippet. While you do that you should keep a set of keys i.e. columns ... that way by the end of the file you will know how many columns you have and their titles.

Here is a little snippet of how this can be done (if the data fits in memory), I will be using BeautifulSoup since you mentioned it in your tags and its awesome:

import sys
import csv

from BeautifulSoup import BeautifulSoup as Soup

doc = Soup(xml_string)
data = []
cols = set()
for item in doc.findAll('item'):
    d = {}
    for sub in item:
        if hasattr(sub, 'name'):
            d[sub.name] = sub.text
    data.append(d)
    cols = cols.union(d.keys())

cw = csv.writer(sys.stdout)
cw.writerow(cols)
for row in data:
    cw.writerow([row.get(k, 'N/A') for k in cols])

Note, that this solution assumes that your keys are unique, but in your example the item has two picture urls, if you want both to show up it is possible (since nothing is impossible) just a little more complex.

If the data does not fit in memory, you will need to do a two pass, the first pass to gather the keys and the second to print the csv ... note, you should replace BeautifulSoup with another parser like sax in this case since the data does not fit in memory

parse many xml objects and write them into a text file (csv, tab-del, ...) using python

Answers (1)

Related Questions