abdfahim

Reputation: 2553

Optimized use of Python Dictionary

I have a large XML file that I need to convert to tab-delimited format, as shown below.

So far I can convert each block into a separate tab-delimited file. My challenge now is to produce one combined file containing all the data.

To do that, I was thinking of using a Python dictionary, filling it in a loop, and later writing that dictionary out to a file. For example,

d['x'] = {'c1': 'x1'}
d['x']['c2'] = 'x2'
d['y'] = {'c1': 'y1'}
..................
..................

But I am worried about memory, because I might have thousands of names with hundreds of columns.

Does anybody have a better idea?

XML FORMAT

<item>
    <col>c1</col>
    <col>c2</col>
    <col>c3</col>
    <mh>
        <name>x</name>
        <val>x1</val>
        <val>x2</val>
        <val>x3</val>
    </mh>
    <mh>
        <name>y</name>
        <val>y1</val>
        <val>y2</val>
        <val>y3</val>
    </mh>
    <mh>
        <name>z</name>
        <val>z1</val>
        <val>z2</val>
        <val>z3</val>
    </mh>
</item>
<item>
    <col>c4</col>
    <col>c5</col>
    <mh>
        <name>x</name>
        <val>x4</val>
        <val>x5</val>
    </mh>
    <mh>
        <name>y</name>
        <val>y4</val>
        <val>y5</val>
    </mh>
    <mh>
        <name>z</name>
        <val>z4</val>
        <val>z5</val>
    </mh>
</item>

MY CURRENT OUTPUT

FILE1:
name    |   c1  |   c2  |   c3  
x       |   x1  |   x2  |   x3  
y       |   y1  |   y2  |   y3  
z       |   z1  |   z2  |   z3  
FILE2:
name    |   c4  |   c5
x       |   x4  |   x5
y       |   y4  |   y5
z       |   z4  |   z5

MY INTENDED OUTPUT

name    |   c1  |   c2  |   c3  |   c4  |   c5
x       |   x1  |   x2  |   x3  |   x4  |   x5
y       |   y1  |   y2  |   y3  |   y4  |   y5
z       |   z1  |   z2  |   z3  |   z4  |   z5

Upvotes: 0

Views: 97

Answers (2)

Jon Betts

Reputation: 3328

It seems to me the core of the problem is that you can't write out the first line until you've read right to the end of your XML file.

There are a couple of ways to mitigate this, but the one that stands out to me is to ask: are your columns really rows? If your data looked like this:

name    | x  | y  | z
c1      | x1 | y1 | z1
c2      | x2 | y2 | z2
c3      | x3 | y3 | z3
...

Then you could write the rows to a file as soon as you hit the end of one block.
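A minimal sketch of that transposed layout, assuming each block has already been parsed into a list of column labels plus (name, values) pairs, and that every block lists the names in the same order. The helper name `write_block_transposed` and the sample data are only for illustration:

```python
import csv
import io

def write_block_transposed(writer, cols, rows):
    """Write one block with one output line per column label.

    cols: column labels of the block, e.g. ['c1', 'c2', 'c3']
    rows: (name, values) pairs, in the same name order for every block
    """
    for i, col in enumerate(cols):
        writer.writerow([col] + [values[i] for _, values in rows])

out = io.StringIO()  # stands in for a real output file
writer = csv.writer(out, delimiter="\t", lineterminator="\n")
writer.writerow(["name", "x", "y", "z"])

# Each block can be written and forgotten as soon as it has been read.
write_block_transposed(writer, ["c1", "c2", "c3"], [
    ("x", ["x1", "x2", "x3"]),
    ("y", ["y1", "y2", "y3"]),
    ("z", ["z1", "z2", "z3"]),
])
write_block_transposed(writer, ["c4", "c5"], [
    ("x", ["x4", "x5"]),
    ("y", ["y4", "y5"]),
    ("z", ["z4", "z5"]),
])
print(out.getvalue())
```

Nothing from an earlier block has to be kept around, which is the whole appeal of this layout.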

Assuming, however, that you must have the format you specified, and that memory really is an issue, there are a few things you can do to help:

Use lists, not dicts

Instead of having:

d['x'] = { 'c1': 'x1', 'c2': 'x2', ... }
d['y'] = { 'c1': 'y1', 'c2': 'y2', ... }
...

Have:

d['names'] = [ 'c1', 'c2', ... ]
d['x']     = [ 'x1', 'x2', ... ]
d['y']     = [ 'y1', 'y2', ... ]
...

You don't end up repeating the keys a lot of times, and the data reflects how you want to write it out. The savings are pretty minimal, but it would be easier to make the CSV.
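As a rough sketch of the list-based layout: the code below keeps the column labels in a separate list rather than under a 'names' key (a small deviation, so that a row that happens to be called "names" cannot collide with it). The helpers `add_block` and `write_table` are hypothetical, and the sketch assumes every name appears in every block, otherwise the columns would drift out of alignment:

```python
import csv
import io

def add_block(columns, table, cols, rows):
    """Append one block's column labels and values to the running lists."""
    columns.extend(cols)
    for name, values in rows:
        table.setdefault(name, []).extend(values)

def write_table(columns, table, out):
    """Write the header line followed by one tab-delimited line per name."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["name"] + columns)
    for name, values in table.items():
        writer.writerow([name] + values)

columns, table = [], {}
add_block(columns, table, ["c1", "c2", "c3"], [
    ("x", ["x1", "x2", "x3"]),
    ("y", ["y1", "y2", "y3"]),
    ("z", ["z1", "z2", "z3"]),
])
add_block(columns, table, ["c4", "c5"], [
    ("x", ["x4", "x5"]),
    ("y", ["y4", "y5"]),
    ("z", ["z4", "z5"]),
])

out = io.StringIO()  # stands in for a real output file
write_table(columns, table, out)
print(out.getvalue())
```

Because each list is already in output order, writing a row is just a matter of prepending the name.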

Use a streaming parser

Chances are that with most parsers you already have the full XML document loaded into memory, and that will probably dwarf the extracted data even if you held a second copy of it. Have a look at a streaming XML parser, which moves through the file and keeps only the part you are currently looking at in memory.

You register rules about what to do when each component is seen. So, for example, if you see <item> you know you have to reset your columns and expect <col> in the near future. The downside is that streaming parsers are usually harder to work with.
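One way to try this with the standard library is `xml.etree.ElementTree.iterparse`, which is incremental rather than fully callback-based. The sketch below assumes the <item> elements sit inside a single root element (called <items> here purely for illustration, since the excerpt in the question shows no root) and the generator name `iter_items` is made up:

```python
import io
import xml.etree.ElementTree as ET

def iter_items(source):
    """Yield (cols, rows) for each <item> as soon as its end tag is seen."""
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "item":
            cols = [c.text for c in elem.findall("col")]
            rows = [
                (mh.findtext("name"), [v.text for v in mh.findall("val")])
                for mh in elem.findall("mh")
            ]
            yield cols, rows
            # Free the children of the item we just handled; only an
            # emptied <item> stub stays attached to the root.
            elem.clear()

# Small sample with an assumed <items> root element.
SAMPLE = b"""<items>
<item><col>c1</col><col>c2</col>
  <mh><name>x</name><val>x1</val><val>x2</val></mh>
  <mh><name>y</name><val>y1</val><val>y2</val></mh>
</item>
<item><col>c3</col>
  <mh><name>x</name><val>x3</val></mh>
  <mh><name>y</name><val>y3</val></mh>
</item>
</items>"""

blocks = list(iter_items(io.BytesIO(SAMPLE)))
for cols, rows in blocks:
    print(cols, rows)
```

For a real file you would pass the filename (or an open binary file) instead of the `io.BytesIO` wrapper.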

Parse it multiple times

The holy grail is to never have the data in memory all at once at all. You could achieve this by parsing the file once for each row.

The first parse would write out only the names row, the second x etc. This might be slower, but it means you store the absolute minimum in memory at any given time. Combine this with a streaming parser and you could parse gigabytes (albeit slowly).
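A sketch of the multi-pass idea, again using `xml.etree.ElementTree.iterparse` and again assuming a single wrapping root element (<items> is an invented name). The first pass writes the header and collects the distinct names, so that small set of names is still held in memory; each later pass streams one row straight to the output. `open_source` is a hypothetical callable that returns a fresh source for every pass:

```python
import io
import xml.etree.ElementTree as ET

def iter_items(source):
    """Yield each <item> element, clearing it once the caller is done."""
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "item":
            yield elem
            elem.clear()

def write_combined(open_source, out):
    # Pass 1: stream the header line and remember the names in first-seen order.
    names = {}
    out.write("name")
    for item in iter_items(open_source()):
        for col in item.findall("col"):
            out.write("\t" + (col.text or ""))
        for mh in item.findall("mh"):
            names.setdefault(mh.findtext("name"), None)
    out.write("\n")
    # One further pass per name: values go straight to the output.
    for name in names:
        out.write(name)
        for item in iter_items(open_source()):
            for mh in item.findall("mh"):
                if mh.findtext("name") == name:
                    for val in mh.findall("val"):
                        out.write("\t" + (val.text or ""))
        out.write("\n")

SAMPLE = b"""<items>
<item><col>c1</col><col>c2</col>
  <mh><name>x</name><val>x1</val><val>x2</val></mh>
  <mh><name>y</name><val>y1</val><val>y2</val></mh>
</item>
<item><col>c3</col>
  <mh><name>x</name><val>x3</val></mh>
  <mh><name>y</name><val>y3</val></mh>
</item>
</items>"""

out = io.StringIO()  # stands in for a real output file
write_combined(lambda: io.BytesIO(SAMPLE), out)
print(out.getvalue())
```

For a file on disk, `open_source` could simply return the filename, since `iterparse` accepts a path and opens the file itself. The cost is one full read of the XML per name, so this trades time for memory.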

Upvotes: 1

Eli

Reputation: 38919

You don't need to do the entire conversion at once. You can just do this item by item. Read in all the XML up to the end of the first item, convert that to the tab delimited format, and write to file. Then do the next one. That way, you'll never have more than one item in memory.

Upvotes: 0
