Reputation: 7832
I have many XML objects of the following format:
<GetSingleItemResponse xmlns="urn:ebay:apis:eBLBaseComponents">
<Timestamp>2012-10-25T03:09:50.817Z</Timestamp>
<Ack>Success</Ack>
<Build>E795_CORE_BUNDLED_15430047_R1</Build>
<Version>795</Version>
<Item>
<Description>...</Description>
<ItemID>330810813385</ItemID>
<EndTime>2012-10-25T04:32:37.000Z</EndTime>
<Location>Paypal Prefered</Location>
<GalleryURL>...</GalleryURL>
<PictureURL>...</PictureURL>
<PictureURL>...</PictureURL>
<PrimaryCategoryID>177</PrimaryCategoryID>
<PrimaryCategoryName>
Computers/Tablets & Networking:Laptops & Netbooks:PC Laptops & Netbooks
</PrimaryCategoryName>
<BidCount>2</BidCount>
<ConvertedCurrentPrice currencyID="USD">294.99</ConvertedCurrentPrice>
<ListingStatus>Active</ListingStatus>
<TimeLeft>PT1H22M47S</TimeLeft>
<Title>
HP Compaq ZD8000 3800Mhz Full Loaded Ready to go, nice unit & super fast Laptop
</Title>
<ShippingCostSummary>
<ShippingServiceCost currencyID="USD">23.99</ShippingServiceCost>
<ShippingType>Flat</ShippingType>
<ListedShippingServiceCost currencyID="USD">23.99</ListedShippingServiceCost>
</ShippingCostSummary>
<ItemSpecifics>
<NameValueList>
<Name>Operating System</Name>
<Value>Windows XP Professional</Value>
</NameValueList>
<NameValueList>
<Name>Screen Size</Name>
<Value>17.0</Value>
</NameValueList>
<NameValueList>
<Name>Processor Type</Name>
<Value>Intel Pentium 4 HT</Value>
</NameValueList>
</ItemSpecifics>
<Country>US</Country>
<AutoPay>false</AutoPay>
<ConditionID>2500</ConditionID>
<ConditionDisplayName>Seller refurbished</ConditionDisplayName>
</Item>
</GetSingleItemResponse>
For each xml object I'd like to obtain all the item tag tags, such as itemid, endtime, etc.. and also all the item specifics tag tags, such as Operating System, Screen Size, etc. I'd like to save this into memory for each xml object into an appropriate data structure (object). Finally, I'd like to write all the information for all the xml objects into a csv file.
The difficulty is that a priori I do NOT know know what the columns (header) for the csv file will be. For the first xml object I'd create as many columns as the number of subtags the item and item specifics combined have.
Then I'd add more and more columns as new columns appear for new items, adding NAs for the columns that haven't appeared before.
I am looking for advice on how to process the xml objects, which data structures to convert (save) the xml objects, and also how to write all the finally processed xml data into a csv file.
Thanks.
Upvotes: 1
Views: 1991
Reputation: 14855
For each row in your csv you should create a dictionary. As you parse the xml, you should populate this dictionary for each <Item>
from your snippet. While you do that you should keep a set of keys i.e. columns ... that way by the end of the file you will know how many columns you have and their titles.
Here is a little snippet of how this can be done (if the data fits in memory), I will be using BeautifulSoup since you mentioned it in your tags and its awesome:
import sys
import csv
from BeautifulSoup import BeautifulSoup as Soup
doc = Soup(xml_string)
data = []
cols = set()
for item in doc.findAll('item'):
d = {}
for sub in item:
if hasattr(sub, 'name'):
d[sub.name] = sub.text
data.append(d)
cols = cols.union(d.keys())
cw = csv.writer(sys.stdout)
cw.writerow(cols)
for row in data:
cw.writerow([row.get(k, 'N/A') for k in cols])
Note, that this solution assumes that your keys are unique, but in your example the item has two picture urls, if you want both to show up it is possible (since nothing is impossible) just a little more complex.
If the data does not fit in memory, you will need to do a two pass, the first pass to gather the keys and the second to print the csv ... note, you should replace BeautifulSoup with another parser like sax in this case since the data does not fit in memory
Upvotes: 2