jhaywoo8

Reputation: 767

Parsing XML into a dataframe

I am having some trouble parsing some XML. This is what the XML looks like.

<listing>
   <seller_info>
       <seller_name> cubsfantony</seller_name>
       <seller_rating> 848</seller_rating>
   </seller_info>
   <payment_types>Visa/MasterCard, Money Order/Cashiers Checks, Personal Checks, See item description for payment methods accepted
   </payment_types>
   <shipping_info>Buyer pays fixed shipping charges, Will ship to United States only
   </shipping_info>
   <buyer_protection_info>
   </buyer_protection_info>
   <auction_info>
     <current_bid>$620.00 </current_bid>
     <time_left> 4 days, 14 hours +  </time_left>
     <high_bidder> 
        <bidder_name> [email protected] </bidder_name>
        <bidder_rating>-2 </bidder_rating>
     </high_bidder>
     <num_items>1 </num_items>
     <num_bids>  12</num_bids>
     <started_at>$1.00 </started_at>
     <bid_increment> </bid_increment>
     <location> USA/Chicago</location>
     <opened> Nov-27-00 04:57:50 PST</opened>
     <closed> Dec-02-00 04:57:50 PST</closed>
     <id_num> 511601118</id_num>
     <notes>  </notes>
   </auction_info>
   <bid_history>
       <highest_bid_amount>$620.00   </highest_bid_amount>
       <quantity> 1</quantity>
   </bid_history>
   <item_info>
      <memory> 256MB PC133 SDram</memory>
      <hard_drive> 30 GB 7200 RPM IDE Hard Drive</hard_drive>
      <cpu>Pentium III 933 System  </cpu>
      <brand> </brand>
      <description> NEW Pentium III 933 System - 133 MHz BUS Speed Pentium Motherboard.....
      </description>
   </item_info>
</listing>

This is my code. I want to take the text between the tags and put it into a pandas DataFrame. There are about 20 listings in the full XML. For now, I'm just trying to work out how to extract the text by tag name, but I'm not sure how to go about it.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

from lxml import etree


ebay = etree.parse('ebay.xml') 
tree = ebay.getroot()


for child in tree:
    for element in child:
        person_dict = {}
        for more in element:
            if more.text != None:
                person_dict[more] = more.text.strip

Upvotes: 0

Views: 112

Answers (1)

titipata

Reputation: 5389

Here is an example of how to parse a single listing. If you have multiple listings, you can loop over them with a for loop and apply the same logic to each one.

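from lxml import etree

# etree.parse returns an ElementTree, which has no getchildren();
# take the root <listing> element and iterate over it directly.
listing = etree.parse('ebay.xml').getroot()

d = {}
for e in listing:      # e.g. seller_info, auction_info
    for c in e:        # e.g. seller_name, high_bidder
        if len(c) == 0:
            # Leaf element: store its text under its tag name
            d[c.tag] = c.text
        else:
            # One more level of nesting, e.g. high_bidder/bidder_name
            for ce in c:
                d[ce.tag] = ce.text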

From here, you can append d to a list and then use pandas to convert that list of dictionaries into a DataFrame.
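The append-and-convert step can be sketched roughly as follows. This is only an illustration under some assumptions: the listings are assumed to sit inside a single wrapper element (called `root` here, which may not match your file), the sample data is shortened and the second listing is made up, and the helper name `listing_to_dict` is hypothetical. It uses the standard library's `xml.etree.ElementTree` so that it runs without extra installs; `fromstring`, `iter`, `tag` and `text` are also available in `lxml.etree`, so it should carry over.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Hypothetical sample: two shortened listings wrapped in an assumed <root> element.
SAMPLE = """
<root>
  <listing>
    <seller_info>
      <seller_name> cubsfantony</seller_name>
      <seller_rating> 848</seller_rating>
    </seller_info>
    <auction_info>
      <current_bid>$620.00 </current_bid>
      <high_bidder>
        <bidder_rating>-2 </bidder_rating>
      </high_bidder>
    </auction_info>
  </listing>
  <listing>
    <seller_info>
      <seller_name> seller_two</seller_name>
      <seller_rating> 10</seller_rating>
    </seller_info>
    <auction_info>
      <current_bid>$5.00 </current_bid>
      <high_bidder>
        <bidder_rating>3 </bidder_rating>
      </high_bidder>
    </auction_info>
  </listing>
</root>
"""


def listing_to_dict(listing):
    """Map each leaf tag of one <listing> to its whitespace-stripped text."""
    row = {}
    for elem in listing.iter():
        if len(elem) == 0:  # leaf element: it has no child elements
            row[elem.tag] = (elem.text or "").strip()
    return row


root = ET.fromstring(SAMPLE)

# One dictionary per listing, collected in a list.
rows = [listing_to_dict(listing) for listing in root.iter("listing")]

# Each dictionary becomes one row; the tag names become the columns.
df = pd.DataFrame(rows)
print(df)
```

Because the dictionary is keyed by tag name alone, two leaf elements with the same tag inside one listing would overwrite each other; if that happens in your data, you may want to prefix the key with the parent's tag.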

The output looks like the following:

{'bid_increment': ' ',
 'bidder_name': ' [email protected] ',
 'bidder_rating': '-2 ',
 'brand': ' ',
  ...
 'seller_name': ' cubsfantony',
 'seller_rating': ' 848',
 'started_at': '$1.00 ',
 'time_left': ' 4 days, 14 hours +  '}

Upvotes: 1
