Cyril Schmitt
Cyril Schmitt

Reputation: 47

Python and Pandas : XML --> DataFrame

I am right now working on a database and I would like to go from XML to a Pandas DataFrame and I've been stuck for a long time now. I have no idea how to solve this problem.

j=0
for rows in root.findall('row'):
    i=0
    for cells in root.findall('cell') in rows:
        if i==0: 
            #Name of the country is on the 0-th tag "cell" of each "row"
            country[j]=cells.text
        elif i==17: 
            #Number of students is on the 17-th tag "cell" of each "row" 
            numberStudent[j]=cells.text
        i=i+1
    j=j+1
Data=pd.DataFrame({'country': [country], 'number of student': [numberStudent]})

When I'm trying to read Data, there's only a dataframe with a value 0 for country and 0 for numberStudent. I don't understand what is wrong. I have been looking for answer already on this forum but I'm still stuck.

Also, I am not sure I am doing right. I would like to find the 0-th ans 17-th tag "cell" in each parent tag "row". Is it right to do use two times "in" in one declaration for the second for?

Upvotes: 0

Views: 438

Answers (1)

Jesse
Jesse

Reputation: 815

To find all the cells within a row you should call findall on the row in the inner loop, not on the root.

country = []
numberStudent = []
for row in root.findall('row'):
    i=0
    for cell in row.findall('cell'):
        if i==0:
            country.append(cell.text)
        if i==17:
            numberStudent.append(cell.text)
        i=i+1
data=pd.DataFrame({'country': country, 'number of student': numberStudent})

However, as written your code should produce an error, so I suspect you are not finding any row nodes either. If your row nodes are not a child of the root, you'll need to call root.findall('.//row'), though without seeing your xml, it's not possible to know if that's your issue.

One other alternative is that ElementTree has support for finding numbered elements, so you could also do

country = [cell.text for cell in root.findall('.//row/cell[1]')]
numberStudent = [cell.text for cell in root.findall('.//row/cell[18]')]
data=pd.DataFrame({'country': country, 'number of student': numberStudent})

root.findall('.//row/cell[n]') will find any cell element that is the nth child of a row element. Just beware, ElementTree uses one-based indexing instead of the standard python zero based indexing.

Upvotes: 1

Related Questions