JonDoe
JonDoe

Reputation: 107

Transform Excel two column header from columsn into rows dataframe python

I have a code to transofrm an excel file with data whioch has two headers into a dataframe. Somehow my code is not doing that what i expect:

import pandas as pd

df = pd.read_excel('excel_file.xlsx', sheet_name='Class', header=None, names= header_names, skiprows=2)

somehow I receive a weird output:

NaN NaN NaT        NaN      NaN NaT     
                  3  NaN NaT   
2015-06-30        4  NaN

etc.

I dont know what I am doing wrong.

My excel file is structured as the following but has even more columns.

A                   B                   C   
math                biology             physics 
                            
01.06.2015  3       01.06.2015  3       01.06.2015  3
30.06.2016  4       30.06.2016  4       30.06.2016  4
15.06.2017  1       15.06.2017  3       15.06.2017  3
20.06.2018  3       20.06.2018  2       20.06.2018  5
15.06.2019  2       15.06.2019  3       15.06.2019  2

The expected dataframe output should be like this:

date        grade   course  group
01.06.2015  3       math    A
30.06.2016  4       math    A
15.06.2017  1       math    A
20.06.2018  3       math    A
15.06.2019  2       math    A
01.06.2015  3       biology B
30.06.2016  4       biology B
15.06.2017  3       biology B
20.06.2018  2       biology B
15.06.2019  3       biology B
01.06.2015  3       physics C
30.06.2016  4       physics C
15.06.2017  3       physics C
20.06.2018  5       physics C
15.06.2019  2       physics C

Upvotes: 0

Views: 47

Answers (1)

Maciek Woźniak
Maciek Woźniak

Reputation: 455

just by reading this data you won't get the output you showed. Also, you skip 2 rows which have information about the course and the group. What I would do is:

import pandas as pd
df = pd.read_excel('stack.xlsx',header=None, skiprows=3)

df = pd.DataFrame(df)

0   1   2   3   4   5
0   01.06.2015  3   01.06.2015  3   01.06.2015  3

Now, changing names of columns

df.columns=['math A','grade A', 'Bio B','grade B','physic C', 'grade C']


math A  grade A Bio B   grade B physic C    grade C
0   01.06.2015  3   01.06.2015  3   01.06.2015  3

After that, you can iterate through the lines and create a new data frame with the output you want :)

3 loops like this and you will get what you want, just remember that you cannot initialize with i = 0, but i you will start the next loop, is the i you finished the prev loop

df_wanted = pd.DataFrame(columns=['date','grade','course','group'])
i = 0
for index,row in df[['math A','grade A']].iterrows():
    df_wanted.loc[i,'date'] = row[0]
    df_wanted.loc[i,'grade'] = row[1]
    df_wanted.loc[i,'course'] = 'math'
    df_wanted.loc[i,'group'] = 'A'
    i+=1

date    grade   course  group
0   01.06.2015  3   math    A

Upvotes: 1

Related Questions