Reputation: 2904
I'm trying to read one excel file which looks like below:
I also have one script which will convert this xlsx file into the csv files with its sheet names (If three sheets available then it will create three different csv files).
It's csv file looks like below:
Unnamed: 0,Gender A,Unnamed: 2,Gender B,Unnamed: 4,Gender C,Gender D
date,Male,Female,Male,Female,Male,Female
2017-01-01 00:00:00,2,3,3,2,3,3
2017-01-02 00:00:00,5,7,7,42,3,5
2017-01-03 00:00:00,4,6,6,12,2,7
2017-01-04 00:00:00,6,7,3,6,4,8
2017-01-05 00:00:00,6,8,8,3,5,3
2017-01-06 00:00:00,54,3,3,6,3,5
2017-01-07 00:00:00,3,4,6,3,6,5
2017-01-08 00:00:00,3,6,6,3,6,4
2017-01-09 00:00:00,2,2,8,7,5,2
2017-01-10 00:00:00,4,3,2,4,5,5
2017-01-11 00:00:00,12,10,10,3,1,6
2017-01-12 00:00:00,9,7,7,3,4,1
So, my first question is which is better choice to process these files - xlsx or csv?
Next, I just want to read first two rows as a column header. So that I can understand In which Gender how many males and females are available.
Expected Output:
0 date Gender A_Male Gender A_Female Gender B_Male Gender B_Female Gender C_Male Gender D_Female
1 2017-01-01 00:00:00 2 3 3 2 3 3
2 2017-01-02 00:00:00 5 7 7 42 3 5
3 2017-01-03 00:00:00 4 6 6 12 2 7
4 2017-01-04 00:00:00 6 7 3 6 4 8
5 2017-01-05 00:00:00 6 8 8 3 5 3
6 2017-01-06 00:00:00 54 3 3 6 3 5
7 2017-01-07 00:00:00 3 4 6 3 6 5
8 2017-01-08 00:00:00 3 6 6 3 6 4
9 2017-01-09 00:00:00 2 2 8 7 5 2
10 2017-01-10 00:00:00 4 3 2 4 5 5
11 2017-01-11 00:00:00 12 10 10 3 1 6
12 2017-01-12 00:00:00 9 7 7 3 4 1
Upvotes: 14
Views: 20337
Reputation: 19259
I love @ScottBoston's approach. Here are some cosmetic alternatives. If you want your column headers to look pretty, especially when the second row contains the units for the quantity, you can do something like this:
df = pd.read_excel('Untitled 2.xlsx', header=[0,1], index_col=0)
df.columns = df.columns.map(lambda h: '{}\n({})'.format(h[0], h[1]))
df.rename_axis('Date')
And if you want to make sure your column names don't contain spaces (so you can access them as properties of the DataFrame):
df = pd.read_excel('Untitled 2.xlsx', header=[0,1], index_col=0)
df.columns = df.columns.map(lambda h: ' '.join(h).replace(' ', '_'))
df.rename_axis('Date')
Which gives:
Gender_A__Male Gender_A__Female ... Gender_C__Male Gender_D__Female
Date ...
2017-01-01 00:00:00 2 3 ... 3 3
2017-01-02 00:00:00 5 7 ... 3 5
2017-01-03 00:00:00 4 6 ... 2 7
2017-01-04 00:00:00 6 7 ... 4 8
2017-01-05 00:00:00 6 8 ... 5 3
2017-01-06 00:00:00 54 3 ... 3 5
2017-01-07 00:00:00 3 4 ... 6 5
2017-01-08 00:00:00 3 6 ... 6 4
2017-01-09 00:00:00 2 2 ... 5 2
2017-01-10 00:00:00 4 3 ... 5 5
2017-01-11 00:00:00 12 10 ... 1 6
2017-01-12 00:00:00 9 7 ... 4 1
Upvotes: 6
Reputation: 153460
Let's try:
df = pd.read_excel('Untitled 2.xlsx', header=[0,1])
df.columns = df.columns.map('_'.join)
df.rename_axis('Date').reset_index()
Output:
Date Gender A_Male Gender A_Female Gender B_Male Gender B_Female \
0 2017-01-01 2 3 3 2
1 2017-01-02 5 7 7 42
2 2017-01-03 4 6 6 12
3 2017-01-04 6 7 3 6
4 2017-01-05 6 8 8 3
5 2017-01-06 54 3 3 6
6 2017-01-07 3 4 6 3
7 2017-01-08 3 6 6 3
8 2017-01-09 2 2 8 7
9 2017-01-10 4 3 2 4
10 2017-01-11 12 10 10 3
11 2017-01-12 9 7 7 3
Gender C_Male Gender D_Female
0 3 3
1 3 5
2 2 7
3 4 8
4 5 3
5 3 5
6 6 5
7 6 4
8 5 2
9 5 5
10 1 6
11 4 1
Upvotes: 19