Reputation: 719
I have a dataframe like this,
df_nba = pd.DataFrame({'col1': ['name', 'season',np.nan,'points','position','pages',
'name', 'season',np.nan,'points','position','pages',
'name', 'season',np.nan,'points','position','pages',
'name', 'season',np.nan,'points','position','pages',
'name', 'season',np.nan,'points','pages',
'name', 'season',np.nan,'points','position','pages',
'name', 'season',np.nan,'points','position','pages',
'name', 'season',np.nan,'pages'
],
'col2': ['kobe', np.nan,'96-97',7.6,'SF',1,
'kobe', np.nan,'97-98',15.4,'SF',2,
'kobe', np.nan,'98-99',19.9,'SG',3,
'kobe', np.nan,'99-00',22.5,'SG',4,
'lebron', np.nan,'03-04',20.9,5,
'lebron', np.nan,'04-05',27.2,'SF',6,
'lebron', np.nan,'05-06',31.4,'SF',7,
'lebron', np.nan,'06-07',8
]})
col1 is the variable name and col2 is the value corresponding to the left side.
col1 col2
0 name kobe
1 season NaN
2 NaN 96-97
3 points 7.6
4 position SF
5 pages 1
6 name kobe
7 season NaN
8 NaN 97-98
9 points 15.4
10 position SF
11 pages 2
12 name kobe
13 season NaN
14 NaN 98-99
15 points 19.9
16 position SG
17 pages 3
18 name kobe
19 season NaN
20 NaN 99-00
21 points 22.5
22 position SG
23 pages 4
24 name lebron
25 season NaN
26 NaN 03-04
27 points 20.9
28 pages 5
29 name lebron
30 season NaN
31 NaN 04-05
32 points 27.2
33 position SF
34 pages 6
35 name lebron
36 season NaN
37 NaN 05-06
38 points 31.4
39 position SF
40 pages 7
41 name lebron
42 season NaN
43 NaN 06-07
44 pages 8
Can I have the data by transpose the variable name to different columns? The output will be like,
Page name season points position
1 kobe 96-97 7.6 SF
2 kobe 97-98 15.4 SF
3 kobe 98-99 19.9 SG
4 kobe 99-00 22.5 SG
5 lebron 03-04 20.9 NaN
6 lebron 04-05 27.2 SF
7 lebron 05-06 31.4 SF
8 lebron 06-07 NaN NaN
Things I have tried is using group by pages to re-arrange the dataframe,
df_new = pd.DataFrame(
df_nba.groupby((df_nba['col1'] == 'pages').cumsum())['col2'].agg(list).to_list(),
columns=['name', 'xxx', 'season', 'points', 'position','page'] ).drop(columns='xxx')
But it seems output is not correct. I don't know how to handle the missing data in page 5 and page 8. Because it is not just missing values but the variable name also not exist in original dataset.
Upvotes: 1
Views: 343
Reputation: 323246
Check sorted
+ cumcount
then pivot
df = df.apply(lambda x : sorted(x,key=pd.notnull)).dropna()
df['New'] = df.groupby('col1').cumcount()
out = df.pivot(index = 'New', columns = 'col1', values = 'col2')
Out[148]:
col1 name pages points position season
New
0 kobe 1 7.6 SF 96-97
1 kobe 2 15.4 SF 97-98
2 kobe 3 19.9 SG 98-99
3 kobe 4 22.5 SG 99-00
4 lebron 5 20.9 SF 03-04
5 lebron 6 27.2 SF 04-05
6 lebron 7 31.4 NaN 05-06
7 lebron 8 NaN NaN 06-07
Upvotes: 3
Reputation: 15478
I will use df.pivot()
and reset_index:
df.pivot(columns='col1',values='col2').reset_index()
Upvotes: 0