Jiayu Zhang
Jiayu Zhang

Reputation: 719

How to convert data from row to column in python pandas?

I have a dataframe like this,

df_nba = pd.DataFrame({'col1': ['name', 'season',np.nan,'points','position','pages',
                     'name', 'season',np.nan,'points','position','pages',
                     'name', 'season',np.nan,'points','position','pages',
                     'name', 'season',np.nan,'points','position','pages',
                     'name', 'season',np.nan,'points','pages',
                     'name', 'season',np.nan,'points','position','pages',
                     'name', 'season',np.nan,'points','position','pages',
                     'name', 'season',np.nan,'pages'
                               ], 
             'col2': ['kobe', np.nan,'96-97',7.6,'SF',1,
                      'kobe', np.nan,'97-98',15.4,'SF',2,
                      'kobe', np.nan,'98-99',19.9,'SG',3,
                      'kobe', np.nan,'99-00',22.5,'SG',4,
                     'lebron', np.nan,'03-04',20.9,5,
                     'lebron', np.nan,'04-05',27.2,'SF',6,
                     'lebron', np.nan,'05-06',31.4,'SF',7,
                     'lebron', np.nan,'06-07',8
                     ]})  

col1 is the variable name and col2 is the value corresponding to the left side.

col1       col2
    0   name       kobe
    1   season      NaN
    2   NaN        96-97
    3   points      7.6
    4   position    SF
    5   pages        1
    6   name       kobe
    7   season      NaN
    8   NaN        97-98
    9   points     15.4
    10  position    SF
    11  pages        2
    12  name       kobe
    13  season      NaN
    14  NaN        98-99
    15  points     19.9
    16  position    SG
    17  pages        3
    18  name       kobe
    19  season      NaN
    20  NaN        99-00
    21  points     22.5
    22  position    SG
    23  pages        4
    24  name      lebron
    25  season      NaN
    26  NaN        03-04
    27  points     20.9
    28  pages        5
    29  name      lebron
    30  season      NaN
    31  NaN        04-05
    32  points     27.2
    33  position    SF
    34  pages        6
    35  name      lebron
    36  season      NaN
    37  NaN        05-06
    38  points     31.4
    39  position    SF
    40  pages        7
    41  name      lebron
    42  season      NaN
    43  NaN        06-07
    44  pages        8

Can I have the data by transpose the variable name to different columns? The output will be like,

Page   name       season   points   position
1      kobe       96-97     7.6       SF
2      kobe       97-98    15.4       SF
3      kobe       98-99    19.9       SG
4      kobe       99-00    22.5       SG
5      lebron     03-04    20.9      NaN
6      lebron     04-05    27.2       SF
7      lebron     05-06    31.4       SF
8      lebron     06-07     NaN      NaN

Things I have tried is using group by pages to re-arrange the dataframe,

df_new = pd.DataFrame( 
        df_nba.groupby((df_nba['col1'] == 'pages').cumsum())['col2'].agg(list).to_list(), 
        columns=['name', 'xxx', 'season', 'points', 'position','page'] ).drop(columns='xxx')

But it seems output is not correct. I don't know how to handle the missing data in page 5 and page 8. Because it is not just missing values but the variable name also not exist in original dataset.

Upvotes: 1

Views: 343

Answers (2)

BENY
BENY

Reputation: 323246

Check sorted + cumcount then pivot

df = df.apply(lambda x : sorted(x,key=pd.notnull)).dropna()
df['New'] = df.groupby('col1').cumcount()
out = df.pivot(index = 'New', columns = 'col1', values = 'col2')
Out[148]: 
col1    name pages points position season
New                                      
0       kobe     1    7.6       SF  96-97
1       kobe     2   15.4       SF  97-98
2       kobe     3   19.9       SG  98-99
3       kobe     4   22.5       SG  99-00
4     lebron     5   20.9       SF  03-04
5     lebron     6   27.2       SF  04-05
6     lebron     7   31.4      NaN  05-06
7     lebron     8    NaN      NaN  06-07

Upvotes: 3

wasif
wasif

Reputation: 15478

I will use df.pivot() and reset_index:

df.pivot(columns='col1',values='col2').reset_index()

Upvotes: 0

Related Questions