codeclash
codeclash

Reputation: 2063

Transforming pandas dataframe - convert some row values to columns

I have a pandas dataframe of the form

advert_id     run_id         category       score
    11111111         78              842    0.356975
    11111111         78              849    0.245583
    11111111         78              950    0.089219
    11111111         78             1645    0.089172
    11111111         78             2494    0.044254
        ...        ...              ...         ...
    22222222    1083267             2521    0.078275
    22222222    1083267             2553    0.121556
    22222222    1083267             2872    0.039226
    22222222    1083267             3045    0.362127
    22222222    1083267             3049    0.040135

And would like to transform it to a dataframe of the form (one row now per advert_id):

advert_id     run_id      category_1     score_1    category_2     score_2     category_3     score_3  ... category_n  score_n
 11111111         78             842    0.356975           849    0.245583            950    0.089219  ... 
 22222222    1083267            2521    0.078275          2553    0.121556           2872    0.039226  ...

The number of category per advert can vary, some adverts may have 1..n categories.

Is there an elegant way to do this with python/pandas other than grouping the dataframe and "manually" iterating over the groups and populating a separate dataframe?

Upvotes: 2

Views: 255

Answers (2)

nick
nick

Reputation: 1360

You could use pivot to change the dataframe to the wide format. As a raw output, the pivot will leave you with a hierarchical index on the columns but you can flatten that to make the columns look like you wanted above.

# add a key as in solution above from BEN_YO
df['temp_key']=(df.groupby('advert_id').cumcount()+1)
# do the pivot
df = df.pivot(index='advert_id', columns="temp_key", values=["category", "score"])
# make the columns look as expected
df.columns = [' '.join(col).strip() for col in df.columns.values]
print(df.head())

The result is as desired:

           category 1  category 2  category 3  ...    score 4   score 5 
advert_id                                      ...                      
11111111        842.0       849.0       950.0  ...    0.089172  0.044254 
22222222       2521.0      2553.0      2872.0  ...    0.362127  0.040135

Upvotes: 2

BENY
BENY

Reputation: 323386

After create the additional key with cumcount

df['key2']=(df.groupby('advert_id').cumcount()+1)

s=df.set_index(['advert_id','run_id','key2']).unstack().sort_index(level=1,axis=1)
s.columns=s.columns.map('{0[0]}_{0[1]}'.format)
s
Out[59]: 
                   category_1   score_1  ...  category_5   score_5
advert_id run_id                         ...                      
11111111  78              842  0.356975  ...        2494  0.044254
22222222  1083267        2521  0.078275  ...        3049  0.040135
[2 rows x 10 columns]

Upvotes: 6

Related Questions