winnie
winnie

Reputation: 109

Turning groupby into single row with new columns

I want to be able to turn a groupby into a single row, but the values of a second column in that groupby to be aggregated into new columns or -99, if there isn't sufficient data.

After we group by session_id with this input:

             user_id     session_id   timestamp  step  impressions   n_clicks
0       004A07DM0IDW  1d688ec168932  1541555799     7      2059240        5.0
1       004A07DM0IDW  1d688ec168932  1541555799     7      2033381        3.0
2       004A07DM0IDW  1d688ec168932  1541555799     7      1724779        4.0
3       004A07DM0IDW  1d688ec168932  1541555799     7       127131        2.0
4       004A07DM0IDW  1d688ec168932  1541555799     7       399441        1.0
5       004A07DM0IDW  1d688ec168932  1541555799     7       103357        3.0
6       004A07DM0IDW  1d688ec168932  1541555799     7       127132        3.0
7       004A07DM0IDW  1d688ec168932  1541555799     7      1167004        1.0
8       004A07DM0IDW  1d688ec168932  1541555799     7      4491766        4.0
9       004A07DM0IDW  1d688ec168932  1541555799     7      2249874        5.0
10      00Y1Z24X8084  26b6d294d66e7  1541651823     3      4476010        4.0
11      00Y1Z24X8084  26b6d294d66e7  1541651823     3      3843244        5.0

I'd like to produce this output

             user_id     session_id   timestamp  step  count_0 count_1 count_2 count... count_24
0       004A07DM0IDW  1d688ec168932  1541555799     7      5.0     3.0    4.0    2.0         -99
1       00Y1Z24X8084  26b6d294d66e7  1541555799     3      4.0     5.0    -99    -99         -99

What we're looking is that the user_id session_id timestamp step will always be the same for each row. However, the impressions are different. For each of the rows (there will be max 25), the value in the click column is mapped onto a count_x, however, if the number of rows is not sufficient, the subsequent values will take -99.

Since there are 10 rows in the first groupby frame, it means that columns count_10 to count_24 will have -99 as the value. For the second groupby frame columns count_2 to count_24 will have -99.

Upvotes: 1

Views: 348

Answers (1)

jezrael
jezrael

Reputation: 862691

Use:

cols = ['user_id','session_id','timestamp','step']
df['g'] = df.groupby(cols).cumcount()
df = (df.set_index(cols + ['g'])['n_clicks']
        .unstack(fill_value=-99)
        .reindex(range(25), fill_value=-99, axis=1)
        .add_prefix('count_')
        .reset_index()
        .rename_axis(None, axis=1))
print (df)
        user_id     session_id   timestamp  step  count_0  count_1  count_2  \
0  004A07DM0IDW  1d688ec168932  1541555799     7      5.0      3.0      4.0   
1  00Y1Z24X8084  26b6d294d66e7  1541651823     3      4.0      5.0    -99.0   

   count_3  count_4  count_5  ...  count_15  count_16  count_17  count_18  \
0      2.0      1.0      3.0  ...       -99       -99       -99       -99   
1    -99.0    -99.0    -99.0  ...       -99       -99       -99       -99   

   count_19  count_20  count_21  count_22  count_23  count_24  
0       -99       -99       -99       -99       -99       -99  
1       -99       -99       -99       -99       -99       -99  

[2 rows x 29 columns]

Explanation:

  1. Create column by GroupBy.cumcount for counter
  2. Create MultiIndex by DataFrame.set_index and reshape by Series.unstack
  3. Add missing columns by range(25) by DataFrame.reindex
  4. Rename columns names by DataFrame.add_prefix
  5. Last cleaning - DataFrame.rename_axis with DataFrame.reset_index

Upvotes: 3

Related Questions