Arnold Klein
Arnold Klein

Reputation: 3086

Reshape GroupBy in Pandas and pad with nan if missing

Given a data frame with various numbers of elements in each group ('groupby' by some variable), I need to reshape into a matrix with predefined number of columns. For example:

    summary_x  participant_id_x response_date cuts
0         3.0                11    2016-05-05    a
1         3.0                11    2016-05-06    a
2         4.0                11    2016-05-07    a
3         4.0                11    2016-05-08    a
4         3.0                11    2016-05-09    a
5         3.0                11    2016-05-10    a
6         3.0                11    2016-05-11    a
7         3.0                11    2016-05-12    a
8         3.0                11    2016-05-13    a
9         3.0                11    2016-05-14    a
13        4.0                11    2016-05-22    b
14        4.0                11    2016-05-23    b
15        3.0                11    2016-05-24    b
16        3.0                11    2016-05-25    b
17        3.0                11    2016-05-26    b
18        3.0                11    2016-05-27    b
19        3.0                11    2016-05-28    b
20        3.0                11    2016-06-02    c
21        3.0                11    2016-06-03    c
22        3.0                11    2016-06-04    c
23        3.0                11    2016-06-05    c
24        3.0                11    2016-06-06    c
25        3.0                11    2016-06-07    c
26        3.0                11    2016-06-08    c
27        3.0                11    2016-06-09    c
28        3.0                11    2016-06-10    c
29        5.0                11    2016-06-11    c

Each group (by'cuts'), contains 10 elements, but the group 'b' contains only 7. I would like to have a matrix from 'summary_x' reshaped as (3,10), where the missing values will be filled in with nans:

pd.DataFrame(df.summary_x.values.reshape((-1,10)))

      0    1    2    3    4    5    6    7    8    9
0   3.0  3.0  4.0  4.0  3.0  3.0  3.0  3.0  3.0  3.0
1   nan  nan  nan  4.0  4.0  3.0  3.0  3.0  3.0  3.0
2   3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  5.0

any solutions?

Upvotes: 1

Views: 471

Answers (1)

jezrael
jezrael

Reputation: 863166

You can use cumcount with [::-1] for change order of columns (rows):

g = df.groupby('cuts').cumcount(ascending=False)
df = pd.pivot(index=df['cuts'], columns=g, values=df['summary_x']).iloc[:,::-1]
       .reset_index(drop=True)
df.columns = np.arange(len(df.columns))
print (df)
     0    1    2    3    4    5    6    7    8    9
0  3.0  3.0  4.0  4.0  3.0  3.0  3.0  3.0  3.0  3.0
1  NaN  NaN  NaN  4.0  4.0  3.0  3.0  3.0  3.0  3.0
2  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  5.0

Another solution:

L = df[::-1].groupby('cuts')['summary_x'].apply(list).values.tolist()
df = pd.DataFrame(L).iloc[:, ::-1]
df.columns = np.arange(len(df.columns))
print (df)
     0    1    2    3    4    5    6    7    8    9
0  3.0  3.0  4.0  4.0  3.0  3.0  3.0  3.0  3.0  3.0
1  NaN  NaN  NaN  4.0  4.0  3.0  3.0  3.0  3.0  3.0
2  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  5.0

But if NaNs can be in the end also:

g = df.groupby('cuts').cumcount()
df = pd.pivot(index=df['cuts'], columns=g, values=df['summary_x']).reset_index(drop=True)

print (df)
     0    1    2    3    4    5    6    7    8    9
0  3.0  3.0  4.0  4.0  3.0  3.0  3.0  3.0  3.0  3.0
1  4.0  4.0  3.0  3.0  3.0  3.0  3.0  NaN  NaN  NaN
2  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  5.0

Upvotes: 1

Related Questions