Reputation: 119
I have created the following dataframe, where a given assignment work_id is performed by a student s_id on a date work_date. A work_id generally consists of three tests, specified by test_code, and each performed test has a score and a relative threshold thresh in the dataframe. A comment comment1 is also present if test1 is performed.
In [5]: import pandas as pd
   ...: import numpy as np
   ...:
   ...: df = pd.DataFrame(columns=['work_id', 'test_code', 's_id', 'score', 'thresh', 'work_date', 'comment1'],
   ...:                   data=[['a1', 'test1', 'p01', 5, 1, '2020-06-15', 'score was 5'],
   ...:                         ['a2', 'test1', 'p01', 10, 1, '2020-06-10', 'score was 10'],
   ...:                         ['a2', 'test3', 'p01', 7, 3, '2020-06-10', np.nan],
   ...:                         ['a3', 'test2', 'p01', 6, 2, '2020-05-01', np.nan],
   ...:                         ['a4', 'test1', 'p02', 4, 1, '2020-06-20', 'score was 4'],
   ...:                         ['a4', 'test2', 'p02', 5, 2, '2020-06-20', np.nan],
   ...:                         ['a5', 'test3', 'p02', 2, 3, '2019-10-10', np.nan],
   ...:                         ['a6', 'test3', 'p02', 7, 3, '2020-04-01', np.nan]])
   ...: df
Out[5]:
work_id test_code s_id score thresh work_date comment1
0 a1 test1 p01 5 1 2020-06-15 score was 5
1 a2 test1 p01 10 1 2020-06-10 score was 10
2 a2 test3 p01 7 3 2020-06-10 NaN
3 a3 test2 p01 6 2 2020-05-01 NaN
4 a4 test1 p02 4 1 2020-06-20 score was 4
5 a4 test2 p02 5 2 2020-06-20 NaN
6 a5 test3 p02 2 3 2019-10-10 NaN
7 a6 test3 p02 7 3 2020-04-01 NaN
I want to pivot the dataframe with work_id as the index. A simple pivot table would look like this:
work_id s_id score_1 score_2 score_3 thresh_1 thresh_2 thresh_3 date comment1
0 a1 p01 5.0 NaN NaN 1.0 NaN NaN 2020-06-15 score was 5
1 a2 p01 10.0 NaN 7.0 1.0 NaN 3.0 2020-06-10 score was 10
2 a3 p01 NaN 6.0 NaN NaN 2.0 NaN 2020-05-01 NaN
3 a4 p02 4.0 5.0 NaN 1.0 2.0 NaN 2020-06-20 score was 4
4 a5 p02 NaN NaN 2.0 NaN NaN 3.0 2019-10-10 NaN
5 a6 p02 NaN NaN 7.0 NaN NaN 3.0 2020-04-01 NaN
Using

df_p = pd.pivot_table(df, index=['work_id','s_id','work_date','comment1'], columns='test_code', values=['score','thresh']).reset_index()

does not include the rows with a missing comment, because pivot_table groups on the index keys and rows where comment1 is NaN are silently dropped.
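A minimal sketch of one possible workaround, continuing the session above: filling comment1 with a placeholder before pivoting keeps those rows (the empty-string filler is my choice, not something from the question):

# Rows with NaN in any index key are dropped by pivot_table, so
# replace NaN comments with '' before pivoting.
df_p = pd.pivot_table(df.fillna({'comment1': ''}),
                      index=['work_id', 's_id', 'work_date', 'comment1'],
                      columns='test_code',
                      values=['score', 'thresh']).reset_index()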
Furthermore, for each work_id I would like to fill the missing test_code values with those from the previous work_id (i.e. the work_id with the closest work_date in the past) performed by the same s_id, a column Delta1 showing the variation in the score of test1 with respect to the previous work_id performed by the same s_id, and a column Mean1 with the average over all the test1 scores of the same s_id, shown only on the most recent work. For example, for a1 Delta1 is 5 - 10 = -5.0 and Mean1 is (5 + 10) / 2 = 7.5. The final result would be:
work_id s_id score_1 score_2 score_3 thresh_1 thresh_2 thresh_3 date comment1 Delta1 Mean1
0 a1 p01 5.0 6.0 7.0 1.0 2.0 3.0 2020-06-15 score was 5 -5.0 7.5
1 a2 p01 10.0 6.0 7.0 1.0 2.0 3.0 2020-06-10 score was 10 NaN NaN
2 a3 p01 NaN 6.0 NaN NaN 2.0 NaN 2020-05-01 NaN NaN NaN
3 a4 p02 4.0 5.0 7.0 1.0 2.0 3.0 2020-06-20 score was 4 NaN 4.0
4 a5 p02 NaN NaN 2.0 NaN NaN 3.0 2019-10-10 NaN NaN NaN
5 a6 p02 NaN NaN 7.0 NaN NaN 3.0 2020-04-01 NaN NaN NaN
Upvotes: 1
Views: 68
Reputation: 71687
Use:
# step-1: pivot to wide form and flatten the MultiIndex columns
df1 = df.astype({'comment1': 'str'}).set_index(
    ['work_id', 's_id', 'comment1', 'work_date', 'test_code']).unstack()
df1.columns = df1.columns.map(lambda s: '_'.join([s[0], s[1].strip('test')]))
# step-2: collapse the duplicate rows per (work_id, s_id), restore real NaNs,
# and sort by date so the groupwise operations below see the works in order
df1 = df1.reset_index()
df1['comment1'] = df1['comment1'].replace('nan', np.nan)
df1 = df1.groupby(['work_id', 's_id'], as_index=False).first().sort_values('work_date')
# step-3: per-student change in score_1, and its mean shown only on the latest work
g = df1['score_1'].groupby(df1['s_id'])
df1['Delta1'] = g.diff()
df1['Mean1'] = g.transform('mean').mask(lambda x: x.duplicated(keep='last'))
# step-4: forward-fill the score_*/thresh_* columns within each student,
# then restore the original row order
cols = df1.columns.str.contains(r'_\d+$')
df1.loc[:, cols] = df1.loc[:, cols].groupby(df1['s_id']).ffill()
df1 = df1.sort_index()
Details:
Step-1: Use DataFrame.unstack to pivot the dataframe, then use map with join to flatten the resulting MultiIndex columns (strip('test') removes the characters t, e, s and t from both ends of the label, leaving just the digit).
# step-1
score_1 score_2 score_3 thresh_1 thresh_2 thresh_3
work_id s_id comment1 work_date
a1 p01 score was 5 2020-06-15 5.0 NaN NaN 1.0 NaN NaN
a2 p01 nan 2020-06-10 NaN NaN 7.0 NaN NaN 3.0
score was 10 2020-06-10 10.0 NaN NaN 1.0 NaN NaN
a3 p01 nan 2020-05-01 NaN 6.0 NaN NaN 2.0 NaN
a4 p02 nan 2020-06-20 NaN 5.0 NaN NaN 2.0 NaN
score was 4 2020-06-20 4.0 NaN NaN 1.0 NaN NaN
a5 p02 nan 2019-10-10 NaN NaN 2.0 NaN NaN 3.0
a6 p02 nan 2020-04-01 NaN NaN 7.0 NaN NaN 3.0
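strip('test') happens to work because the second-level labels are exactly 'test<N>', but it strips characters rather than a prefix; a more explicit flattening, assuming that label pattern, could be:

# Build 'score_1', 'thresh_3', ... from the ('score', 'test1')-style
# column tuples by removing the 'test' prefix once.
df1.columns = [f"{field}_{code.replace('test', '', 1)}" for field, code in df1.columns]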
Step-2: Use groupby on work_id and s_id with the aggregation function first to collapse the two partial rows that unstack produced per work_id (one for the real comment, one for the 'nan' string), then sort the dataframe on work_date:
# step-2
work_id s_id comment1 work_date score_1 score_2 score_3 thresh_1 thresh_2 thresh_3
4 a5 p02 NaN 2019-10-10 NaN NaN 2.0 NaN NaN 3.0
5 a6 p02 NaN 2020-04-01 NaN NaN 7.0 NaN NaN 3.0
2 a3 p01 NaN 2020-05-01 NaN 6.0 NaN NaN 2.0 NaN
1 a2 p01 score was 10 2020-06-10 10.0 NaN 7.0 1.0 NaN 3.0
0 a1 p01 score was 5 2020-06-15 5.0 NaN NaN 1.0 NaN NaN
3 a4 p02 score was 4 2020-06-20 4.0 5.0 NaN 1.0 2.0 NaN
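The role of first() in isolation: it takes the first non-null value per column within each group, which is what merges a2's two partial rows into one. A toy sketch with made-up values, continuing the session above:

# Two partial rows collapse into one complete row: first() skips NaN.
pair = pd.DataFrame({'score_1': [np.nan, 10.0], 'score_3': [7.0, np.nan]})
pair.groupby(np.array(['a2', 'a2'])).first()   # -> one row: score_1 10.0, score_3 7.0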
Step-3: Calculate Delta1 and Mean1 by grouping score_1 on s_id: diff gives the change from the student's previous work (the frame is already in date order), and transform('mean') broadcasts the per-student mean, which is then masked so that it remains only on the student's most recent work:
# step-3
work_id s_id comment1 work_date score_1 score_2 score_3 thresh_1 thresh_2 thresh_3 Delta1 Mean1
4 a5 p02 NaN 2019-10-10 NaN NaN 2.0 NaN NaN 3.0 NaN NaN
5 a6 p02 NaN 2020-04-01 NaN NaN 7.0 NaN NaN 3.0 NaN NaN
2 a3 p01 NaN 2020-05-01 NaN 6.0 NaN NaN 2.0 NaN NaN NaN
1 a2 p01 score was 10 2020-06-10 10.0 NaN 7.0 1.0 NaN 3.0 NaN NaN
0 a1 p01 score was 5 2020-06-15 5.0 NaN NaN 1.0 NaN NaN -5.0 7.5
3 a4 p02 score was 4 2020-06-20 4.0 5.0 NaN 1.0 2.0 NaN NaN 4.0
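The masking trick in isolation: duplicated(keep='last') is True for every occurrence of a value except the last, so mask blanks all but the last row. Note that it compares values across the whole column rather than per group; that is safe here only because the two students' means (7.5 and 4.0) differ:

# transform('mean') broadcasts 7.5 to every p01 row; the mask keeps
# it only on the most recent one.
s = pd.Series([7.5, 7.5, 7.5])
s.mask(s.duplicated(keep='last'))   # -> NaN, NaN, 7.5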
Step-4: Replace the missing test_code values with those from the previous work_id by forward-filling the score_*/thresh_* columns within each s_id (the frame is still sorted by work_date), then restore the original row order with sort_index:
# step-4
work_id s_id comment1 work_date score_1 score_2 score_3 thresh_1 thresh_2 thresh_3 Delta1 Mean1
0 a1 p01 score was 5 2020-06-15 5.0 6.0 7.0 1.0 2.0 3.0 -5.0 7.5
1 a2 p01 score was 10 2020-06-10 10.0 6.0 7.0 1.0 2.0 3.0 NaN NaN
2 a3 p01 NaN 2020-05-01 NaN 6.0 NaN NaN 2.0 NaN NaN NaN
3 a4 p02 score was 4 2020-06-20 4.0 5.0 7.0 1.0 2.0 3.0 NaN 4.0
4 a5 p02 NaN 2019-10-10 NaN NaN 2.0 NaN NaN 3.0 NaN NaN
5 a6 p02 NaN 2020-04-01 NaN NaN 7.0 NaN NaN 3.0 NaN NaN
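If the exact layout from the question is wanted (a date column instead of work_date), a final optional touch, not part of the answer's steps:

# Rename and reorder the columns to match the desired output in the question.
out = df1.rename(columns={'work_date': 'date'})[
    ['work_id', 's_id', 'score_1', 'score_2', 'score_3',
     'thresh_1', 'thresh_2', 'thresh_3', 'date', 'comment1', 'Delta1', 'Mean1']]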
Upvotes: 1