Reputation: 119
I have created the following dataframe, where a given assignment work_id is performed by a student s_id on a date work_date. A work_id generally consists of three tests, specified by test_code, and each performed test has a score and a relative threshold thresh in the dataframe. A comment comment1 is also present if test1 is performed.
In [5]: import pandas as pd
   ...: import numpy as np
   ...:
   ...: df = pd.DataFrame(columns=['work_id', 'test_code', 's_id', 'score', 'thresh', 'work_date', 'comment1'],
   ...:                   data=[['a1', 'test1', 'p01', 5, 1, '2020-06-15', 'score was 5'],
   ...:                         ['a2', 'test1', 'p01', 10, 1, '2020-06-10', 'score was 10'],
   ...:                         ['a2', 'test3', 'p01', 7, 3, '2020-06-10', np.nan],
   ...:                         ['a3', 'test2', 'p01', 6, 2, '2020-05-01', np.nan],
   ...:                         ['a4', 'test1', 'p02', 4, 1, '2020-06-20', 'score was 4'],
   ...:                         ['a4', 'test2', 'p02', 5, 2, '2020-06-20', np.nan],
   ...:                         ['a5', 'test3', 'p02', 2, 3, '2019-10-10', np.nan],
   ...:                         ['a6', 'test3', 'p02', 7, 3, '2020-04-01', np.nan]])
   ...: df
Out[5]:
work_id test_code s_id score thresh work_date comment1
0 a1 test1 p01 5 1 2020-06-15 score was 5
1 a2 test1 p01 10 1 2020-06-10 score was 10
2 a2 test3 p01 7 3 2020-06-10 NaN
3 a3 test2 p01 6 2 2020-05-01 NaN
4 a4 test1 p02 4 1 2020-06-20 score was 4
5 a4 test2 p02 5 2 2020-06-20 NaN
6 a5 test3 p02 2 3 2019-10-10 NaN
7 a6 test3 p02 7 3 2020-04-01 NaN
I want to pivot the dataframe with work_id as the index. A simple pivot table would look like this:
work_id s_id score_1 score_2 score_3 thresh_1 thresh_2 thresh_3 date comment1
0 a1 p01 5.0 NaN NaN 1.0 NaN NaN 2020-06-15 score was 5
1 a2 p01 10.0 NaN 7.0 1.0 NaN 3.0 2020-06-10 score was 10
2 a3 p01 NaN 6.0 NaN NaN 2.0 NaN 2020-05-01 NaN
3 a4 p02 4.0 5.0 NaN 1.0 2.0 NaN 2020-06-20 score was 4
4 a5 p02 NaN NaN 2.0 NaN NaN 3.0 2019-10-10 NaN
5 a6 p02 NaN NaN 7.0 NaN NaN 3.0 2020-04-01 NaN
Using

df_p = pd.pivot_table(df, index=['work_id','s_id','work_date','comment1'], columns='test_code', values=['score','thresh']).reset_index()

does not include the rows with a missing comment, because pivot_table groups on the index keys and rows where comment1 is NaN are silently dropped.
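A minimal sketch of one possible workaround, continuing the session above: filling comment1 with a placeholder before pivoting keeps those rows (the empty-string filler is my choice, not something from the question):

# Rows with NaN in any index key are dropped by pivot_table, so
# replace NaN comments with '' before pivoting.
df_p = pd.pivot_table(df.fillna({'comment1': ''}),
                      index=['work_id', 's_id', 'work_date', 'comment1'],
                      columns='test_code',
                      values=['score', 'thresh']).reset_index()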
Furthermore, for each work_id I would like to fill the missing test_code values with those from the previous work_id (i.e. the work_id with the closest work_date in the past) performed by the same s_id, a column Delta1 showing the variation in the score of test1 with respect to the previous work_id performed by the same s_id, and a column Mean1 with the average over all the test1 scores of the same s_id, shown only on the most recent work. For example, for a1 Delta1 is 5 - 10 = -5.0 and Mean1 is (5 + 10) / 2 = 7.5. The final result would be:
work_id s_id score_1 score_2 score_3 thresh_1 thresh_2 thresh_3 date comment1 Delta1 Mean1
0 a1 p01 5.0 6.0 7.0 1.0 2.0 3.0 2020-06-15 score was 5 -5.0 7.5
1 a2 p01 10.0 6.0 7.0 1.0 2.0 3.0 2020-06-10 score was 10 NaN NaN
2 a3 p01 NaN 6.0 NaN NaN 2.0 NaN 2020-05-01 NaN NaN NaN
3 a4 p02 4.0 5.0 7.0 1.0 2.0 3.0 2020-06-20 score was 4 NaN 4.0
4 a5 p02 NaN NaN 2.0 NaN NaN 3.0 2019-10-10 NaN NaN NaN
5 a6 p02 NaN NaN 7.0 NaN NaN 3.0 2020-04-01 NaN NaN NaN
Upvotes: 1
Views: 68
Reputation: 71687
Use:
# step-1: pivot to wide form and flatten the MultiIndex columns
df1 = df.astype({'comment1': 'str'}).set_index(
    ['work_id', 's_id', 'comment1', 'work_date', 'test_code']).unstack()
df1.columns = df1.columns.map(lambda s: '_'.join([s[0], s[1].strip('test')]))
# step-2: collapse the duplicate rows per (work_id, s_id), restore real NaNs,
# and sort by date so the groupwise operations below see the works in order
df1 = df1.reset_index()
df1['comment1'] = df1['comment1'].replace('nan', np.nan)
df1 = df1.groupby(['work_id', 's_id'], as_index=False).first().sort_values('work_date')
# step-3: per-student change in score_1, and its mean shown only on the latest work
g = df1['score_1'].groupby(df1['s_id'])
df1['Delta1'] = g.diff()
df1['Mean1'] = g.transform('mean').mask(lambda x: x.duplicated(keep='last'))
# step-4: forward-fill the score_*/thresh_* columns within each student,
# then restore the original row order
cols = df1.columns.str.contains(r'_\d+$')
df1.loc[:, cols] = df1.loc[:, cols].groupby(df1['s_id']).ffill()
df1 = df1.sort_index()
Details:
Step-1: Use DataFrame.unstack to pivot the dataframe, then use map with join to flatten the resulting MultiIndex columns (strip('test') removes the characters t, e, s and t from both ends of the label, leaving just the digit).
# step-1
score_1 score_2 score_3 thresh_1 thresh_2 thresh_3
work_id s_id comment1 work_date
a1 p01 score was 5 2020-06-15 5.0 NaN NaN 1.0 NaN NaN
a2 p01 nan 2020-06-10 NaN NaN 7.0 NaN NaN 3.0
score was 10 2020-06-10 10.0 NaN NaN 1.0 NaN NaN
a3 p01 nan 2020-05-01 NaN 6.0 NaN NaN 2.0 NaN
a4 p02 nan 2020-06-20 NaN 5.0 NaN NaN 2.0 NaN
score was 4 2020-06-20 4.0 NaN NaN 1.0 NaN NaN
a5 p02 nan 2019-10-10 NaN NaN 2.0 NaN NaN 3.0
a6 p02 nan 2020-04-01 NaN NaN 7.0 NaN NaN 3.0
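strip('test') happens to work because the second-level labels are exactly 'test<N>', but it strips characters rather than a prefix; a more explicit flattening, assuming that label pattern, could be:

# Build 'score_1', 'thresh_3', ... from the ('score', 'test1')-style
# column tuples by removing the 'test' prefix once.
df1.columns = [f"{field}_{code.replace('test', '', 1)}" for field, code in df1.columns]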
Step-2: Use groupby on work_id and s_id with the aggregation function first to collapse the two partial rows that unstack produced per work_id (one for the real comment, one for the 'nan' string), then sort the dataframe on work_date:
# step-2
work_id s_id comment1 work_date score_1 score_2 score_3 thresh_1 thresh_2 thresh_3
4 a5 p02 NaN 2019-10-10 NaN NaN 2.0 NaN NaN 3.0
5 a6 p02 NaN 2020-04-01 NaN NaN 7.0 NaN NaN 3.0
2 a3 p01 NaN 2020-05-01 NaN 6.0 NaN NaN 2.0 NaN
1 a2 p01 score was 10 2020-06-10 10.0 NaN 7.0 1.0 NaN 3.0
0 a1 p01 score was 5 2020-06-15 5.0 NaN NaN 1.0 NaN NaN
3 a4 p02 score was 4 2020-06-20 4.0 5.0 NaN 1.0 2.0 NaN
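The role of first() in isolation: it takes the first non-null value per column within each group, which is what merges a2's two partial rows into one. A toy sketch with made-up values, continuing the session above:

# Two partial rows collapse into one complete row: first() skips NaN.
pair = pd.DataFrame({'score_1': [np.nan, 10.0], 'score_3': [7.0, np.nan]})
pair.groupby(np.array(['a2', 'a2'])).first()   # -> one row: score_1 10.0, score_3 7.0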
Step-3: Calculate Delta1 and Mean1 by grouping score_1 on s_id: diff gives the change from the student's previous work (the frame is already in date order), and transform('mean') broadcasts the per-student mean, which is then masked so that it remains only on the student's most recent work:
# step-3
work_id s_id comment1 work_date score_1 score_2 score_3 thresh_1 thresh_2 thresh_3 Delta1 Mean1
4 a5 p02 NaN 2019-10-10 NaN NaN 2.0 NaN NaN 3.0 NaN NaN
5 a6 p02 NaN 2020-04-01 NaN NaN 7.0 NaN NaN 3.0 NaN NaN
2 a3 p01 NaN 2020-05-01 NaN 6.0 NaN NaN 2.0 NaN NaN NaN
1 a2 p01 score was 10 2020-06-10 10.0 NaN 7.0 1.0 NaN 3.0 NaN NaN
0 a1 p01 score was 5 2020-06-15 5.0 NaN NaN 1.0 NaN NaN -5.0 7.5
3 a4 p02 score was 4 2020-06-20 4.0 5.0 NaN 1.0 2.0 NaN NaN 4.0
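The masking trick in isolation: duplicated(keep='last') is True for every occurrence of a value except the last, so mask blanks all but the last row. Note that it compares values across the whole column rather than per group; that is safe here only because the two students' means (7.5 and 4.0) differ:

# transform('mean') broadcasts 7.5 to every p01 row; the mask keeps
# it only on the most recent one.
s = pd.Series([7.5, 7.5, 7.5])
s.mask(s.duplicated(keep='last'))   # -> NaN, NaN, 7.5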
Step-4: Replace the missing test_code values with those from the previous work_id by forward-filling the score_*/thresh_* columns within each s_id (the frame is still sorted by work_date), then restore the original row order with sort_index:
# step-4
work_id s_id comment1 work_date score_1 score_2 score_3 thresh_1 thresh_2 thresh_3 Delta1 Mean1
0 a1 p01 score was 5 2020-06-15 5.0 6.0 7.0 1.0 2.0 3.0 -5.0 7.5
1 a2 p01 score was 10 2020-06-10 10.0 6.0 7.0 1.0 2.0 3.0 NaN NaN
2 a3 p01 NaN 2020-05-01 NaN 6.0 NaN NaN 2.0 NaN NaN NaN
3 a4 p02 score was 4 2020-06-20 4.0 5.0 7.0 1.0 2.0 3.0 NaN 4.0
4 a5 p02 NaN 2019-10-10 NaN NaN 2.0 NaN NaN 3.0 NaN NaN
5 a6 p02 NaN 2020-04-01 NaN NaN 7.0 NaN NaN 3.0 NaN NaN
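If the exact layout from the question is wanted (a date column instead of work_date), a final optional touch, not part of the answer's steps:

# Rename and reorder the columns to match the desired output in the question.
out = df1.rename(columns={'work_date': 'date'})[
    ['work_id', 's_id', 'score_1', 'score_2', 'score_3',
     'thresh_1', 'thresh_2', 'thresh_3', 'date', 'comment1', 'Delta1', 'Mean1']]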
Upvotes: 1