Reputation: 615
I have the following dataframe:
   exam_id  student  semester
0       01        a         1
1       02        b         2
2       03        c         3
3       01        d         1
4       02        e         2
5       03        f         3
6       01        g         1
I would like to create a new dataframe containing four columns: "student", "shared exam with", "semester", "number of shared exams".
   student  shared_exam_with  semester  number_of_shared_exam
0        a                 d         1                      1
1        a                 g         1                      1
2        b                 e         2                      1
3        c                 f         3                      1
4        d                 a         1                      1
5        d                 g         1                      1
6        e                 b         2                      1
7        f                 c         3                      1
8        g                 a         1                      1
9        g                 d         1                      1
Any suggestions?
Upvotes: 0
Views: 231
Reputation: 294218
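import numpy as np

idx_cols = ['exam_id', 'semester']
std_cols = ['student_x', 'student_y']
# self-merge: pair every student with everyone on the same exam and semester
d1 = df.merge(df, on=idx_cols)
# drop self-pairs; copy so the assignment below does not write to a view
d2 = d1.loc[d1.student_x != d1.student_y, idx_cols + std_cols].copy()
# sort each pair so that (a, d) and (d, a) collapse into a single row
d2[std_cols] = np.sort(d2[std_cols].to_numpy(), axis=1)
d3 = d2.drop_duplicates().groupby(
    std_cols + ['semester']).size().reset_index(name='count')
print(d3)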
   student_x  student_y  semester  count
0          a          d         1      1
1          a          g         1      1
2          b          e         2      1
3          c          f         3      1
4          d          g         1      1
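How it works:
- merge the dataframe with itself on just semester and exam_id, so every student is paired with every student who sat the same exam in the same semester
- drop the rows where a student is paired with themselves
- sort each pair so that (a, d) and (d, a) become the same row, then drop the duplicates
- group by the pair and the semester and count the rows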
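The output above lists each pair only once, whereas the question asks for both directions (a with d and d with a). A minimal sketch of one possible variant that keeps both directions by skipping the sort step is shown here. It assumes exam_id is stored as a string to preserve the leading zero, and the `_other` suffix and the `pairs` and `result` names are only illustrative.

```python
import pandas as pd

# Sample data from the question; exam_id kept as strings (assumption).
df = pd.DataFrame({
    'exam_id':  ['01', '02', '03', '01', '02', '03', '01'],
    'student':  list('abcdefg'),
    'semester': [1, 2, 3, 1, 2, 3, 1],
})

idx_cols = ['exam_id', 'semester']

# Self-merge: every student paired with every student on the same exam and semester.
pairs = df.merge(df, on=idx_cols, suffixes=('', '_other'))

# Drop self-pairs but keep both (a, d) and (d, a).
pairs = pairs[pairs['student'] != pairs['student_other']]

# Count shared exams per ordered pair and semester, then rename to the requested headers.
result = (
    pairs.groupby(['student', 'student_other', 'semester'])
         .size()
         .reset_index(name='number_of_shared_exam')
         .rename(columns={'student_other': 'shared_exam_with'})
)
print(result)
```

On the sample data this should give one row per ordered pair, in the column layout the question asks for.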
Upvotes: 2