Reputation: 615
I have this dataframe:
print (df)
exam student
0 French a
1 English a
2 Italian a
3 Chinese b
4 Russian b
5 German b
6 Chinese c
7 Spanish c
8 English c
9 French c
I need to find for each student the number of students that took the same exams as he did.
It should be something like this:
exam student total_st
0 French a 1
1 English a 1
2 Italian a 1
3 Chinese b 1
4 Russian b 1
5 German b 1
6 German c 2
7 Spanish c 2
8 English c 2
9 French c 2
Total number for student A is 1 because it has common exams just with one student (in this case with student C).
Total number for student B is 1 because it has common exams just with one student (in this case with student C).
Total number for student C is 2 because it has common exams with both students (with students A and B).
Any ideas?
Thank you in advance!
Upvotes: 0
Views: 735
Reputation: 214927
You can calculate a contingency table of exam
and student
firstly, and then do a cross product to check if there is any overlap of exams between student and count the number of students that have at least one share exam, and map the result to the original student column:
cont_table = pd.crosstab(df.exam, df.student)
# cont_table.T.dot(cont_table) gives a table how many exams each student shared with
# another student, -1 to exclude the student himself
shared_count = (cont_table.T.dot(cont_table) != 0).sum(0) - 1
shared_count
#student
#a 1
#b 1
#c 2
#dtype: int64
df['total_st'] = df.student.map(shared_count)
df
Upvotes: 2