Reputation: 737
I have a dataframe with students and grades in each of the exams. For each exam there is a set of intervals and if the grade of the student is inside the interval the student gets the amount of points specified in the points
column. I want to calculate the final points for each student.
Given these dataframes:
>>> df_students
student_name grade_first_exam grade_second_exam
0 Alice 5 5
1 Bob 2 3
2 John 8 9
>>> df_intervals
exam_id intervals points
0 grade_first_exam [0, 5) 1
1 grade_first_exam (5, 10] 5
2 grade_second_exam [0, 4] 1
3 grade_second_exam (4, 7] 2
4 grade_second_exam (7, 10] 5
I would like to calculate the final score for each student
student_name grade_first_exam grade_second_exam final_score
0 Alice 5 5 2
1 Bob 2 3 2
2 John 8 9 10
I solved it by selecting each student and each group of intervals (for the same test) and then calculating the score.
import pandas as pd
from functools import partial
def calculate_scores_single_student(student):
return (
df_intervals.groupby("exam_id").apply(partial(calculate_score_single_exam, student)).sum()
)
def calculate_score_single_exam(student, intervals_test):
is_in_interval_mask = intervals_test.apply(
lambda x: student[intervals_test.exam_id.unique()][0] in x["intervals"], axis=1
)
return intervals_test[is_in_interval_mask]["points"]
df_students = pd.DataFrame(
{
"student_name": ["Alice", "Bob", "John"],
"grade_first_exam": [5, 2, 8],
"grade_second_exam": [5, 3, 9],
}
)
df_intervals = pd.DataFrame(
{
"exam_id": [
"grade_first_exam",
"grade_first_exam",
"grade_second_exam",
"grade_second_exam",
"grade_second_exam",
],
"intervals": [
pd.Interval(0, 5, closed="left"),
pd.Interval(5, 10),
pd.Interval(0, 4, closed="both"),
pd.Interval(4, 7),
pd.Interval(7, 10),
],
"points": [1, 5, 1, 2, 5],
}
)
df_students["final_score"] = df_students.apply(calculate_scores_single_student, axis=1)
The problem with this solution is that the complexity is O(SEI)
Where S
is the number of students, E
the number of different exams and I
the number of intervals for each exam. This obviously doesn't scale to my real data (120K students).
The problem with other approaches that I tried is that pd.Interval
is not vectorizable so it's not possible to do
pd.DataFrame([1,2]) in pd.Interval(3,4)
to check for every number in the df if it is inside the interval (returning a Series with booleans).
Edit: The intervals must be flexible (for a given exam the intervals can be open or closed on either side)
Upvotes: 1
Views: 1185
Reputation: 737
I decided to extend the class pd.Interval
to implement a method series_inside_interval
that allows comparison of pd.Series
with pd.interval
import pandas as pd
class CustomRuleInterval(pd.Interval):
def __init__(self, left, right, closed="right"):
super().__init__(left, right, closed)
def series_inside_interval(self, ds):
# We could override the __contains__ of pd.series but seems like too much of a headache
if self.closed == "right":
return (self.left < ds) & (ds <= self.right)
elif self.closed == "left":
return (self.left <= ds) & (ds < self.right)
elif self.closed == "both":
return (self.left <= ds) & (ds <= self.right)
else:
# neither
return (self.left < ds) & (ds < self.right)
def scores_for_ds(ds):
return (
df_intervals[df_intervals["exam_id"] == ds.name]
.apply(
lambda interval_row: interval_row["intervals"].series_inside_interval(ds)
* interval_row["points"],
axis=1,
)
.T.sum(1)
)
df_students = pd.DataFrame(
{
"student_name": ["Alice", "Bob", "John"],
"grade_first_exam": [5, 2, 8],
"grade_second_exam": [5, 3, 9],
}
)
df_intervals = pd.DataFrame(
{
"exam_id": [
"grade_first_exam",
"grade_first_exam",
"grade_second_exam",
"grade_second_exam",
"grade_second_exam",
],
"intervals": [
CustomRuleInterval(0, 5, closed="left"),
CustomRuleInterval(5, 10),
CustomRuleInterval(0, 4, closed="both"),
CustomRuleInterval(4, 7),
CustomRuleInterval(7, 10),
],
"points": [1, 5, 1, 2, 5],
}
)
df_students[df_intervals.exam_id.unique()].apply(scores_for_ds, axis=0).sum(1)
Upvotes: 0
Reputation: 195418
A step-by-step approach (using pd.cut
and converting the intervals to pd.IntervalIndex
):
first_exam = df_intervals[df_intervals['exam_id']=='grade_first_exam']
second_exam = df_intervals[df_intervals['exam_id']=='grade_second_exam']
ct1 = pd.cut(df_students['grade_first_exam'], pd.IntervalIndex( first_exam['intervals'] )).to_frame()
ct2 = pd.cut(df_students['grade_second_exam'], pd.IntervalIndex( second_exam['intervals'] )).to_frame()
p1 = pd.merge(first_exam, ct1, left_on='intervals', right_on='grade_first_exam', how='right')['points']
p2 = pd.merge(second_exam, ct2, left_on='intervals', right_on='grade_second_exam', how='right')['points']
df_students['final_score'] = p1 + p2
print(df_students)
Prints:
student_name grade_first_exam grade_second_exam final_score
0 Alice 5 5 3
1 Bob 2 3 2
2 John 8 9 10
Upvotes: 2