Reputation: 737

Efficient way of checking if pandas dataframe is inside given intervals

I have a dataframe with students and grades in each of the exams. For each exam there is a set of intervals and if the grade of the student is inside the interval the student gets the amount of points specified in the points column. I want to calculate the final points for each student.

Given these dataframes:

>>> df_students

  student_name  grade_first_exam  grade_second_exam
0        Alice                 5                  5
1          Bob                 2                  3
2         John                 8                  9

>>> df_intervals

             exam_id intervals  points
0   grade_first_exam    [0, 5)       1
1   grade_first_exam   (5, 10]       5
2  grade_second_exam    [0, 4]       1
3  grade_second_exam    (4, 7]       2
4  grade_second_exam   (7, 10]       5

I would like to calculate the final score for each student

  student_name  grade_first_exam  grade_second_exam  final_score
0        Alice                 5                  5            2
1          Bob                 2                  3            2
2         John                 8                  9           10

I solved it by selecting each student and each group of intervals (for the same test) and then calculating the score.

import pandas as pd
from functools import partial


def calculate_scores_single_student(student):
    return (
        df_intervals.groupby("exam_id").apply(partial(calculate_score_single_exam, student)).sum()
    )


def calculate_score_single_exam(student, intervals_test):

    is_in_interval_mask = intervals_test.apply(
        lambda x: student[intervals_test.exam_id.unique()][0] in x["intervals"], axis=1
    )
    return intervals_test[is_in_interval_mask]["points"]


df_students = pd.DataFrame(
    {
        "student_name": ["Alice", "Bob", "John"],
        "grade_first_exam": [5, 2, 8],
        "grade_second_exam": [5, 3, 9],
    }
)

df_intervals = pd.DataFrame(
    {
        "exam_id": [
            "grade_first_exam",
            "grade_first_exam",
            "grade_second_exam",
            "grade_second_exam",
            "grade_second_exam",
        ],
        "intervals": [
            pd.Interval(0, 5, closed="left"),
            pd.Interval(5, 10),
            pd.Interval(0, 4, closed="both"),
            pd.Interval(4, 7),
            pd.Interval(7, 10),
        ],
        "points": [1, 5, 1, 2, 5],
    }
)

df_students["final_score"] = df_students.apply(calculate_scores_single_student, axis=1)

The problem with this solution is that the complexity is O(SEI)

Where S is the number of students, E the number of different exams and I the number of intervals for each exam. This obviously doesn't scale to my real data (120K students).

The problem with other approaches that I tried is that pd.Interval is not vectorizable so it's not possible to do

pd.DataFrame([1,2]) in pd.Interval(3,4)

to check for every number in the df if it is inside the interval (returning a Series with booleans).

Edit: The intervals must be flexible (for a given exam the intervals can be open or closed on either side)

Upvotes: 1

Answers (2)

Gonzalo Hernandez

Reputation: 737

I decided to extend the class pd.Interval to implement a method series_inside_interval that allows comparison of pd.Series with pd.interval

import pandas as pd

class CustomRuleInterval(pd.Interval):
    def __init__(self, left, right, closed="right"):
        super().__init__(left, right, closed)

    def series_inside_interval(self, ds):
        # We could override the __contains__ of pd.series but seems like too much of a headache
        if self.closed == "right":
            return (self.left < ds) & (ds <= self.right)
        elif self.closed == "left":
            return (self.left <= ds) & (ds < self.right)
        elif self.closed == "both":
            return (self.left <= ds) & (ds <= self.right)
        else:
            # neither
            return (self.left < ds) & (ds < self.right)

def scores_for_ds(ds):
    return (
        df_intervals[df_intervals["exam_id"] == ds.name]
        .apply(
            lambda interval_row: interval_row["intervals"].series_inside_interval(ds)
            * interval_row["points"],
            axis=1,
        )
        .T.sum(1)
    )

df_students = pd.DataFrame(
    {
        "student_name": ["Alice", "Bob", "John"],
        "grade_first_exam": [5, 2, 8],
        "grade_second_exam": [5, 3, 9],
    }
)

df_intervals = pd.DataFrame(
    {
        "exam_id": [
            "grade_first_exam",
            "grade_first_exam",
            "grade_second_exam",
            "grade_second_exam",
            "grade_second_exam",
        ],
        "intervals": [
            CustomRuleInterval(0, 5, closed="left"),
            CustomRuleInterval(5, 10),
            CustomRuleInterval(0, 4, closed="both"),
            CustomRuleInterval(4, 7),
            CustomRuleInterval(7, 10),
        ],
        "points": [1, 5, 1, 2, 5],
    }
)


df_students[df_intervals.exam_id.unique()].apply(scores_for_ds, axis=0).sum(1)

Upvotes: 0

Andrej Kesely

Reputation: 195418

A step-by-step approach (using pd.cut and converting the intervals to pd.IntervalIndex):

first_exam = df_intervals[df_intervals['exam_id']=='grade_first_exam']
second_exam = df_intervals[df_intervals['exam_id']=='grade_second_exam']

ct1 = pd.cut(df_students['grade_first_exam'], pd.IntervalIndex( first_exam['intervals'] )).to_frame()
ct2 = pd.cut(df_students['grade_second_exam'], pd.IntervalIndex( second_exam['intervals'] )).to_frame()

p1 = pd.merge(first_exam, ct1, left_on='intervals', right_on='grade_first_exam', how='right')['points']
p2 = pd.merge(second_exam, ct2, left_on='intervals', right_on='grade_second_exam', how='right')['points']

df_students['final_score'] = p1 + p2

print(df_students)

Prints:

  student_name  grade_first_exam  grade_second_exam  final_score
0        Alice                 5                  5            3
1          Bob                 2                  3            2
2         John                 8                  9           10

Upvotes: 2

Efficient way of checking if pandas dataframe is inside given intervals

Answers (2)

Related Questions