intersection of two columns of pandas dataframe

Question

I have 2 pandas dataframes: dataframe1 and dataframe2 that look like this:

mydataframe1
Out[15]: 
    Start   End  
    100     200
    300     450
    500     700


mydataframe2
Out[16]:
  Start   End       Value     
  0       400       0  
  401     499       -1  
  500     1000      1  
  1001    1698      1

Each row correspond to a segment (start-end). For each segment in dataframe1 I would like to assign a value depending on the values assigned to the segments in dataframe2.

For example:

the first segment in dataframe1 100 200 is included in the first segment of dataframe2 0 400 then I should assign the value 0

the second segment in dataframe1 300 450 is contained in the first 0 400 and second 401 499 segments of dataframe2. In this case I need to split this segments in 2 and assign the 2 corresponding values. ie 300 400 -> value 0 and 401 - 450 value ->-1

the final dataframe1 should look like

mydataframe1
Out[15]: 
    Start   End  Value
    100     200  0
    300     400  0
    401     450  -1
    500     700  1

I hope I was claer..can you help me?

Martin Valgur · Accepted Answer

I doubt that there is a Pandas method that you can use to solve this directly. You have to calculate the intersections manually to get the result you want. The intervaltree library makes the interval overlap calculation easier and more efficient at least.

IntervalTree.search() returns the (full) intervals that overlap with the provided one but does not calculate their intersection. This is why I also apply the intersect() function I have defined.

import pandas as pd
from intervaltree import Interval, IntervalTree

def intersect(a, b):
    """Intersection of two intervals."""
    intersection = max(a[0], b[0]), min(a[1], b[1])
    if intersection[0] > intersection[1]:
        return None
    return intersection

def interval_df_intersection(df1, df2):
    """Calculate the intersection of two sets of intervals stored in DataFrames.
    The intervals are defined by the "Start" and "End" columns.
    The data in the rest of the columns of df1 is included with the resulting
    intervals."""
    tree = IntervalTree.from_tuples(zip(
            df1.Start.values,
            df1.End.values,
            df1.drop(["Start", "End"], axis=1).values.tolist()
        ))

    intersections = []
    for row in df2.itertuples():
        i1 = Interval(row.Start, row.End)
        intersections += [list(intersect(i1, i2)) + i2.data for i2 in tree[i1]]

    # Make sure the column names are in the correct order
    data_cols = list(df1.columns)
    data_cols.remove("Start")
    data_cols.remove("End")
    return pd.DataFrame(intersections, columns=["Start", "End"] + data_cols)

interval_df_intersection(mydataframe2, mydataframe1)

The result is identical to what you were after.

intersection of two columns of pandas dataframe

Answers (2)

Related Questions