gabboshow
gabboshow

Reputation: 5559

intersection of two columns of pandas dataframe

I have 2 pandas dataframes: dataframe1 and dataframe2 that look like this:

mydataframe1
Out[15]: 
    Start   End  
    100     200
    300     450
    500     700


mydataframe2
Out[16]:
  Start   End       Value     
  0       400       0  
  401     499       -1  
  500     1000      1  
  1001    1698      1  

Each row correspond to a segment (start-end). For each segment in dataframe1 I would like to assign a value depending on the values assigned to the segments in dataframe2.

For example:

the first segment in dataframe1 100 200 is included in the first segment of dataframe2 0 400 then I should assign the value 0

the second segment in dataframe1 300 450 is contained in the first 0 400 and second 401 499 segments of dataframe2. In this case I need to split this segments in 2 and assign the 2 corresponding values. ie 300 400 -> value 0 and 401 - 450 value ->-1

the final dataframe1 should look like

mydataframe1
Out[15]: 
    Start   End  Value
    100     200  0
    300     400  0
    401     450  -1
    500     700  1

I hope I was claer..can you help me?

Upvotes: 4

Views: 2765

Answers (2)

The Unfun Cat
The Unfun Cat

Reputation: 31918

Here is an answer using the NCLS library. It does not do the splitting, but rather answers the question in the title and does so really quickly.

Setup:

from ncls import NCLS

contents = """Start   End
100     200
300     450
500     700"""

import pandas as pd
from io import StringIO
df = pd.read_table(StringIO(contents), sep="\s+")

contents2 = """Start   End       Value
0       400       0
401     499       -1
500     1000      1
1001    1698      1"""
df2 = pd.read_table(StringIO(contents2), sep="\s+")

Execution:

n = NCLS(df.Start.values, df.End.values, df.index.values)

x, x2 = n.all_overlaps_both(df2.Start.values, df2.End.values, df2.index.values)

dfx = df.loc[x]
# Start  End
# 0    100  200
# 0    100  200
# 1    300  450
# 2    500  700

df2x = df2.loc[x2]
# Start   End  Value
# 0      0   400      0
# 1    401   499     -1
# 1    401   499     -1
# 2    500  1000      1

dfx.insert(dfx.shape[1], "Value", df2x.Value.values)
# Start  End  Value
# 0    100  200      0
# 0    100  200      0
# 1    300  450     -1
# 2    500  700      1

Upvotes: 0

Martin Valgur
Martin Valgur

Reputation: 6302

I doubt that there is a Pandas method that you can use to solve this directly. You have to calculate the intersections manually to get the result you want. The intervaltree library makes the interval overlap calculation easier and more efficient at least.

IntervalTree.search() returns the (full) intervals that overlap with the provided one but does not calculate their intersection. This is why I also apply the intersect() function I have defined.

import pandas as pd
from intervaltree import Interval, IntervalTree

def intersect(a, b):
    """Intersection of two intervals."""
    intersection = max(a[0], b[0]), min(a[1], b[1])
    if intersection[0] > intersection[1]:
        return None
    return intersection

def interval_df_intersection(df1, df2):
    """Calculate the intersection of two sets of intervals stored in DataFrames.
    The intervals are defined by the "Start" and "End" columns.
    The data in the rest of the columns of df1 is included with the resulting
    intervals."""
    tree = IntervalTree.from_tuples(zip(
            df1.Start.values,
            df1.End.values,
            df1.drop(["Start", "End"], axis=1).values.tolist()
        ))

    intersections = []
    for row in df2.itertuples():
        i1 = Interval(row.Start, row.End)
        intersections += [list(intersect(i1, i2)) + i2.data for i2 in tree[i1]]

    # Make sure the column names are in the correct order
    data_cols = list(df1.columns)
    data_cols.remove("Start")
    data_cols.remove("End")
    return pd.DataFrame(intersections, columns=["Start", "End"] + data_cols)

interval_df_intersection(mydataframe2, mydataframe1)

The result is identical to what you were after.

Upvotes: 1

Related Questions