Reputation: 5559
I have 2 pandas dataframes: dataframe1
and dataframe2
that look like this:
mydataframe1
Out[15]:
Start End
100 200
300 450
500 700
mydataframe2
Out[16]:
Start End Value
0 400 0
401 499 -1
500 1000 1
1001 1698 1
Each row correspond to a segment (start-end). For each segment in dataframe1 I would like to assign a value depending on the values assigned to the segments in dataframe2.
For example:
the first segment in dataframe1 100 200
is included in the first segment of dataframe2 0 400
then I should assign the value 0
the second segment in dataframe1 300 450
is contained in the first 0 400
and second 401 499
segments of dataframe2. In this case I need to split this segments in 2 and assign the 2 corresponding values. ie 300 400 -> value 0
and 401 - 450 value ->-1
the final dataframe1 should look like
mydataframe1
Out[15]:
Start End Value
100 200 0
300 400 0
401 450 -1
500 700 1
I hope I was claer..can you help me?
Upvotes: 4
Views: 2765
Reputation: 31918
Here is an answer using the NCLS library. It does not do the splitting, but rather answers the question in the title and does so really quickly.
Setup:
from ncls import NCLS
contents = """Start End
100 200
300 450
500 700"""
import pandas as pd
from io import StringIO
df = pd.read_table(StringIO(contents), sep="\s+")
contents2 = """Start End Value
0 400 0
401 499 -1
500 1000 1
1001 1698 1"""
df2 = pd.read_table(StringIO(contents2), sep="\s+")
Execution:
n = NCLS(df.Start.values, df.End.values, df.index.values)
x, x2 = n.all_overlaps_both(df2.Start.values, df2.End.values, df2.index.values)
dfx = df.loc[x]
# Start End
# 0 100 200
# 0 100 200
# 1 300 450
# 2 500 700
df2x = df2.loc[x2]
# Start End Value
# 0 0 400 0
# 1 401 499 -1
# 1 401 499 -1
# 2 500 1000 1
dfx.insert(dfx.shape[1], "Value", df2x.Value.values)
# Start End Value
# 0 100 200 0
# 0 100 200 0
# 1 300 450 -1
# 2 500 700 1
Upvotes: 0
Reputation: 6302
I doubt that there is a Pandas method that you can use to solve this directly. You have to calculate the intersections manually to get the result you want. The intervaltree library makes the interval overlap calculation easier and more efficient at least.
IntervalTree.search()
returns the (full) intervals that overlap with the provided one but does not calculate their intersection. This is why I also apply the intersect()
function I have defined.
import pandas as pd
from intervaltree import Interval, IntervalTree
def intersect(a, b):
"""Intersection of two intervals."""
intersection = max(a[0], b[0]), min(a[1], b[1])
if intersection[0] > intersection[1]:
return None
return intersection
def interval_df_intersection(df1, df2):
"""Calculate the intersection of two sets of intervals stored in DataFrames.
The intervals are defined by the "Start" and "End" columns.
The data in the rest of the columns of df1 is included with the resulting
intervals."""
tree = IntervalTree.from_tuples(zip(
df1.Start.values,
df1.End.values,
df1.drop(["Start", "End"], axis=1).values.tolist()
))
intersections = []
for row in df2.itertuples():
i1 = Interval(row.Start, row.End)
intersections += [list(intersect(i1, i2)) + i2.data for i2 in tree[i1]]
# Make sure the column names are in the correct order
data_cols = list(df1.columns)
data_cols.remove("Start")
data_cols.remove("End")
return pd.DataFrame(intersections, columns=["Start", "End"] + data_cols)
interval_df_intersection(mydataframe2, mydataframe1)
The result is identical to what you were after.
Upvotes: 1