Reputation: 89
I need to combine two dataframes that contain information about train track sections: while the "Line" identifies a track section, the two attributes "A" and "B" are given for subsections of the Line defined by start point and end point on the line; these subsections do not match between the two dataframes:
df1
Line startpoint endpoint Attribute_A
100 2.506 2.809 B-70
100 2.809 2.924 B-91
100 2.924 4.065 B-84
100 4.065 4.21 B-70
100 4.21 4.224 B-91
...
df2
Line startpoint endpoint Attribute_B
100 2.5 2.6 140
100 2.6 2.7 158
100 2.7 2.8 131
100 2.8 2.9 124
100 2.9 3.0 178
...
What I would need is a merged dataframe that gives me the combination of Attributes A and B for the respective minimal subsections where they are shared:
df3
Line startpoint endpoint Attribute_A Attribute_B
100 2.5 2.506 nan 140
100 2.506 2.6 B-70 140
100 2.6 2.7 B-70 158
100 2.7 2.8 B-70 131
100 2.8 2.809 B-70 124
100 2.809 2.9 B-91 124
100 2.9 2.924 B-91 178
100 2.924 3.0 B-84 178
...
How can I best do this in Python? I'm somewhat new to it, and while I can handle basic calculations between rows and columns, I'm at my wit's end with this problem. Merging and sorting the two dataframes and calculating the respective differences between start and end points didn't get me very far, and I can't seem to find applicable information on the forums. I'm grateful for any hint!
Upvotes: 1
Views: 311
Reputation: 7496
Here is my solution, a bit long but it works:
First step is finding the intervals:
all_start_points = set(df1['startpoint'].values.tolist() + df2['startpoint'].values.tolist())
all_end_points = set(df1['endpoint'].values.tolist() + df2['endpoint'].values.tolist())
all_points = sorted(list(all_start_points.union(all_end_points)))
intervals = [(start, end) for start, end in zip(all_points[:-1], all_points[1:])]
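To make the interval step concrete, here is a minimal sketch using only the first two rows of each dataframe from the question. The variable names `starts` and `ends` are just for illustration; the point is that every start or end point becomes a breakpoint, and consecutive breakpoints form the minimal subsections.

```python
# Breakpoints from the first two rows of df1 and df2 in the question.
starts = [2.506, 2.809, 2.5, 2.6]
ends = [2.809, 2.924, 2.6, 2.7]

# A set removes duplicates (2.809 and 2.6 occur both as a start and as an end).
all_points = sorted(set(starts) | set(ends))

# Pair each breakpoint with its successor to get the minimal subsections.
intervals = list(zip(all_points[:-1], all_points[1:]))
print(intervals)
```

Six distinct breakpoints give five intervals, the first being (2.5, 2.506), which only df2 covers.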
Then we need to find the relevant interval in each dataframe (if present):
import numpy as np
import pandas as pd
def find_interval(df, interval):
    # Rows whose section fully covers the given (start, end) interval
    return df[(df['startpoint'] <= interval[0]) &
              (df['endpoint'] >= interval[1])]
attr_A = [find_interval(df1, intv)['Attribute_A'] for intv in intervals]
attr_A = [el.iloc[0] if len(el)>0 else np.nan for el in attr_A]
attr_B = [find_interval(df2, intv)['Attribute_B'] for intv in intervals]
attr_B = [el.iloc[0] if len(el)>0 else np.nan for el in attr_B]
Finally, we put everything together:
out = pd.DataFrame(intervals, columns = ['startpoint', 'endpoint'])
out = pd.concat([out, pd.Series(attr_A).to_frame('Attribute_A'), pd.Series(attr_B).to_frame('Attribute_B')], axis = 1)
out['Line'] = 100  # hard-coded: the sample data only contains Line 100
And I get the expected result:
out
Out[111]:
startpoint endpoint Attribute_A Attribute_B Line
0 2.500 2.506 NaN 140.0 100
1 2.506 2.600 B-70 140.0 100
2 2.600 2.700 B-70 158.0 100
3 2.700 2.800 B-70 131.0 100
4 2.800 2.809 B-70 124.0 100
5 2.809 2.900 B-91 124.0 100
6 2.900 2.924 B-91 178.0 100
7 2.924 3.000 B-84 178.0 100
8 3.000 4.065 B-84 NaN 100
9 4.065 4.210 B-70 NaN 100
10 4.210 4.224 B-91 NaN 100
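The steps above work on a single Line. If the real data contains several Lines, one possible way to extend the idea is to repeat the same procedure per Line and concatenate the results. The sketch below is an assumption-laden variant, not part of the original answer: the helper names `split_line` and `combine` are hypothetical, the second Line (200) is made-up sample data, and it assumes that sections within one dataframe do not overlap for a given Line.

```python
import numpy as np
import pandas as pd

def split_line(d1, d2, attr1="Attribute_A", attr2="Attribute_B"):
    """Combine two section tables of a single Line into minimal subsections."""
    # All distinct start and end points, sorted (np.unique sorts its result).
    points = np.unique(np.concatenate(
        [d1["startpoint"], d1["endpoint"], d2["startpoint"], d2["endpoint"]]))
    out = pd.DataFrame({"startpoint": points[:-1], "endpoint": points[1:]})

    def lookup(d, attr):
        # For each minimal subsection, take the attribute of the covering row.
        values = []
        for s, e in zip(out["startpoint"], out["endpoint"]):
            hit = d[(d["startpoint"] <= s) & (d["endpoint"] >= e)]
            values.append(hit[attr].iloc[0] if len(hit) else np.nan)
        return values

    out[attr1] = lookup(d1, attr1)
    out[attr2] = lookup(d2, attr2)
    return out

def combine(df1, df2):
    """Apply split_line separately to every Line found in either dataframe."""
    parts = []
    for line in sorted(set(df1["Line"]) | set(df2["Line"])):
        part = split_line(df1[df1["Line"] == line], df2[df2["Line"] == line])
        part.insert(0, "Line", line)
        parts.append(part)
    return pd.concat(parts, ignore_index=True)

# Line 100 rows come from the question; Line 200 is hypothetical sample data.
df1 = pd.DataFrame({"Line": [100, 100, 200],
                    "startpoint": [2.506, 2.809, 0.0],
                    "endpoint": [2.809, 2.924, 1.0],
                    "Attribute_A": ["B-70", "B-91", "B-84"]})
df2 = pd.DataFrame({"Line": [100, 100, 200],
                    "startpoint": [2.5, 2.6, 0.5],
                    "endpoint": [2.6, 2.7, 1.5],
                    "Attribute_B": [140, 158, 131]})
result = combine(df1, df2)
print(result)
```

The row-by-row lookup keeps the logic close to the answer above; for large tables it will be slow, since every minimal subsection scans the whole dataframe.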
Upvotes: 1