Reputation: 971
I am dealing with the following data frame:
id1 id2 lat lon
0 1 2 18.46290 -66.10301
1 1 2 37.549 -71.477
2 1 2 41.490 -71.326
3 0 0 0.0 0.0
4 1 3 42.058 -70.577
5 1 3 42.402 -70.642
6 1 3 41.589 -71.281
7 1 3 37.649 -71.440
8 0 0 0.0 0.0
9 2 2 18.462 -66.103
10 2 2 18.44857 -66.089
11 2 2 42.242 -70.97
12 0 0 0.0 0.0
The above data are trajectory data and I want to find all the trajectories which are starting OR ending in a city, say NYC.
Each trajectory has multiple points, and the trajectories are separated by all 0's row in the above df.
So, rows 0-2 is one trajectory, rows 4-7 another and so on. (Even if all 0's rows are required to be removed, "id1" and "id2" together identifies the different trajectories, like as we can see whenever id1 or id2 are changing, we are getting new trajectory).
I have another df, say df2 which contains all the coordinates of a particular region of the city in which I want the trajectories to be restricted:
lat lon
0 40.711 -70.000
1 40.734 -70.099
2 40.740 -70.110
3 40.750 -70.120
I want to match the first point and the last point of each trajectory from df1 with the coordinates in df2, if either the first point OR the last point matches with any coordinate pair in df2, I want to keep those trajectories only.
(As I mentioned I want to keep only trajectories which are ending and starting in the city, so in case of "rows 0-2" in df1, I would want to match lat and lon of row 0(starting point) and row 2(ending point of a trajectory) with df2, for "rows 4-7", I would match lat and lon of row 4(starting point) and then row 7(ending point) and so on for the whole df).
Maybe I could have found the solution if I had to just search for rows with particular "lat" and "lon", like in the following code:
mask = ((df["lat"].isin(df2["lat"])) && (df["lon"].isin(df2["lon"])))
new_df = pd.DataFrame(df[mask])
new_df.head()
But I have to query the starting and ending point of each trajectory which are separated by 0's. I have no clue, how to do that.
I hope the question is clear, let me know if anything is unclear.
Any help would be appreciated.
Upvotes: 1
Views: 1828
Reputation: 210852
Consider the following input DataFrames:
In [158]: df
Out[158]:
id1 id2 lat lon
0 1 2 18.46290 -66.10301
1 1 2 37.54900 -71.47700
2 1 2 41.49000 -71.32600
3 0 0 0.00000 0.00000
4 1 3 42.05800 -70.57700 # matching point
5 1 3 42.40200 -70.64200
6 1 3 41.58900 -71.28100
7 1 3 37.64900 -71.44000
8 0 0 0.00000 0.00000
9 2 2 18.46200 -66.10300
10 2 2 18.44857 -66.08900
11 2 2 42.24200 -70.97000 # matching point
12 0 0 0.00000 0.00000
In [159]: df2
Out[159]:
lat lon
0 40.711 -70.000
1 40.734 -70.099
2 40.740 -70.110
3 40.750 -70.120
4 42.058 -70.577 # matching point
5 42.242 -70.970 # matching point
let's find indexes of the starting and ending points for all trajectories:
In [164]: idx = df.loc[(df.id1!=0) & (df.id2!=0)] \
...: .groupby(['id1','id2'])['lat','lon'] \
...: .apply(lambda x: pd.Series([x.index[0], x.index[-1]])) \
...: .stack() \
...: .values
...:
In [165]: idx
Out[165]: array([ 0, 2, 4, 7, 9, 11], dtype=int64)
Let's multiply coordinates to 100 and truncate them to integers (because comparing floats in Python/Pandas is "evil"):
df2 = df2.mul(100).astype(int)
# `d` - will contain only starting and ending points for each trajectory
d = df.loc[idx]
d.loc[:, ['lat','lon']] = d[['lat','lon']].mul(100).astype(int)
now we have all coordinates as integers:
In [181]: d
Out[181]:
id1 id2 lat lon
0 1 2 1846 -6610
2 1 2 4149 -7132
4 1 3 4205 -7057
7 1 3 3764 -7144
9 2 2 1846 -6610
11 2 2 4224 -7097
In [163]: df2
Out[163]:
lat lon
0 4071 -7000
1 4073 -7009
2 4074 -7011
3 4075 -7012
4 4205 -7057
5 4224 -7097
so we can easily merge them:
In [185]: d.merge(df2)
Out[185]:
id1 id2 lat lon
0 1 3 4205 -7057
1 2 2 4224 -7097
and merge it again with the original DF:
In [186]: d.merge(df2)[['id1','id2']].merge(df)
Out[186]:
id1 id2 lat lon
0 1 3 42.05800 -70.577
1 1 3 42.40200 -70.642
2 1 3 41.58900 -71.281
3 1 3 37.64900 -71.440
4 2 2 18.46200 -66.103
5 2 2 18.44857 -66.089
6 2 2 42.24200 -70.970
Upvotes: 2