Liza
Liza

Reputation: 971

How to filter pandas data frame?

I am dealing with the following data frame:

      id1     id2      lat        lon
   0  1       2        18.46290   -66.10301
   1  1       2        37.549     -71.477
   2  1       2        41.490     -71.326
   3  0       0        0.0        0.0
   4  1       3        42.058     -70.577
   5  1       3        42.402     -70.642
   6  1       3        41.589     -71.281
   7  1       3        37.649     -71.440
   8  0       0        0.0        0.0
   9  2       2        18.462     -66.103
  10  2       2        18.44857   -66.089
  11  2       2        42.242     -70.97
  12  0       0        0.0        0.0 

The above data are trajectory data and I want to find all the trajectories which are starting OR ending in a city, say NYC.

Each trajectory has multiple points, and the trajectories are separated by all 0's row in the above df.

So, rows 0-2 is one trajectory, rows 4-7 another and so on. (Even if all 0's rows are required to be removed, "id1" and "id2" together identifies the different trajectories, like as we can see whenever id1 or id2 are changing, we are getting new trajectory).

I have another df, say df2 which contains all the coordinates of a particular region of the city in which I want the trajectories to be restricted:

          lat     lon
       0  40.711  -70.000
       1  40.734  -70.099
       2  40.740  -70.110
       3  40.750  -70.120

I want to match the first point and the last point of each trajectory from df1 with the coordinates in df2, if either the first point OR the last point matches with any coordinate pair in df2, I want to keep those trajectories only.

(As I mentioned I want to keep only trajectories which are ending and starting in the city, so in case of "rows 0-2" in df1, I would want to match lat and lon of row 0(starting point) and row 2(ending point of a trajectory) with df2, for "rows 4-7", I would match lat and lon of row 4(starting point) and then row 7(ending point) and so on for the whole df).

Maybe I could have found the solution if I had to just search for rows with particular "lat" and "lon", like in the following code:

       mask = ((df["lat"].isin(df2["lat"])) && (df["lon"].isin(df2["lon"])))
       new_df = pd.DataFrame(df[mask])
       new_df.head()

But I have to query the starting and ending point of each trajectory which are separated by 0's. I have no clue, how to do that.

I hope the question is clear, let me know if anything is unclear.

Any help would be appreciated.

Upvotes: 1

Views: 1828

Answers (1)

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210852

Consider the following input DataFrames:

In [158]: df
Out[158]:
    id1  id2       lat       lon
0     1    2  18.46290 -66.10301
1     1    2  37.54900 -71.47700
2     1    2  41.49000 -71.32600
3     0    0   0.00000   0.00000
4     1    3  42.05800 -70.57700   # matching point
5     1    3  42.40200 -70.64200
6     1    3  41.58900 -71.28100
7     1    3  37.64900 -71.44000
8     0    0   0.00000   0.00000
9     2    2  18.46200 -66.10300
10    2    2  18.44857 -66.08900
11    2    2  42.24200 -70.97000   # matching point
12    0    0   0.00000   0.00000

In [159]: df2
Out[159]:
      lat     lon
0  40.711 -70.000
1  40.734 -70.099
2  40.740 -70.110
3  40.750 -70.120
4  42.058 -70.577   # matching point
5  42.242 -70.970   # matching point

let's find indexes of the starting and ending points for all trajectories:

In [164]: idx = df.loc[(df.id1!=0) & (df.id2!=0)] \
     ...:         .groupby(['id1','id2'])['lat','lon'] \
     ...:         .apply(lambda x: pd.Series([x.index[0], x.index[-1]])) \
     ...:         .stack() \
     ...:         .values
     ...:

In [165]: idx
Out[165]: array([ 0,  2,  4,  7,  9, 11], dtype=int64)

Let's multiply coordinates to 100 and truncate them to integers (because comparing floats in Python/Pandas is "evil"):

df2 = df2.mul(100).astype(int)

# `d` - will contain only starting and ending points for each trajectory
d =  df.loc[idx]
d.loc[:, ['lat','lon']] = d[['lat','lon']].mul(100).astype(int)

now we have all coordinates as integers:

In [181]: d
Out[181]:
    id1  id2   lat   lon
0     1    2  1846 -6610
2     1    2  4149 -7132
4     1    3  4205 -7057
7     1    3  3764 -7144
9     2    2  1846 -6610
11    2    2  4224 -7097


In [163]: df2
Out[163]:
    lat   lon
0  4071 -7000
1  4073 -7009
2  4074 -7011
3  4075 -7012
4  4205 -7057
5  4224 -7097

so we can easily merge them:

In [185]: d.merge(df2)
Out[185]:
   id1  id2   lat   lon
0    1    3  4205 -7057
1    2    2  4224 -7097

and merge it again with the original DF:

In [186]: d.merge(df2)[['id1','id2']].merge(df)
Out[186]:
   id1  id2       lat     lon
0    1    3  42.05800 -70.577
1    1    3  42.40200 -70.642
2    1    3  41.58900 -71.281
3    1    3  37.64900 -71.440
4    2    2  18.46200 -66.103
5    2    2  18.44857 -66.089
6    2    2  42.24200 -70.970

Upvotes: 2

Related Questions