Reputation: 513
I have two pandas dataframes. The first dataframe contains the location of different circles over time. Example: df1 =
x y time
circle
1.0 235 133 1.0
2.0 236 133 1.0
3.0 245 425 1.0
4.0 215 325 2.0
5.0 287 203 4.0
6.0 394 394 5.0
The second dataframe is organised exactly like the first, but contains the locations of squares at different times. Example: df2 =
x y time
square
1.0 243 233 1.0
1.0 293 436 2.0
2.0 189 230 3.0
2.0 189 233 4.0
3.0 176 203 4.0
3.0 374 394 5.0
I would like to figure out how to loop through the dataframe df1 to access all the present squares in df2 at each time point, to find out which is closest.
Example output:
x y time closest_sq sq_x sq_y
circle
1.0 235 133 1.0 1.0 243 233
2.0 236 133 1.0 1.0 243 233
3.0 245 425 1.0 1.0 243 233
4.0 215 325 2.0 1.0 243 233
5.0 287 203 4.0 2.0 189 233
6.0 394 394 5.0 3.0 374 394
I'm guessing I have to use either iterrows() or itertuples() in a for loop to get at this but I'm not sure, and scipy cdist to get the distance.
Upvotes: 1
Views: 2651
Reputation: 2522
Let's first make our two DataFrames, circles and squares.
import pandas as pd
import numpy as np
circles = pd.DataFrame({'circle': {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0, 4: 5.0, 5: 6.0},
'time': {0: 1.0, 1: 1.0, 2: 1.0, 3: 2.0, 4: 4.0, 5: 5.0},
'x': {0: 235, 1: 236, 2: 245, 3: 215, 4: 287, 5: 394},
'y': {0: 133, 1: 133, 2: 425, 3: 325, 4: 203, 5: 394}})
circles
-------------------------------
circle time x y
0 1.0 1.0 235 133
1 2.0 1.0 236 133
2 3.0 1.0 245 425
3 4.0 2.0 215 325
4 5.0 4.0 287 203
5 6.0 5.0 394 394
squares = pd.DataFrame({'square': {0: 1.0, 1: 1.0, 2: 2.0, 3: 2.0, 4: 3.0, 5: 3.0},
'time': {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0, 4: 4.0, 5: 5.0},
'x': {0: 243, 1: 293, 2: 189, 3: 189, 4: 176, 5: 374},
'y': {0: 233, 1: 436, 2: 230, 3: 233, 4: 203, 5: 394}})\
.set_index('time')
squares
------------------
square x y
time
1.0 1.0 243 233
2.0 1.0 293 436
3.0 2.0 189 230
4.0 2.0 189 233
4.0 3.0 176 203
5.0 3.0 374 394
So those are our DataFrames, let's define a function that we can apply to each circle in circles which will return us the corresponding closest square.
def closest_square(circleseries):
"""
This function takes a pd.Series which is what pd.DataFrame.apply gives
the function that you apply, in this case to the DataFrame circles. So
the pd.Series it will get from apply will represent a single circle.
We use loc to get the squares at the correct time point which gives us
a DataFrame that is a subset of squares. We then use assign to give
that DataFrame an extra column with the distance and the ID of the
circle. Finally we sort this whole thing in ascending order by the
distance to our circle, take the square with the lowest distance to
our circle (iloc[0]). We also reset the index because we don't want
the time column from squares to be our but a regular column.
"""
return squares.loc[[circleseries.time], :]\
.assign(distance = lambda s: np.sqrt( (s.x- circleseries.x)**2 + (s.y- circleseries.y)**2),
circle=circleseries.circle)\
.sort_values('distance', ascending=True)\
.reset_index()\
.iloc[0]
Let's apply the function to circles.
circles.apply(closest_square, axis=1)
--------------------------------------------------------
time square x y circle distance
0 1.0 1.0 243.0 233.0 1.0 100.319490
1 1.0 1.0 243.0 233.0 2.0 100.244701
2 1.0 1.0 243.0 233.0 3.0 192.010416
3 2.0 1.0 293.0 436.0 4.0 135.665029
4 4.0 2.0 189.0 233.0 5.0 102.489024
5 5.0 3.0 374.0 394.0 6.0 20.000000
Because we added the circle column in our apply, we can simply merge on it.
circles.merge(circles.apply(closest_square, axis=1), on='circle')
----------------------------------------------------------------------
circle time_x x_x y_x time_y square x_y y_y distance
0 1.0 1.0 235 133 1.0 1.0 243.0 233.0 100.319490
1 2.0 1.0 236 133 1.0 1.0 243.0 233.0 100.244701
2 3.0 1.0 245 425 1.0 1.0 243.0 233.0 192.010416
3 4.0 2.0 215 325 2.0 1.0 293.0 436.0 135.665029
4 5.0 4.0 287 203 4.0 2.0 189.0 233.0 102.489024
5 6.0 5.0 394 394 5.0 3.0 374.0 394.0 20.000000
Upvotes: 1
Reputation: 697
You can user pd.merge()
to join the dataframes and then use df.loc
. Here's how you can do this with your dataframes:
df3 = pd.merge(df1,df2,on='time',how='inner')
df4 = df3.loc[df3['time'] == 1.0]
df4[['circle','square','time']].head()
You can optimize the above code by using inplace=True
.
Upvotes: 3