UserR6
UserR6

Reputation: 513

Looping through a pandas Dataframe to get values from another Dataframe

I have two pandas dataframes. The first dataframe contains the location of different circles over time. Example: df1 =

               x      y      time
circle
 1.0          235    133     1.0
 2.0          236    133     1.0
 3.0          245    425     1.0
 4.0          215    325     2.0
 5.0          287    203     4.0
 6.0          394    394     5.0

The second dataframe is organised exactly like the first, but contains the locations of squares at different times. Example: df2 =

               x      y      time
square
 1.0          243    233     1.0
 1.0          293    436     2.0
 2.0          189    230     3.0
 2.0          189    233     4.0
 3.0          176    203     4.0
 3.0          374    394     5.0

I would like to figure out how to loop through the dataframe df1 to access all the present squares in df2 at each time point, to find out which is closest.

Example output:

               x      y      time      closest_sq      sq_x      sq_y
circle
 1.0          235    133     1.0          1.0           243       233
 2.0          236    133     1.0          1.0           243       233
 3.0          245    425     1.0          1.0           243       233
 4.0          215    325     2.0          1.0           243       233
 5.0          287    203     4.0          2.0           189       233    
 6.0          394    394     5.0          3.0           374       394 

I'm guessing I have to use either iterrows() or itertuples() in a for loop to get at this but I'm not sure, and scipy cdist to get the distance.

Upvotes: 1

Views: 2651

Answers (2)

tobsecret
tobsecret

Reputation: 2522

Let's first make our two DataFrames, circles and squares.

import pandas as pd
import numpy as np
circles = pd.DataFrame({'circle': {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0, 4: 5.0, 5: 6.0},
         'time': {0: 1.0, 1: 1.0, 2: 1.0, 3: 2.0, 4: 4.0, 5: 5.0},
         'x': {0: 235, 1: 236, 2: 245, 3: 215, 4: 287, 5: 394},
         'y': {0: 133, 1: 133, 2: 425, 3: 325, 4: 203, 5: 394}})
circles
-------------------------------
        circle  time    x   y
    0   1.0     1.0    235  133
    1   2.0     1.0    236  133
    2   3.0     1.0    245  425
    3   4.0     2.0    215  325
    4   5.0     4.0    287  203
    5   6.0     5.0    394  394

squares = pd.DataFrame({'square': {0: 1.0, 1: 1.0, 2: 2.0, 3: 2.0, 4: 3.0, 5: 3.0},
                        'time': {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0, 4: 4.0, 5: 5.0},
                        'x': {0: 243, 1: 293, 2: 189, 3: 189, 4: 176, 5: 374},
                        'y': {0: 233, 1: 436, 2: 230, 3: 233, 4: 203, 5: 394}})\
                        .set_index('time')

squares
------------------
    square  x   y
time            
1.0  1.0    243 233
2.0  1.0    293 436
3.0  2.0    189 230
4.0  2.0    189 233
4.0  3.0    176 203
5.0  3.0    374 394

So those are our DataFrames, let's define a function that we can apply to each circle in circles which will return us the corresponding closest square.

def closest_square(circleseries):
"""
This function takes a pd.Series which is what pd.DataFrame.apply gives 
the function that you apply, in this case to the DataFrame circles. So 
the pd.Series it will get from apply will represent a single circle. 
We use loc to get the squares at the correct time point which gives us
a DataFrame that is a subset of squares. We then use assign to give
that DataFrame an extra column with the distance and the ID of the
circle. Finally we sort this whole thing in ascending order by the 
distance to our circle, take the square with the lowest distance to
our circle (iloc[0]). We also reset the index because we don't want
the time column from squares to be our but a regular column.
"""
return squares.loc[[circleseries.time], :]\
           .assign(distance = lambda s: np.sqrt( (s.x- circleseries.x)**2 + (s.y- circleseries.y)**2),
 circle=circleseries.circle)\
           .sort_values('distance', ascending=True)\
           .reset_index()\
           .iloc[0]

Let's apply the function to circles.

circles.apply(closest_square, axis=1)
--------------------------------------------------------
        time    square  x       y       circle  distance
   0    1.0     1.0    243.0    233.0   1.0    100.319490
   1    1.0     1.0    243.0    233.0   2.0    100.244701
   2    1.0     1.0    243.0    233.0   3.0    192.010416
   3    2.0     1.0    293.0    436.0   4.0    135.665029
   4    4.0     2.0    189.0    233.0   5.0    102.489024
   5    5.0     3.0    374.0    394.0   6.0    20.000000

Because we added the circle column in our apply, we can simply merge on it.

circles.merge(circles.apply(closest_square, axis=1), on='circle')
----------------------------------------------------------------------
    circle  time_x  x_x y_x time_y  square  x_y     y_y     distance
0   1.0     1.0     235 133 1.0     1.0     243.0   233.0   100.319490
1   2.0     1.0     236 133 1.0     1.0     243.0   233.0   100.244701
2   3.0     1.0     245 425 1.0     1.0     243.0   233.0   192.010416
3   4.0     2.0     215 325 2.0     1.0     293.0   436.0   135.665029
4   5.0     4.0     287 203 4.0     2.0     189.0   233.0   102.489024
5   6.0     5.0     394 394 5.0     3.0     374.0   394.0   20.000000

Upvotes: 1

Shrinivas Deshmukh
Shrinivas Deshmukh

Reputation: 697

You can user pd.merge() to join the dataframes and then use df.loc. Here's how you can do this with your dataframes:

df3 = pd.merge(df1,df2,on='time',how='inner')
df4 = df3.loc[df3['time'] == 1.0]
df4[['circle','square','time']].head()

You can optimize the above code by using inplace=True.

Upvotes: 3

Related Questions