Baron Yugovich

Reputation: 4313

Pandas dataframes - join on similar timestamps

I have 2 dataframes,

small_df = 
   time_early            
0, 18:19:20.877154
1, 20:34:24.738802

and large_df, with many more rows

   time_late      
0, 11:12:23.879154
1, 11:12:23.879154            
2, 18:19:20.879154
3, 19:01:20.877154
4, 20:34:24.748802

I want to join them in such a way that every row in small_df is joined to a row in large_df that comes immediately after it, so that the desired result looks something like

   time_early           time_late 
0, 18:19:20.877154      18:19:20.879154
1, 20:34:24.738802      20:34:24.748802

Also, assume that both dataframes may have other columns that I would like to keep in the final result. How do I achieve this? I know I need some kind of merge, but I am not sure exactly which kind.

Upvotes: 0

Views: 450

Answers (2)

Nader Hisham

Reputation: 5414

import pandas as pd

def join_closest_time(row):
    # boolean mask of the large_df rows whose time_late is later than this row's time_early
    time_greater = large_df['time_late'] > row['time_early']
    # take only the first match; it is the closest one to time_early
    # as long as large_df is sorted by time_late in ascending order
    # (.iloc[0] raises IndexError if no later row exists)
    closest = large_df[time_greater].iloc[0]
    # concatenate the two rows (both are Series) into a single row
    return pd.concat([row, closest])

# axis=1 passes each row of small_df to the function as a Series
small_df.apply(join_closest_time, axis=1)


Out[116]:
    time_early          time_late
0   18:19:20.877154 18:19:20.879154
1   20:34:24.738802 20:34:24.748802

If large_df is not sorted by time_late, sort it in ascending order first:

large_df.sort_values(by='time_late', inplace=True)
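The sorted-input requirement above is also what pandas.merge_asof expects, so the same join can probably be written without a row-wise apply. This is only a sketch under some assumptions: the time columns are parsed as timedeltas, and small_id and large_id are hypothetical extra columns added to show that other columns are carried through.

```python
import pandas as pd

# Frames rebuilt from the question; small_id and large_id are hypothetical
# extra columns, included only to show that other columns survive the join.
small_df = pd.DataFrame({
    "time_early": pd.to_timedelta(["18:19:20.877154", "20:34:24.738802"]),
    "small_id": ["a", "b"],
})
large_df = pd.DataFrame({
    "time_late": pd.to_timedelta([
        "11:12:23.879154", "11:12:23.879154", "18:19:20.879154",
        "19:01:20.877154", "20:34:24.748802",
    ]),
    "large_id": [0, 1, 2, 3, 4],
})

# merge_asof requires both frames to be sorted on their join keys.
merged = pd.merge_asof(
    small_df.sort_values("time_early"),
    large_df.sort_values("time_late"),
    left_on="time_early",
    right_on="time_late",
    direction="forward",        # look for the next time_late, not the previous one
    allow_exact_matches=False,  # strictly later, matching the > comparison above
)
print(merged)
```

Rows of small_df with no later time_late keep their own columns and get missing values in the columns coming from large_df.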

Upvotes: 1

Alexander

Reputation: 109636

If any time_late follows a given time_early value, take the first such value (this assumes large_df is sorted by time_late). Otherwise, use None.

small_df['time_late'] = small_df.time_early.apply(
    lambda time: large_df.loc[large_df.time_late > time, 'time_late'].iloc[0]
    if large_df.time_late.gt(time).any() else None)

>>> small_df
        time_early        time_late
0  18:19:20.877154  18:19:20.879154
1  20:34:24.738802  20:34:24.748802
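The "first later value, otherwise None" rule can also be sketched without a per-row lambda by using a binary search over the sorted time_late values. The example below assumes the times are parsed as timedeltas, and it adds a hypothetical third time_early (23:59:59) that has no later match, to exercise the fallback. The missing value comes out as NaT rather than None.

```python
import numpy as np
import pandas as pd

# Frames rebuilt from the question; the third time_early is hypothetical.
small_df = pd.DataFrame({
    "time_early": pd.to_timedelta(
        ["18:19:20.877154", "20:34:24.738802", "23:59:59"]
    ),
})
large_df = pd.DataFrame({
    "time_late": pd.to_timedelta([
        "11:12:23.879154", "11:12:23.879154", "18:19:20.879154",
        "19:01:20.877154", "20:34:24.748802",
    ]),
})

late_values = np.sort(large_df["time_late"].to_numpy())
early_values = small_df["time_early"].to_numpy()

# side="right" gives, for each time_early, the position of the first
# time_late that is strictly greater than it.
pos = np.searchsorted(late_values, early_values, side="right")

# Append one NaT so that "no later value" (pos == len(late_values))
# indexes a missing value instead of running off the end of the array.
padded = np.append(late_values, np.timedelta64("NaT", "ns"))
result = small_df.assign(time_late=padded[pos])
print(result)
```

This only fills in time_late. To bring other large_df columns along, the same pos array could be used to index a sorted copy of large_df.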

Upvotes: 0
