I'm new to pandas and I don't know the best way to do this.
I have two files which I've placed in two different dataframes:
>> frame1.head()
Out[64]:
Date and Time Sample Unnamed: 2
0 05/18/2017 08:38:37:490 163.7 NaN
1 05/18/2017 08:39:37:490 164.5 NaN
2 05/18/2017 08:40:37:490 148.7 NaN
3 05/18/2017 08:41:37:490 111.2 NaN
4 05/18/2017 08:42:37:490 83.6 NaN
>>frame2.head()
Out[66]:
Date and Time Sample Unnamed: 2
0 05/18/2017 08:38:38:490 7.5 NaN
1 05/18/2017 08:39:38:490 7.5 NaN
2 05/18/2017 08:40:38:490 7.5 NaN
3 05/18/2017 08:41:38:490 7.5 NaN
4 05/18/2017 08:42:38:490 7.5 NaN
I need to "merge" any row from frame 1, with any row in frame 2, that are within one second of each other.
For example, this row from frame 1:
0 05/18/2017 08:38:37:490 163.7 NaN
is within one second of this row from frame 2:
0 05/18/2017 08:38:38:490 7.5 NaN
So when they are "merged", the output should look like this:
0 05/18/2017 08:38:37:490 163.7 7.5 NaN NaN
In other words, one row has its time replaced by the other's, and all of the remaining columns are just appended.
The closest I've come up with is to do something like:
d3 = pd.merge(frame1, frame2, on='Date and Time (MM/DD/YYYY HH:MM:SS:sss)', how='outer')
>>d3.head()
Date and Time Sample_x Unnamed: 2_x Sample_y Unnamed: 2_y
0 05/18/2017 08:38:37:490 163.7 NaN NaN NaN
1 05/18/2017 08:39:37:490 164.5 NaN NaN NaN
2 05/18/2017 08:40:37:490 148.7 NaN NaN NaN
3 05/18/2017 08:41:37:490 111.2 NaN NaN NaN
4 05/18/2017 08:42:37:490 83.6 NaN NaN NaN
But that isn't a conditional merge. I need to merge rows if they are within one second of each other, not just when the times are exactly the same.
I know I can compare the times with something like:
import datetime
def compare_time(current, temp, sec=1):
    # True when the two timestamps are within `sec` seconds of each other
    return abs(current - temp) <= datetime.timedelta(seconds=sec)
then use .apply() or something... but I have no idea how to piece all this together.
EDIT: it looks like pd.merge_asof does a good job, but I also need to retain the lines that aren't matched / merged in the final frame as well
EDIT 2:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'datetime': pd.date_range('1-1-2017', periods=4, freq='s'),
                    'sample': np.arange(4) + 100})
df2 = pd.DataFrame({'datetime': pd.date_range('1-1-2017', periods=4, freq='300ms'),
                    'sample': np.arange(4)})
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
blah = pd.concat([pd.merge_asof(df2, df1, on='datetime', tolerance=pd.Timedelta('1s')),
                  df1.rename(columns={'sample': 'sample_x'})]).drop_duplicates('sample_x')
blah
returns:
datetime sample_x sample_y
0 2017-01-01 00:00:00.000 0 100.0
1 2017-01-01 00:00:00.300 1 100.0
2 2017-01-01 00:00:00.600 2 100.0
3 2017-01-01 00:00:00.900 3 100.0
0 2017-01-01 00:00:00.000 100 NaN
1 2017-01-01 00:00:01.000 101 NaN
2 2017-01-01 00:00:02.000 102 NaN
3 2017-01-01 00:00:03.000 103 NaN
Notice that it's retaining the original row indexes (zero is listed twice).
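For reference, the duplicated labels come from stacking two frames that each carry their own 0..n-1 index. The sketch below reproduces that on two tiny made-up frames (top and bottom are hypothetical names, not from the data above) and shows one way to renumber, assuming pd.concat with ignore_index=True is acceptable for the final frame:

```python
import pandas as pd

# Two small illustrative frames; each gets its own default 0..n-1 index.
top = pd.DataFrame({'sample_x': [0, 1]})
bottom = pd.DataFrame({'sample_x': [100, 101]})

# Default behaviour: the original labels are kept, so 0 and 1 appear twice.
stacked = pd.concat([top, bottom])

# ignore_index=True discards the old labels and numbers the rows 0..n-1.
renumbered = pd.concat([top, bottom], ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```

Calling .reset_index(drop=True) on an already combined frame should have the same effect.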
Upvotes: 5
Views: 3877
Reputation: 30404
You can use merge_asof as @Wen suggests, but be sure to specify the optional tolerance argument. Also consider setting the optional direction argument for your match, which can be 'backward' (default), 'nearest', or 'forward'.
pd.merge_asof( df1, df2, on='datetime', tolerance=pd.Timedelta('1s') )
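Applied to the first rows shown in the question, that call might look like the sketch below. It assumes the timestamps are still strings in MM/DD/YYYY HH:MM:SS:sss form (so they have to be parsed first), that the key column is literally named 'Date and Time', and that direction='nearest' is what you want, since the frame2 times fall after the frame1 times:

```python
import pandas as pd

# First two rows of each frame from the question, timestamps still as strings.
frame1 = pd.DataFrame({'Date and Time': ['05/18/2017 08:38:37:490',
                                         '05/18/2017 08:39:37:490'],
                       'Sample': [163.7, 164.5]})
frame2 = pd.DataFrame({'Date and Time': ['05/18/2017 08:38:38:490',
                                         '05/18/2017 08:39:38:490'],
                       'Sample': [7.5, 7.5]})

# Assumed format: note the colon (not a dot) before the milliseconds.
fmt = '%m/%d/%Y %H:%M:%S:%f'
for frame in (frame1, frame2):
    frame['Date and Time'] = pd.to_datetime(frame['Date and Time'], format=fmt)

# merge_asof requires both frames to be sorted on the key.
merged = pd.merge_asof(frame1.sort_values('Date and Time'),
                       frame2.sort_values('Date and Time'),
                       on='Date and Time',
                       direction='nearest',
                       tolerance=pd.Timedelta('1s'))
print(merged)
```

Each merged row keeps the time from frame1 (the left frame) and gains a Sample_y column from the frame2 row it was paired with.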
Here's a longer explanation with sample data (Note I'm just creating new sample data since I can only see the first few rows of your actual data):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'datetime': pd.date_range('1-1-2017', periods=4, freq='s'),
                    'sample': np.arange(4) + 100})
df2 = pd.DataFrame({'datetime': pd.date_range('1-1-2017', periods=4, freq='300ms'),
                    'sample': np.arange(4)})
df1
Out[208]:
datetime sample
0 2017-01-01 00:00:00 100
1 2017-01-01 00:00:01 101
2 2017-01-01 00:00:02 102
3 2017-01-01 00:00:03 103
df2
Out[209]:
datetime sample
0 2017-01-01 00:00:00.000 0
1 2017-01-01 00:00:00.300 1
2 2017-01-01 00:00:00.600 2
3 2017-01-01 00:00:00.900 3
pd.merge_asof( df1, df2, on='datetime', tolerance=pd.Timedelta('1s') )
Out[210]:
datetime sample_x sample_y
0 2017-01-01 00:00:00 100 0.0
1 2017-01-01 00:00:01 101 3.0
2 2017-01-01 00:00:02 102 NaN
3 2017-01-01 00:00:03 103 NaN
Note that merge_asof does a left join, so you can get a different answer by changing the order of df1 & df2:
pd.merge_asof( df2, df1, on='datetime', tolerance=pd.Timedelta('1s') )
Out[218]:
datetime sample_x sample_y
0 2017-01-01 00:00:00.000 0 100
1 2017-01-01 00:00:00.300 1 100
2 2017-01-01 00:00:00.600 2 100
3 2017-01-01 00:00:00.900 3 100
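To see what the direction option mentioned earlier changes, here is a small sketch on the same sample data with df2 on the left. The helper name match is made up for illustration; everything else is the call shown above with direction added:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'datetime': pd.date_range('1-1-2017', periods=4, freq='s'),
                    'sample': np.arange(4) + 100})
df2 = pd.DataFrame({'datetime': pd.date_range('1-1-2017', periods=4, freq='300ms'),
                    'sample': np.arange(4)})

def match(direction):
    """Return the df1 sample paired with each df2 row for a given direction."""
    merged = pd.merge_asof(df2, df1, on='datetime',
                           tolerance=pd.Timedelta('1s'), direction=direction)
    return merged['sample_y'].tolist()

backward = match('backward')  # latest df1 time at or before each df2 time
forward = match('forward')    # earliest df1 time at or after each df2 time
nearest = match('nearest')    # whichever of the two is closer

print(backward, forward, nearest)
```

With this data, 'backward' pairs every df2 row with sample 100, 'forward' pairs all but the first with 101, and 'nearest' switches from 100 to 101 once the df2 time passes the half-second mark.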
Edit to add: the docs say merge_asof does a left join by design. In the output above, the unmatched rows of the left dataframe are kept with NaN, but if you find left rows missing from your result, you could add them back with something like this:
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
pd.concat([pd.merge_asof(df1, df2, on='datetime', tolerance=pd.Timedelta('1s')),
           df1.rename(columns={'sample': 'sample_x'})]).drop_duplicates('sample_x')
Out[236]:
datetime sample_x sample_y
0 2017-01-01 00:00:00 100 0.0
1 2017-01-01 00:00:01 101 3.0
2 2017-01-01 00:00:02 102 NaN
3 2017-01-01 00:00:03 103 NaN
Note that you may need to adjust the drop_duplicates call depending on whether you have a unique index and/or unique columns.
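If the goal is to keep the unmatched rows from both frames, one possible approach is sketched below. It is not something merge_asof does on its own: the idea is to tag each right-hand row with a helper column (right_row, a made-up name), see which tags survive the merge, and concatenate the right-hand rows that were never picked up. Renaming the sample columns up front is also just a choice made here to keep the result tidy:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'datetime': pd.date_range('1-1-2017', periods=4, freq='s'),
                    'sample': np.arange(4) + 100})
df2 = pd.DataFrame({'datetime': pd.date_range('1-1-2017', periods=4, freq='300ms'),
                    'sample': np.arange(4)})

left = df1.rename(columns={'sample': 'sample_x'})
# Tag every right-hand row so matched rows can be identified after the merge.
right = (df2.rename(columns={'sample': 'sample_y'})
            .assign(right_row=np.arange(len(df2))))

merged = pd.merge_asof(left, right, on='datetime', tolerance=pd.Timedelta('1s'))

# Right-hand rows whose tag never appears in the merged result.
unmatched = right[~right['right_row'].isin(merged['right_row'].dropna())]

result = (pd.concat([merged, unmatched], ignore_index=True)
            .drop(columns='right_row')
            .sort_values('datetime', kind='stable')
            .reset_index(drop=True))
print(result)
```

On the sample data this yields six rows: the four df1 rows (two with a sample_y match, two with NaN) plus the two df2 rows at 00:00:00.300 and 00:00:00.600, which have NaN in sample_x. The ignore_index and reset_index calls also avoid repeated index labels in the combined frame.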
Upvotes: 1