Matthias

Reputation: 5764

Pandas: get observations by timestamp

I have a list of dynamic values (observations). It records every value change of an entity (e.g. a display).

df
+----+---------------------+-----------------+---------+
|    | time                |   display_index | value   |
|----+---------------------+-----------------+---------|
|  0 | 2017-11-06 13:00:00 |               1 | val1    |
|  1 | 2017-11-06 14:00:00 |               1 | val2    |
|  2 | 2017-11-06 15:00:00 |               1 | val1    |
|  3 | 2017-11-06 13:30:00 |               2 | val3    |
|  4 | 2017-11-06 14:05:00 |               2 | val4    |
|  5 | 2017-11-06 15:30:00 |               2 | val1    |
+----+---------------------+-----------------+---------+

Now I have a second list of timestamps, and I'm interested in the value each display was showing at each of those times. Note that the first timestamp (13:00) for display_index 2 is before any value is even known for that display (its first record is at 13:30).

df_times
+----+---------------------+-----------------+
|    | time                |   display_index |
|----+---------------------+-----------------|
|  0 | 2017-11-06 13:20:00 |               1 |
|  1 | 2017-11-06 13:40:00 |               1 |
|  2 | 2017-11-06 13:00:00 |               2 |
|  3 | 2017-11-06 14:00:00 |               2 |
+----+---------------------+-----------------+

I tried calculating the time difference between the two timestamps and picking the observation with the smallest absolute difference:

df_merged = df_times.merge(df, on='display_index', how='outer', suffixes=['','_measured'])
df_merged['seconds'] = (df_merged.time_measured - df_merged.time).astype('timedelta64[s]')
df_merged['seconds'] = df_merged['seconds'].apply(math.fabs)
df_merged = df_merged.sort_values('seconds').groupby(['time', 'display_index'], as_index=False).first()
print(tabulate(df_merged, headers='keys', tablefmt='psql'))

+----+---------------------+-----------------+---------------------+---------+-----------+
|    | time                |   display_index | time_measured       | value   |   seconds |
|----+---------------------+-----------------+---------------------+---------+-----------|
|  0 | 2017-11-06 13:00:00 |               2 | 2017-11-06 13:30:00 | val3    |      1800 |
|  1 | 2017-11-06 13:20:00 |               1 | 2017-11-06 13:00:00 | val1    |      1200 |
|  2 | 2017-11-06 13:40:00 |               1 | 2017-11-06 14:00:00 | val2    |      1200 |
|  3 | 2017-11-06 14:00:00 |               2 | 2017-11-06 14:05:00 | val4    |       300 |
+----+---------------------+-----------------+---------------------+---------+-----------+

The problem is that the last value for each of display 1 and display 2 is wrong, since the displays were still showing a different value at those times: it should be val1 for display 1 (at 13:40) and val3 for display 2 (at 14:00). What I'm actually looking for is the observation that was last seen before the timestamp. How can I do this?
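To make the goal concrete, here is the result I would expect, written out by hand from the tables above (for the 13:00 timestamp of display 2 I assume a missing value, since nothing has been recorded for that display yet):

expected = pd.DataFrame.from_records([
    ("2017-11-06 13:00", 2, None),    # no observation for display 2 yet
    ("2017-11-06 13:20", 1, 'val1'),  # display 1 shows val1 from 13:00 until 14:00
    ("2017-11-06 13:40", 1, 'val1'),  # still val1; the next change is at 14:00
    ("2017-11-06 14:00", 2, 'val3'),  # display 2 shows val3 from 13:30 until 14:05
], columns=['time', 'display_index', 'value'])
expected['time'] = pd.to_datetime(expected['time'])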

Here's the code that I used:

import pandas as pd
from tabulate import tabulate
import math

values = [("2017-11-06 13:00", 1, 'val1'),
          ("2017-11-06 14:00", 1, 'val2'),
          ("2017-11-06 15:00", 1, 'val1'),
          ("2017-11-06 13:30", 2, 'val3'),
          ("2017-11-06 14:05", 2, 'val4'),
          ("2017-11-06 15:30", 2, 'val1'),
         ]
labels = ['time', 'display_index', 'value']
df = pd.DataFrame.from_records(values, columns=labels)
df['time'] = pd.to_datetime(df['time']) 
print(tabulate(df, headers='keys', tablefmt='psql'))

values = [("2017-11-06 13:20", 1),
          ("2017-11-06 13:40", 1),
          ("2017-11-06 13:00", 2),
          ("2017-11-06 14:00", 2),
         ]
labels = ['time', 'display_index']
df_times = pd.DataFrame.from_records(values, columns=labels)
df_times['time'] = pd.to_datetime(df_times['time']) 
print(tabulate(df_times, headers='keys', tablefmt='psql'))

# join every observation to every query timestamp of the same display,
# then keep the one with the smallest absolute time difference
df_merged = df_times.merge(df, on='display_index', how='outer', suffixes=['', '_measured'])
df_merged['seconds'] = (df_merged.time_measured - df_merged.time).astype('timedelta64[s]')
df_merged['seconds'] = df_merged['seconds'].apply(math.fabs)
df_merged = df_merged.sort_values('seconds').groupby(['time', 'display_index'], as_index=False).first()
print(tabulate(df_merged, headers='keys', tablefmt='psql'))

Upvotes: 1

Views: 855

Answers (1)

piRSquared

Reputation: 294258

This is a perfect use case for pd.merge_asof.
Note: I think you got the second row wrong.

# both dataframes need to be sorted by the 'on' key ('time')
df_times = df_times.sort_values(['time', 'display_index'])
df = df.sort_values(['time', 'display_index'])

pd.merge_asof(
    df_times, df.assign(time_measured=df.time),
    on='time', by='display_index', direction='forward'
).assign(seconds=lambda d: d.time_measured.sub(d.time).dt.total_seconds())

                 time  display_index value       time_measured  seconds
0 2017-11-06 13:00:00              2  val3 2017-11-06 13:30:00   1800.0
1 2017-11-06 13:20:00              1  val2 2017-11-06 14:00:00   2400.0
2 2017-11-06 13:40:00              1  val2 2017-11-06 14:00:00   1200.0
3 2017-11-06 14:00:00              2  val4 2017-11-06 14:05:00    300.0

Explanation

  • pd.merge_asof: for every row in the left argument, it attempts to locate a matching row in the right argument.
  • Since we passed direction='forward', it will look forward from the row in the left argument and find the next value (a sketch of the direction='backward' variant follows this list).
  • I needed a way to capture the time of the matched observation. Since merge_asof keeps only the left frame's time column, I first copied the right frame's time to a different column that I can use as intended. The use of df.assign(time_measured=df.time) just duplicates the column for later use.
  • I use assign again, this time to add a new column seconds. When using assign, you can pass an array of the same length as the dataframe, a Series whose values align on the index, or a callable that gets passed the dataframe calling assign. That last option is what I did: the lambda takes the calling dataframe, finds the difference between the two datetime columns, and converts the resulting series of timedeltas to seconds.
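For reference, a minimal sketch of the same merge with direction='backward' (the pandas default), which instead matches each row in df_times with the last observation at or before its timestamp, i.e. the behaviour the question describes. It reuses the assign-with-callable pattern from the last bullet:

# same merge, but looking back to the most recent observation at or before each time;
# seconds is flipped to measure how long ago that observation was made
pd.merge_asof(
    df_times, df.assign(time_measured=df.time),
    on='time', by='display_index', direction='backward'
).assign(seconds=lambda d: d.time.sub(d.time_measured).dt.total_seconds())

With the sample data this should give val1 for both display 1 timestamps, val3 for display 2 at 14:00, and NaN for display 2 at 13:00, where no earlier observation exists.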

Upvotes: 2
