LaGabriella
LaGabriella

Reputation: 51

Get start_date and end_date for a given pattern in a dataframe

I have a dataframe with a column "signal" that takes 1 if there is a signal and 0 if it is not the case. I want to be able to summarize each event: one series of 1 is ONE event and i want to get the first timestamp at which it appeared and the last timestamp. Example below:

Input:

timestamp       col_1       col_2        signal
16:00:01.100     1000        2000        0
16:00:01.999     2500        2000        1
16:01:00.000     2000        2000        1
16:02:00.000     2000        3000        1
16:03:00.000     2000        2000        0
16:04:00.000     2000        1000        1
16:08:00.000     3000        3000        0
16:09:00.000     2000        2000        1
16:10:00.000     3000        3000        1
16:11:00.000     2000        2000        0

Expected output

start_time       end_time        signal_duration         
16:00:01.999     16:03:00.000    (end_time - start_time)
16:04:00.000     16:08:00.000
16:09:00.000     16:11:00.000

This way, I am able to calculate the duration of each signal individually in an additional column. A signal start time is when a serie of 1 starts, A signal end time is when a 0 follows. As you can see above, it s possible that there is only one row for a signal.

What I have tried: when I generate the data, I gave an id to each signal and then groupby. I don't like that kind of approach adding an additional column and the data is already heavy.

Thanks !

Upvotes: 1

Views: 173

Answers (1)

Code Different
Code Different

Reputation: 93141

Basically a gap-and-island problem: a signal that is 1 and not the same as the previous signal starts a new island. The expected output are the endpoints of the island and the overall duration of the island:

# Convert the `timestamp` column to type Timedetla or Timestamp
df["timestamp"] = pd.to_timedelta(df["timestamp"])
df["timestamp"] = pd.to_datetime(df["timestamp"])

# The calculation
s = df["signal"]
result = (
    df.assign(island=(s.eq(1) & s.ne(s.shift())).cumsum())
    .query("signal != 0")
    .groupby("island")
    .agg(start_time=("timestamp", "min"), end_time=("timestamp", "max"))
    .assign(signal_duration=lambda x: x["end_time"] - x["start_time"])
)

What it does:

  1. When signal == 1 and not the same as the previous row, start a new island
  2. Select rows where signal != 0
  3. Group the rows into their islands
  4. Find the start and end time of each island
  5. Calculate the signal duration

Upvotes: 1

Related Questions