meronpan

Reputation: 115

Pandas Dataframe Time Duration Expand to Minute Data

I am receiving data that consists of a 'StartTime' and a 'Duration' (the number of minutes active). This is hard to work with when I need to do calculations on a specified time range over multiple days. I would like to break this data down into minute-by-minute rows to make future calculations easier. Please see the example below for a better understanding.

Data which I currently have:

import pandas as pd

data = {'StartTime':['2018-12-30 12:45:00+11:00','2018-12-31 16:48:00+11:00','2019-01-01 04:36:00+11:00','2019-01-01 19:27:00+11:00','2019-01-02 05:13:00+11:00'],
        'Duration':[1,1,2,1,2],
        'Site':['1','2','3','4','5']    
}

df = pd.DataFrame(data)
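# The strings already carry a +11:00 offset, so parse them as UTC and convert (tz_localize would fail on tz-aware data)
df['StartTime'] = pd.to_datetime(df['StartTime'], utc=True).dt.tz_convert('Australia/Melbourne')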

What I would like to have:

data_expected = {'Time':['2018-12-30 12:45:00+11:00','2018-12-31 16:48:00+11:00','2019-01-01 04:36:00+11:00','2019-01-01 04:37:00+11:00','2019-01-01 19:27:00+11:00','2019-01-02 05:13:00+11:00','2019-01-02 05:14:00+11:00'],
        'Duration':[1,1,1,1,1,1,1],
        'Site':['1','2','3','3','4','5','5']    
}

df_expected = pd.DataFrame(data_expected)
df_expected['Time'] = pd.to_datetime(df_expected['Time'], utc=True).dt.tz_convert('Australia/Melbourne')

Does anyone have a good solution for this problem? Effectively, I need each row with Duration > 1 to be duplicated once for every minute beyond the first, with the time increased by 1 minute per copy and Duration set to 1. Is there a way to do this without creating a whole new dataframe?

******** EDIT ********

In response to @DavidErickson's answer (putting this here because I can't put images in comments): I ran into a bit of trouble. df1 is a subset of the original dataframe, and df2 is df1 after applying the code provided. You can see that the time added at index 635 is incorrect. [Image: Comparison dataframes]

Upvotes: 1

Views: 888

Answers (2)

David Erickson

Reputation: 16683

Use df.index.repeat with the Duration column to repeat each row once per minute of its duration. Then use .groupby and cumcount to build a running counter (0, 1, 2, ...) for the repeated rows, called mask in the code, and add it as minutes to the base time.

input:

import pandas as pd

data = {'StartTime':['2018-12-30 12:45:00+11:00','2018-12-31 16:48:00+11:00','2019-01-01 04:36:00+11:00','2019-01-01 19:27:00+11:00','2019-01-02 05:13:00+11:00'],
        'Duration':[1,1,2,1,2],
        'Site':['1','2','3','4','5']    
}
df = pd.DataFrame(data)
df['StartTime'] = pd.to_datetime(df['StartTime'])

code:

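# Repeat each row once per minute of its Duration
df = df.loc[df.index.repeat(df['Duration'])]
# 0, 1, 2, ... within each Site; this assumes a Site is not repeated across original rows
mask = df.groupby('Site').cumcount()
df['StartTime'] = df['StartTime'] + pd.to_timedelta(mask, unit='min')
# Every expanded row now represents a single minute (DataFrame.append was removed in pandas 2.0 and is not needed here)
df = df.assign(Duration=1).sort_values('StartTime')
df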

output:

    StartTime                   Duration    Site
0   2018-12-30 12:45:00+11:00   1           1
1   2018-12-31 16:48:00+11:00   1           2
2   2019-01-01 04:36:00+11:00   1           3
2   2019-01-01 04:37:00+11:00   1           3
3   2019-01-01 19:27:00+11:00   1           4
4   2019-01-02 05:13:00+11:00   1           5
4   2019-01-02 05:14:00+11:00   1           5
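
A note on the counter: grouping by 'Site' only restarts the count when the Site changes, so if the real data has several rows for the same Site (an assumption, since the comparison image from the question's edit is not available here), the added minutes keep accumulating across those rows. A minimal sketch of that case, using made-up data, with the counter grouped by the original index label instead:

```python
import pandas as pd

# Hypothetical data: Site '3' appears in two separate rows.
df = pd.DataFrame({
    'StartTime': pd.to_datetime(['2019-01-01 04:36:00+11:00',
                                 '2019-01-01 19:27:00+11:00']),
    'Duration': [2, 2],
    'Site': ['3', '3'],
})

# Repeat each row once per minute of its Duration (index becomes 0, 0, 1, 1).
expanded = df.loc[df.index.repeat(df['Duration'])]

# Counting per Site keeps counting across both original rows.
by_site = expanded.groupby('Site').cumcount()

# Counting per original index label restarts for every source row.
by_row = expanded.groupby(level=0).cumcount()

expanded = expanded.assign(
    StartTime=expanded['StartTime'] + pd.to_timedelta(by_row, unit='min'),
    Duration=1,
)
print(expanded)
```

This relies on the original index labels being unique; if they are not, calling reset_index(drop=True) before repeating should give each source row its own label.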

If you run into memory issues, you can also try dask. I have taken @jlandercy's pandas answer and changed it to dask syntax, as I'm not sure whether the pandas operation index.repeat works with dask. The documentation for the functions and operations is here; I would research the ones used in the code: https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.read_sql_table

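import dask.dataframe as dd
import pandas as pd

# Untested sketch. Read as a dask dataframe from CSV, SQL or another source; `files` is a placeholder.
# StartTime must end up with a datetime dtype for the addition below to work.
df = dd.read_csv(files) #df = dd.read_sql_table(table, uri, index_col='StartTime')
# dask needs `meta` to know what apply returns
df['offset'] = df['Duration'].apply(lambda x: list(range(x)), meta=('offset', 'object'))
df = df.explode('offset')  # explode is a DataFrame method, not a dask.dataframe function
df['offset'] = df['offset'].apply(lambda x: pd.Timedelta(x, unit='min'), meta=('offset', 'timedelta64[ns]'))
df['StartTime'] += df['offset']
df["Duration"] = 1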

Upvotes: 0

jlandercy

Reputation: 11002

I think you might want to handle the case where Duration > 2 as well.

For the modified given input:

import pandas as pd

data = {'StartTime':['2018-12-30 12:45:00+11:00','2018-12-31 16:48:00+11:00','2019-01-01 04:36:00+11:00','2019-01-01 19:27:00+11:00','2019-01-02 05:13:00+11:00'],
        'Duration':[1,1,3,1,2],
        'Site':['1','2','3','4','5']    
}

df = pd.DataFrame(data)
df['StartTime'] = pd.to_datetime(df['StartTime'])

This code should do the trick:

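# One integer offset per minute of Duration, e.g. 3 -> [0, 1, 2]
df['offset'] = df['Duration'].apply(lambda x: list(range(x)))
# One row per offset
df = df.explode('offset')
# Integer offset -> timedelta in minutes ('T' is a deprecated alias for 'min')
df['offset'] = df['offset'].apply(lambda x: pd.Timedelta(x, unit='min'))
df['StartTime'] += df['offset']
df["Duration"] = 1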

Basically, it works as follows:

  • create a list of integers based on the Duration value;
  • replicate each row (explode), one copy per consecutive integer offset;
  • transform the integer offset into a timedelta offset;
  • perform the datetime arithmetic and reset the Duration field to 1.

The result looks like this:

                  StartTime  Duration Site   offset
0 2018-12-30 12:45:00+11:00         1    1 00:00:00
1 2018-12-31 16:48:00+11:00         1    2 00:00:00
2 2019-01-01 04:36:00+11:00         1    3 00:00:00
2 2019-01-01 04:37:00+11:00         1    3 00:01:00
2 2019-01-01 04:38:00+11:00         1    3 00:02:00
3 2019-01-01 19:27:00+11:00         1    4 00:00:00
4 2019-01-02 05:13:00+11:00         1    5 00:00:00
4 2019-01-02 05:14:00+11:00         1    5 00:01:00
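
As a rough illustration of why the expanded form helps with the time-range calculations mentioned in the question, here is a sketch that counts active minutes per day inside a daily window. The 04:00 to 05:00 window and the two-row sample are made up for the example, and the timedelta conversion is done with pd.to_timedelta instead of apply; between_time filters on the local time of day.

```python
import pandas as pd

# Hypothetical two-row sample in the same shape as the input above.
df = pd.DataFrame({
    'StartTime': pd.to_datetime(['2019-01-01 04:36:00+11:00',
                                 '2019-01-02 05:13:00+11:00']),
    'Duration': [3, 2],
    'Site': ['3', '5'],
})

# Same expansion as above: one row per active minute.
df['offset'] = df['Duration'].apply(lambda n: list(range(n)))
df = df.explode('offset')
df['StartTime'] += pd.to_timedelta(df['offset'].astype('int64'), unit='min')
df['Duration'] = 1

# Keep only the minutes that fall inside the (made-up) daily window.
minutes = df.set_index('StartTime')['Duration']
in_window = minutes.between_time('04:00', '05:00')

# Sum the one-minute rows per calendar day.
per_day = in_window.groupby(in_window.index.date).sum()
print(per_day)
```

Since every row is exactly one minute, summing Duration is the same as counting rows; days with no activity in the window simply do not appear in the result.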

Upvotes: 3
