Reputation: 115
I am receiving data which consists of a 'StartTime' and a 'Duration' of time active. This is hard to work with when I need to do calculations on a specified time range over multiple days. I would like to break this data down to minutely data to make future calculations easier. Please see the example to get a better understanding.
Data which I currently have:
import pandas as pd
data = {'StartTime':['2018-12-30 12:45:00+11:00','2018-12-31 16:48:00+11:00','2019-01-01 04:36:00+11:00','2019-01-01 19:27:00+11:00','2019-01-02 05:13:00+11:00'],
'Duration':[1,1,3,1,2],
'Site':['1','2','3','4','5']
}
df = pd.DataFrame(data)
# The strings already carry a +11:00 offset, so the parsed column is tz-aware; convert it rather than localize it.
df['StartTime'] = pd.to_datetime(df['StartTime']).dt.tz_convert('Australia/Melbourne')
What I would like to have:
data_expected = {'Time':['2018-12-30 12:45:00+11:00','2018-12-31 16:48:00+11:00','2019-01-01 04:36:00+11:00','2019-01-01 04:37:00+11:00','2019-01-01 19:27:00+11:00','2019-01-02 05:13:00+11:00','2019-01-02 05:14:00+11:00'],
'Duration':[1,1,1,1,1,1,1],
'Site':['1','2','3','3','4','5','5']
}
df_expected = pd.DataFrame(data_expected)
# Same as above: the parsed column is already tz-aware, so only tz_convert is needed.
df_expected['Time'] = pd.to_datetime(df_expected['Time']).dt.tz_convert('Australia/Melbourne')
I would like to see if anyone has a good solution for this problem. Effectively, I need each row with Duration > 1 to be duplicated, with the time advanced by one minute for each minute beyond the first. Is there a way to do this without creating a whole new dataframe?
******** EDIT ********
In response to @DavidErickson's answer. Putting this here because I can't put images in comments. I ran into a bit of trouble. df1 is a subset of the original dataframe. df2 is df1 after applying the code provided. You can see that the time added at index 635 is incorrect.
Upvotes: 1
Views: 888
Reputation: 16683
Use df.index.repeat according to the Duration column to add the relevant number of rows. Then use .groupby and cumcount to number the repeated rows, and add that number of minutes on top of the base time.
input:
data = {'StartTime':['2018-12-30 12:45:00+11:00','2018-12-31 16:48:00+11:00','2019-01-01 04:36:00+11:00','2019-01-01 19:27:00+11:00','2019-01-02 05:13:00+11:00'],
'Duration':[1,1,3,1,2],
'Site':['1','2','3','4','5']
}
df = pd.DataFrame(data)
df['StartTime'] = pd.to_datetime(df['StartTime'])
code:
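# Repeat each row Duration times; the copies keep their original index label.
df = df.loc[df.index.repeat(df['Duration'])]
# Number the copies of each original row 0, 1, 2, ... Grouping on the index means Site does not have to be unique.
mask = df.groupby(level=0).cumcount()
df['StartTime'] = df['StartTime'] + pd.to_timedelta(mask, unit='m')
# Sort chronologically and set every row to a one-minute duration.
df = df.sort_values('StartTime').assign(Duration=1)
df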
output:
StartTime Duration Site
0 2018-12-30 12:45:00+11:00 1 1
1 2018-12-31 16:48:00+11:00 1 2
2 2019-01-01 04:36:00+11:00 1 3
2 2019-01-01 04:37:00+11:00 1 3
2 2019-01-01 04:38:00+11:00 1 3
3 2019-01-01 19:27:00+11:00 1 4
4 2019-01-02 05:13:00+11:00 1 5
4 2019-01-02 05:14:00+11:00 1 5
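One caveat with numbering the repeated rows per Site: if the same site appears on more than one row, cumcount keeps counting across those rows and the later ones are shifted too far. Here is a minimal sketch that groups on the row index instead, assuming the original index is unique. The two-row sample is made up for illustration and is not the asker's data.

```python
import pandas as pd

# Hypothetical sample: Site "3" appears on two separate rows.
df = pd.DataFrame({
    "StartTime": pd.to_datetime([
        "2019-01-01 04:36:00+11:00",
        "2019-01-01 19:27:00+11:00",
    ]),
    "Duration": [3, 2],
    "Site": ["3", "3"],
})

# Repeat each row Duration times; repeated rows keep their original index label.
expanded = df.loc[df.index.repeat(df["Duration"])]

# Number the copies of each original row 0, 1, 2, ... by grouping on the index.
# Assumes the original index is unique, as with a default RangeIndex.
minute_offset = expanded.groupby(level=0).cumcount()

expanded = expanded.assign(
    StartTime=expanded["StartTime"] + pd.to_timedelta(minute_offset, unit="min"),
    Duration=1,
).reset_index(drop=True)

print(expanded)
```

With groupby('Site') the second block of rows would continue counting from 3, so the 19:27 entry would be moved forward by several minutes instead of staying at its own start time.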
If you run into memory issues, you can also try dask. I have taken @jlandercy's pandas answer and changed it to dask syntax, as I'm not sure whether the pandas operation index.repeat would work with dask. The dask documentation covers the functions and operations used in the code, and I would read up on them first: https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.read_sql_table
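import dask.dataframe as dd
import pandas as pd
# Read as a dask dataframe from CSV, SQL or another source
df = dd.read_csv(files, parse_dates=['StartTime']) #df = dd.read_sql_table(table, uri, index_col='StartTime')
# meta tells dask the name and dtype of the result so it does not have to guess
df['offset'] = df['Duration'].apply(lambda x: list(range(x)), meta=('offset', 'object'))
# explode is a method of the dataframe itself
df = df.explode('offset')
df['offset'] = df['offset'].apply(lambda x: pd.Timedelta(x, unit='min'), meta=('offset', 'timedelta64[ns]'))
df['StartTime'] += df['offset']
df["Duration"] = 1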
Upvotes: 0
Reputation: 11002
I think you might want to address the use case where Duration > 2 as well.
For the modified given input:
data = {'StartTime':['2018-12-30 12:45:00+11:00','2018-12-31 16:48:00+11:00','2019-01-01 04:36:00+11:00','2019-01-01 19:27:00+11:00','2019-01-02 05:13:00+11:00'],
'Duration':[1,1,3,1,2],
'Site':['1','2','3','4','5']
}
df = pd.DataFrame(data)
df['StartTime'] = pd.to_datetime(df['StartTime'])
This code should do the trick:
df['offset'] = df['Duration'].apply(lambda x: list(range(x)))
df = df.explode('offset')
df['offset'] = df['offset'].apply(lambda x: pd.Timedelta(x, unit='min'))
df['StartTime'] += df['offset']
df["Duration"] = 1
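Basically, it works as follows:
- create a list of consecutive integer offsets from each Duration value;
- duplicate the rows (see explode) with consecutive integer offset;
- convert the integer offset into a timedelta offset;
- perform datetime arithmetic and reset the Duration field.
The result looks like this: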
StartTime Duration Site offset
0 2018-12-30 12:45:00+11:00 1 1 00:00:00
1 2018-12-31 16:48:00+11:00 1 2 00:00:00
2 2019-01-01 04:36:00+11:00 1 3 00:00:00
2 2019-01-01 04:37:00+11:00 1 3 00:01:00
2 2019-01-01 04:38:00+11:00 1 3 00:02:00
3 2019-01-01 19:27:00+11:00 1 4 00:00:00
4 2019-01-02 05:13:00+11:00 1 5 00:00:00
4 2019-01-02 05:14:00+11:00 1 5 00:01:00
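The same steps can also be written without the per-element apply calls. This is a sketch under a few assumptions: every Duration is at least 1, the pandas version supports the ignore_index argument of explode (1.1 or later), and the shortened three-row input is only there for illustration. It also drops the helper offset column at the end.

```python
import pandas as pd

# Shortened illustrative input, one row per site.
df = pd.DataFrame({
    "StartTime": pd.to_datetime([
        "2018-12-30 12:45:00+11:00",
        "2019-01-01 04:36:00+11:00",
        "2019-01-02 05:13:00+11:00",
    ]),
    "Duration": [1, 3, 2],
    "Site": ["1", "3", "5"],
})

# One list of minute offsets per row, e.g. Duration 3 -> [0, 1, 2].
df["offset"] = [list(range(n)) for n in df["Duration"]]

# explode leaves an object column, so cast to int before converting to timedelta.
df = df.explode("offset", ignore_index=True)
offsets = pd.to_timedelta(df["offset"].astype("int64"), unit="min")

df["StartTime"] = df["StartTime"] + offsets
df["Duration"] = 1
df = df.drop(columns="offset")

print(df)
```

Using ignore_index=True gives the result a fresh 0..n-1 index, which avoids the repeated index labels seen in the output above.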
Upvotes: 3