Reputation: 123
How do I transform time interval data into time series data using Python (and pandas)?
Here's my before data frame as time intervals:
code start_dt end_dt ent_value
156600 1960-01-01 2016-04-21 H:CXP
156600 1960-01-01 2016-01-03 46927
156600 1998-08-31 2016-01-03 5516751
156600 1960-01-01 1998-08-30 4501242
For each combination of code and ent_value, we want a row in the frame for each day within that combination's start and end date (so as a time series):
code as_of_dt ent_value
156600 1960-01-01 H:CXP
156600 1960-01-02 H:CXP
156600 1960-01-03 H:CXP
156600 1960-01-01 46927
156600 1960-01-02 46927
156600 1960-01-03 46927
156600 1960-01-01 5516751
156600 1960-01-02 5516751
156600 1960-01-03 5516751
...
156600 2016-01-01 H:CXP
156600 2016-01-02 H:CXP
156600 2016-01-03 H:CXP
156600 2016-01-01 46927
156600 2016-01-02 46927
156600 2016-01-03 46927
156600 2016-01-01 5516751
156600 2016-01-02 5516751
156600 2016-01-03 5516751
How do I do this in an efficient manner?
Upvotes: 2
Views: 3661
Reputation: 19037
Say you have the following DataFrame, called df (see below for how to recreate this example):
id starttime endtime flag
0 A 2020-03-18 2020-03-20 y
1 B 2020-03-20 2020-03-23 n
2 C 2020-03-19 2020-03-21 y
Then you can create the new DataFrame by iterating over all rows and expanding each one with date_range:
new_df = pd.DataFrame(
    data=((row.id, row.flag, date)
          # iterate over rows
          for row in df.itertuples()
          # expand the range into 1-day intervals
          for date in pd.date_range(row.starttime, row.endtime, freq='1D')),
    columns=['name', 'flag', 'interval'])
You will end up with this:
name flag interval
0 A y 2020-03-18
1 A y 2020-03-19
2 A y 2020-03-20
3 B n 2020-03-20
4 B n 2020-03-21
5 B n 2020-03-22
6 B n 2020-03-23
7 C y 2020-03-19
8 C y 2020-03-20
9 C y 2020-03-21
To recreate the example DataFrame:
import pandas as pd
df = pd.DataFrame({
'id': ['A', 'B', 'C'],
'starttime': ['2020-03-18', '2020-03-20','2020-03-19' ],
'endtime': ['2020-03-20', '2020-03-23','2020-03-21'],
'flag': ['y','n','y']
})
df['starttime'] = pd.to_datetime(df['starttime'])
df['endtime'] = pd.to_datetime(df['endtime'])
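As an alternative to the generator above, the same expansion can be sketched with DataFrame.explode (available in pandas 0.25 and later). This is only a sketch under the assumption that the example frame above is used; the name `expanded` is made up for illustration, and whether it is faster than the generator is worth timing on your own data.

```python
import pandas as pd

# Same example frame as in the answer above.
df = pd.DataFrame({
    'id': ['A', 'B', 'C'],
    'starttime': pd.to_datetime(['2020-03-18', '2020-03-20', '2020-03-19']),
    'endtime': pd.to_datetime(['2020-03-20', '2020-03-23', '2020-03-21']),
    'flag': ['y', 'n', 'y'],
})

# Store one list of daily timestamps per row (both endpoints included) ...
day_lists = [list(pd.date_range(s, e, freq='D'))
             for s, e in zip(df['starttime'], df['endtime'])]

# ... then let explode() turn every list element into its own row.
expanded = (
    df.assign(interval=day_lists)
      .explode('interval')
      .drop(columns=['starttime', 'endtime'])
      .reset_index(drop=True)
)

# explode() leaves an object-dtype column, so convert it back to datetimes.
expanded['interval'] = pd.to_datetime(expanded['interval'])
print(expanded)
```

The result should hold the same ten rows as the output shown above, with the columns in the order id, flag, interval.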
Upvotes: 0
Reputation: 859
This is a possible solution.
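import pandas as pd
data = pd.read_csv('/tmp/test.tab', sep='\t')
# one (code, dates, ent_value) tuple per input row
tmp = [(e.code, pd.date_range(e.start_dt, e.end_dt, freq='1D'), e.ent_value)
       for e in data.itertuples()]
# flatten: the loop over rows must come before the loop over each row's dates
res = [(line[0], date, line[2]) for line in tmp for date in line[1]]
df = pd.DataFrame(res, columns=['code', 'as_of_dt', 'ent_value'])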
The function pd.date_range() is used to create the date ranges.
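To make the role of pd.date_range() concrete, here is a minimal sketch of what it returns for a single interval. The dates are picked only for illustration (they straddle a month boundary); note that both the start and the end date are included.

```python
import pandas as pd

# Daily range from start to end, inclusive on both sides.
days = pd.date_range('1998-08-30', '1998-09-02', freq='D')

for day in days:
    print(day.date())
```

Each element is a pandas Timestamp, so it can go straight into the tuples built in the list comprehension above.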
Upvotes: 1
Reputation: 210972
Try this:
In [17]: %paste
(df.groupby(['code','ent_value'])
.apply(lambda x: pd.DataFrame({'as_of_dt':pd.date_range(x.start_dt.min(), x.end_dt.max())}))
.reset_index()
.drop(columns='level_2')
)
## -- End pasted text --
Out[17]:
code ent_value as_of_dt
0 156600 4501242 1960-01-01
1 156600 4501242 1960-01-02
2 156600 4501242 1960-01-03
3 156600 4501242 1960-01-04
4 156600 4501242 1960-01-05
5 156600 4501242 1960-01-06
6 156600 4501242 1960-01-07
7 156600 4501242 1960-01-08
8 156600 4501242 1960-01-09
9 156600 4501242 1960-01-10
10 156600 4501242 1960-01-11
11 156600 4501242 1960-01-12
12 156600 4501242 1960-01-13
13 156600 4501242 1960-01-14
14 156600 4501242 1960-01-15
15 156600 4501242 1960-01-16
16 156600 4501242 1960-01-17
17 156600 4501242 1960-01-18
18 156600 4501242 1960-01-19
19 156600 4501242 1960-01-20
20 156600 4501242 1960-01-21
21 156600 4501242 1960-01-22
22 156600 4501242 1960-01-23
23 156600 4501242 1960-01-24
24 156600 4501242 1960-01-25
25 156600 4501242 1960-01-26
26 156600 4501242 1960-01-27
27 156600 4501242 1960-01-28
28 156600 4501242 1960-01-29
29 156600 4501242 1960-01-30
... ... ... ...
61450 156600 H:CXP 2016-03-23
61451 156600 H:CXP 2016-03-24
61452 156600 H:CXP 2016-03-25
61453 156600 H:CXP 2016-03-26
61454 156600 H:CXP 2016-03-27
61455 156600 H:CXP 2016-03-28
61456 156600 H:CXP 2016-03-29
61457 156600 H:CXP 2016-03-30
61458 156600 H:CXP 2016-03-31
61459 156600 H:CXP 2016-04-01
61460 156600 H:CXP 2016-04-02
61461 156600 H:CXP 2016-04-03
61462 156600 H:CXP 2016-04-04
61463 156600 H:CXP 2016-04-05
61464 156600 H:CXP 2016-04-06
61465 156600 H:CXP 2016-04-07
61466 156600 H:CXP 2016-04-08
61467 156600 H:CXP 2016-04-09
61468 156600 H:CXP 2016-04-10
61469 156600 H:CXP 2016-04-11
61470 156600 H:CXP 2016-04-12
61471 156600 H:CXP 2016-04-13
61472 156600 H:CXP 2016-04-14
61473 156600 H:CXP 2016-04-15
61474 156600 H:CXP 2016-04-16
61475 156600 H:CXP 2016-04-17
61476 156600 H:CXP 2016-04-18
61477 156600 H:CXP 2016-04-19
61478 156600 H:CXP 2016-04-20
61479 156600 H:CXP 2016-04-21
[61480 rows x 3 columns]
Test DF with smaller date ranges:
In [19]: df
Out[19]:
code start_dt end_dt ent_value
0 156600 1960-01-01 1960-01-04 H:CXP
1 156600 1960-01-04 1960-01-09 46927
2 156600 1998-08-31 1998-09-04 5516751
3 156600 1965-01-01 1965-01-04 4501242
In [20]: (df.groupby(['code','ent_value'])
....: .apply(lambda x: pd.DataFrame({'as_of_dt':pd.date_range(x.start_dt.min(), x.end_dt.max())}))
....: .reset_index()
....: .drop(columns='level_2')
....: )
Out[20]:
code ent_value as_of_dt
0 156600 4501242 1965-01-01
1 156600 4501242 1965-01-02
2 156600 4501242 1965-01-03
3 156600 4501242 1965-01-04
4 156600 46927 1960-01-04
5 156600 46927 1960-01-05
6 156600 46927 1960-01-06
7 156600 46927 1960-01-07
8 156600 46927 1960-01-08
9 156600 46927 1960-01-09
10 156600 5516751 1998-08-31
11 156600 5516751 1998-09-01
12 156600 5516751 1998-09-02
13 156600 5516751 1998-09-03
14 156600 5516751 1998-09-04
15 156600 H:CXP 1960-01-01
16 156600 H:CXP 1960-01-02
17 156600 H:CXP 1960-01-03
18 156600 H:CXP 1960-01-04
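One thing that may be worth checking with this approach: it takes the minimum start and the maximum end per (code, ent_value) group, so if a group ever has several disjoint intervals, the days in between are filled in as well. The sketch below uses hypothetical data (not from the question) to show the difference from expanding each row separately.

```python
import pandas as pd

# Hypothetical data: one (code, ent_value) pair with two disjoint intervals.
df = pd.DataFrame({
    'code': [156600, 156600],
    'ent_value': ['46927', '46927'],
    'start_dt': pd.to_datetime(['1960-01-01', '1960-01-10']),
    'end_dt': pd.to_datetime(['1960-01-03', '1960-01-12']),
})

# min/max over the group spans the gap: 1960-01-01 through 1960-01-12.
span = pd.date_range(df['start_dt'].min(), df['end_dt'].max())

# Expanding row by row keeps only the days that an interval actually covers.
covered = [day
           for s, e in zip(df['start_dt'], df['end_dt'])
           for day in pd.date_range(s, e)]

print(len(span), len(covered))
```

If every (code, ent_value) pair has exactly one row, as in the test frame above, both give the same days.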
Upvotes: 1