Reputation: 10375
I have a table which looks something like this
import numpy as np
import pandas as pd
tmp=[["","5-9",""],["","",""],["17-","","4- -9 27-"],["-6","",""],["","","-15"]]
dat=pd.DataFrame(tmp).rename(columns={0:"V0",1:"V1",2:"V2"})
dat["Month"]=np.arange(1,6)
dat["Year"]=np.repeat(2015,5)
V0 V1 V2 Month Year
0 5-9 1 2015
1 2 2015
2 17- 4- -9 27- 3 2015
3 -6 4 2015
4 -15 5 2015
...
The numbers in the table represent the days (in the month) when a certain event happened. Note: months can have multiple events and events can span over multiple months.
V1, V2 and V3 are three different devices, each having its own separate events. So we have three different time series.
I would like to convert this table to a time series data frame, that is break it down per day for each device. Each row would be one day for one month (for one year) and each column would now only have values of 0 or 1, 0 if no event happened on that day, 1 otherwise (dummy variable). The result would contain three different time series, one for each device. How would I do that?
This is what the output would look like
V0 V1 V2 Day Month Year
0 0 0 0 1 1 2015
1 0 0 0 2 1 2015
2 0 0 0 3 1 2015
3 0 0 0 4 1 2015
4 0 0 0 5 1 2015
5 0 1 0 6 1 2015
6 0 1 0 7 1 2015
7 0 1 0 8 1 2015
8 0 1 0 9 1 2015
9 0 1 0 10 1 2015
10 0 0 0 11 1 2015
11 0 0 0 12 1 2015
12 0 0 0 13 1 2015
...
Upvotes: 2
Views: 252
Reputation: 1897
You can do this with a series of transformations as shown below. Don't know if this is the most efficient way of doing this ...
import numpy as np
import pandas as pd
tmp=[["","5-9",""],["","",""],["17-","","4- -9 27-"],["-6","",""],["","","-15"]]
df=pd.DataFrame(tmp).rename(columns={0:"V0",1:"V1",2:"V2"})
df["Month"]=np.arange(1,6)
df["Year"]=np.repeat(2015,5)
print(df)
V0 V1 V2 Month Year
0 5-9 1 2015
1 2 2015
2 17- 4- -9 27- 3 2015
3 -6 4 2015
4 -15 5 2015
1. Stack Only Non-Empty Values
days = df.set_index(['Year', 'Month']).stack().replace('', np.nan).dropna()
print(days)
Year Month
2015 1 V1 5-9
3 V0 17-
V2 4- -9 27-
4 V0 -6
5 V2 -15
dtype: object
2. Expand Date Ranges
String such as "5-9"
need to be converted to an array with length 31, with values between indices 5 - 9 set to 1 and rest to 0. And similarly, for the other rows. This is a string parsing problem left as an exercise :-). In my example below, I am hard coding the solution based on the values in the question.
def _fill(arr, start, stop):
arr[np.arange(start-1, stop)] = 1
return arr
def expand_days(df_in):
df_out = df_in.copy()
days_all = np.zeros(31)
df_out.loc[2015, 1, 'V1'] = _fill(days_all.copy(), 5, 9)
df_out.loc[2015, 3, 'V0'] = _fill(days_all.copy(), 17, 31)
df_out.loc[2015, 3, 'V2'] = _fill(_fill(days_all.copy(), 4, 9), 27, 31)
df_out.loc[2015, 4, 'V0'] = _fill(days_all.copy(), 1, 6)
df_out.loc[2015, 5, 'V2'] = _fill(days_all.copy(), 1, 15)
return df_out
days_ex = expand_days(days)
print(days_ex)
Year Month
2015 1 V1 [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...
3 V0 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
V2 [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...
4 V0 [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, ...
5 V2 [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, ...
dtype: object
3. Convert an Array to a series of Columns
days_fr = days_ex.apply(lambda x: pd.Series(x, index=np.arange(1, 32)))
print(days_fr)
1 2 3 4 5 6 7 8 9 10 ... 22 \
Year Month ...
2015 1 V1 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 0.0 ... 0.0
3 V0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 1.0
V2 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 ... 0.0
4 V0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 ... 0.0
5 V2 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... 0.0
23 24 25 26 27 28 29 30 31
Year Month
2015 1 V1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 V0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
V2 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0
4 V0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 V2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
[5 rows x 31 columns]
4. Set Correct Index Names And Stack
days_unstacked = days_fr.stack()
days_unstacked.index.set_names(['Year', 'Month', 'Devices', 'Days'], inplace=True)
print(days_unstacked.head())
Year Month Devices Days
2015 1 V1 1 0.0
2 0.0
3 0.0
4 0.0
5 1.0
dtype: float64
5. Unstack And Fill NA's With Zeros
days_stacked = days_unstacked.unstack('Devices').fillna(0).reset_index()
print(days_stacked.head(10))
Devices Year Month Days V0 V1 V2
0 2015 1 1 0.0 0.0 0.0
1 2015 1 2 0.0 0.0 0.0
2 2015 1 3 0.0 0.0 0.0
3 2015 1 4 0.0 0.0 0.0
4 2015 1 5 0.0 1.0 0.0
5 2015 1 6 0.0 1.0 0.0
6 2015 1 7 0.0 1.0 0.0
7 2015 1 8 0.0 1.0 0.0
8 2015 1 9 0.0 1.0 0.0
9 2015 1 10 0.0 0.0 0.0
The index name of the resulting frame is set to Devices
which is an artifact of how we setup the problem. It will need to be changed to something else.
Upvotes: 1