Reputation: 63
I have a pandas dataframe that represents a shift schedule for an entire year, given as:
January 2019 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
Shift A 1 1 1 0 0 0 2 2 0 0 1 1 1 1 0 2 2 0 0 0 0 0 0 0 2 2 2 0 1 1 1
Shift B 0 2 2 0 0 0 0 0 0 0 2 2 2 0 1 1 1 0 0 0 2 2 0 0 1 1 1 1 0 2 2
Shift C 0 0 0 2 2 2 0 1 1 1 0 0 0 2 2 0 0 1 1 1 1 0 2 2 0 0 0 0 0 0 0
Shift D 2 0 0 1 1 1 1 0 2 2 0 0 0 0 0 0 0 2 2 2 0 1 1 1 0 0 0 2 2 0 0
February 2019 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 nan nan nan
Shift A 0 0 0 2 2 0 0 1 1 1 1 0 2 2 0 0 0 0 0 0 0 2 2 2 0 1 1 1 nan nan nan
Shift B 0 0 0 0 0 0 0 2 2 2 0 1 1 1 0 0 0 2 2 0 0 1 1 1 1 0 2 2 nan nan nan
Shift C 2 2 2 0 1 1 1 0 0 0 2 2 0 0 1 1 1 1 0 2 2 0 0 0 0 0 0 0 nan nan nan
Shift D 1 1 1 1 0 2 2 0 0 0 0 0 0 0 2 2 2 0 1 1 1 0 0 0 2 2 0 0 nan nan nan
Where 1 represents Day shift (06:00 - 18:00), 2 represents Night shift (18:00 - 06:00) and 0 can be ignored. Only a single shift team will be working for a given period.
I need the data in a format where the data is indexed by the DateTime stamp with the current working shift, e.g. :
DateTime Shift
0 2019-01-01 06:00:00 A
1 2019-01-01 18:00:00 D
2 2019-01-02 06:00:00 A
3 2019-01-02 18:00:00 B
4 2019-01-03 06:00:00 A
5 2019-01-03 18:00:00 B
.
.
.
What would be the most efficient Pandas method to re-index the data to achieve this, i.e. avoiding for loops?
Upvotes: 6
Views: 236
Reputation: 5471
you can do it using numpy, as below
Generate Dummy Data
df = pd.DataFrame([['January 2019', 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0], ['Shift A', 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 2.0, 2.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 2.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 2.0, 2.0, 0.0, 1.0, 1.0, 1.0], ['Shift B', 0.0, 2.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 2.0, 2.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 2.0, 2.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 2.0, 2.0], ['Shift C', 0.0, 0.0, 0.0, 2.0, 2.0, 2.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 2.0, 2.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 2.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], ['Shift D', 2.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 2.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 2.0, 2.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 2.0, 2.0, 0.0, 0.0], ['February 2019', 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, np.nan, np.nan, np.nan], ['Shift A', 0.0, 0.0, 0.0, 2.0, 2.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 2.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 2.0, 2.0, 0.0, 1.0, 1.0, 1.0, np.nan, np.nan, np.nan], ['Shift B', 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 2.0, 2.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 2.0, 2.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 2.0, 2.0, np.nan, np.nan, np.nan], ['Shift C', 2.0, 2.0, 2.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 2.0, 2.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 2.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, np.nan, np.nan, np.nan], ['Shift D', 1.0, 1.0, 1.0, 1.0, 0.0, 2.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 2.0, 2.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 2.0, 2.0, 0.0, 0.0, np.nan, np.nan, np.nan]], columns=('days', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31'))
df
Solution
import pandas as pd
import numpy as np
## some constants
TEAMS = np.array("A,B,C,D".split(","))
TEAM_COUNT=len(TEAMS)
MAP = {1: pd.Timedelta('06:00:00'), 2:pd.Timedelta('18:00:00')}
## first transpose the table and get numpy array rows except heading
arr = df.copy().T.values[1:, :].astype(np.float)
## break array in chunk for each month and create long array
arr = np.c_[[arr[:,sec+1:sec+(TEAM_COUNT+1)]
for sec in np.arange(0,arr.shape[1], (TEAM_COUNT+1))]]
## remove data with nan and flatten the array to get one row for each team
arr = arr[~np.isnan(arr)]
## get date range from start to end dates and repeate that for team size
date_index = pd.date_range("jan-1-2019", "feb-28-2019").repeat(TEAM_COUNT)
## create dataframe
df2 = pd.DataFrame(arr, index=date_index, columns=["shift"])
## set team column for each row
df2["team"] = np.tile(TEAMS, len(df2)//len(TEAMS))
## map shift to time delta
df2 = df2[df2["shift"]!=0]
df2["shift"] = df2["shift"].map(MAP)
df2
Result
shift team
2019-01-01 06:00:00 A
2019-01-01 18:00:00 D
2019-01-02 06:00:00 A
2019-01-02 18:00:00 B
2019-01-03 06:00:00 A
... ... ...
2019-02-26 18:00:00 D
2019-02-27 06:00:00 A
2019-02-27 18:00:00 B
2019-02-28 06:00:00 A
2019-02-28 18:00:00 B
Upvotes: 1
Reputation: 863361
Use:
#get first column by position
first = df.iloc[:, 0]
#convert column to datetimes with missing values for no datetimes values
dates = pd.to_datetime(first, errors='coerce')
#mask for data row
mask = dates.isna()
#forward filling missing values and replace first NaNs by first column name
df.index = dates.ffill().fillna(pd.to_datetime(first.name))
#filter out rows with datetimes in first column, add first column to index
df = df[mask.values].set_index(first.name, append=True)
#convert columns names to timedeltas in days, first is 0 days
df.columns = pd.to_timedelta(df.columns.astype(int) - 1, unit='D')
#dictionary for map 1, 2 values
mapp = {1: pd.Timedelta('06:00:00'), 2:pd.Timedelta('18:00:00')}
#remove 0 rows with convert to NaN by mask and reshape by stack
#map by dict and convert MultiIndex to columns
df = (df.mask(df == 0)
.stack()
.map(mapp)
.rename_axis(('Datetime','Shift', 'day'))
.reset_index(name='td')
)
#add days to hours and add to Datetime
df['Datetime'] += (df.pop('td') + df.pop('day'))
#sorting ans create default index
df = df.sort_values(['Datetime','Shift']).reset_index(drop=True)
print (df)
Datetime Shift
0 2019-01-01 06:00:00 Shift A
1 2019-01-01 18:00:00 Shift D
2 2019-01-02 06:00:00 Shift A
3 2019-01-02 18:00:00 Shift B
4 2019-01-03 06:00:00 Shift A
.. ... ...
113 2019-02-26 18:00:00 Shift D
114 2019-02-27 06:00:00 Shift A
115 2019-02-27 18:00:00 Shift B
116 2019-02-28 06:00:00 Shift A
117 2019-02-28 18:00:00 Shift B
[118 rows x 2 columns]
Upvotes: 1