Reputation: 641
I would like to filter and split my original dataframe into a number of dataframes using the condition that progressPercentage goes from 1.0 to 100 as in the following example:
Input:
id_B, ts_B,course,weight,Phase,remainingTime,progressPercentage
id1,2017-04-27 01:35:30,cotton,3.5,A,01:15:00,23.0
id1,2017-04-27 01:37:30,cotton,3.5,B,01:13:00,24.0
id1,2017-04-27 01:38:00,cotton,3.5,B,01:13:00,24.0
id1,2017-04-27 01:38:30,cotton,3.5,C,01:13:00,24.0
id1,2017-04-27 01:39:00,cotton,3.5,C,00:02:00,99.0
id1,2017-04-27 01:39:30,cotton,3.5,C,00:01:00,100.0
id1,2017-04-27 01:40:00,cotton,3.5,Finish,00:01:00,100.0
id1,2017-04-27 02:35:30,cotton,3.5,A,03:15:00,1.0
id1,2017-04-27 02:36:00,cotton,3.5,A,03:14:00,2.0
id1,2017-04-27 02:36:30,cotton,3.5,A,03:14:00,2.0
id1,2017-04-27 02:37:00,cotton,3.5,B,03:13:00,3.0
id1,2017-04-27 02:37:30,cotton,3.5,B,03:13:00,4.0
id1,2017-04-27 02:38:00,cotton,3.5,B,03:13:00,5.0
id1,2017-04-27 02:38:30,cotton,3.5,C,03:13:00,98.0
id1,2017-04-27 02:39:00,cotton,3.5,C,00:02:00,99.0
id1,2017-04-27 02:39:30,cotton,3.5,C,00:01:00,100.0
id1,2017-04-27 02:40:00,cotton,3.5,Finish,00:01:00,100.0
id2,2017-04-27 03:36:00,cotton,3.5,A,03:15:00,1.0
id2,2017-04-27 03:36:30,cotton,3.5,A,03:14:00,1.0
id2,2017-04-27 03:37:00,cotton,3.5,B,03:13:00,2.0
id2,2017-04-27 03:37:30,cotton,3.5,B,03:13:00,2.0
id2,2017-04-27 03:38:00,cotton,3.5,B,03:13:00,3.0
id2,2017-04-27 03:38:30,cotton,3.5,C,03:13:00,98.0
id2,2017-04-27 03:39:00,cotton,3.5,C,00:02:00,99.0
id2,2017-04-27 03:39:30,cotton,3.5,C,00:01:00,100.0
id2,2017-04-27 03:40:00,cotton,3.5,Finish,00:01:00,100.0
id1,2017-05-27 01:35:30,cotton,3.5,A,03:15:00,23.0
id1,2017-05-27 01:37:30,cotton,3.5,B,03:13:00,24.0
id1,2017-05-27 01:38:00,cotton,3.5,B,03:13:00,24.0
id1,2017-05-27 01:38:30,cotton,3.5,C,03:13:00,24.0
id1,2017-05-27 01:39:00,cotton,3.5,C,00:02:00,99.0
id1,2017-05-27 01:39:30,cotton,3.5,C,00:01:00,100.0
id1,2017-05-27 01:40:00,cotton,3.5,Finish,00:01:00,100.0
id1,2017-05-27 02:35:30,cotton,3.5,A,01:15:00,1.0
id1,2017-05-27 02:36:00,cotton,3.5,A,01:14:00,2.0
id1,2017-05-27 02:36:30,cotton,3.5,A,01:13:00,2.0
id1,2017-05-27 02:37:00,cotton,3.5,B,01:12:00,3.0
id1,2017-05-27 02:37:30,cotton,3.5,B,01:11:00,4.0
id1,2017-05-27 02:38:00,cotton,3.5,B,01:10:00,5.0
id1,2017-05-27 02:38:30,cotton,3.5,C,01:09:00,98.0
id1,2017-05-27 02:39:00,cotton,3.5,C,00:08:00,99.0
Outputs:
id_B, ts_B,course,weight,Phase,remainingTime,progressPercentage
id1,2017-04-27 01:35:30,cotton,3.5,A,01:15:00,23.0
id1,2017-04-27 01:37:30,cotton,3.5,B,01:13:00,24.0
id1,2017-04-27 01:38:00,cotton,3.5,B,01:13:00,24.0
id1,2017-04-27 01:38:30,cotton,3.5,C,01:13:00,24.0
id1,2017-04-27 01:39:00,cotton,3.5,C,00:02:00,99.0
id1,2017-04-27 01:39:30,cotton,3.5,C,00:01:00,100.0
id1,2017-04-27 01:40:00,cotton,3.5,Finish,00:01:00,100.0
id_B, ts_B,course,weight,Phase,remainingTime,progressPercentage
id1,2017-04-27 02:35:30,cotton,3.5,A,03:15:00,1.0
id1,2017-04-27 02:36:00,cotton,3.5,A,03:14:00,2.0
id1,2017-04-27 02:36:30,cotton,3.5,A,03:14:00,2.0
id1,2017-04-27 02:37:00,cotton,3.5,B,03:13:00,3.0
id1,2017-04-27 02:37:30,cotton,3.5,B,03:13:00,4.0
id1,2017-04-27 02:38:00,cotton,3.5,B,03:13:00,5.0
id1,2017-04-27 02:38:30,cotton,3.5,C,03:13:00,98.0
id1,2017-04-27 02:39:00,cotton,3.5,C,00:02:00,99.0
id1,2017-04-27 02:39:30,cotton,3.5,C,00:01:00,100.0
id1,2017-04-27 02:40:00,cotton,3.5,Finish,00:01:00,100.0
id_B, ts_B,course,weight,Phase,remainingTime,progressPercentage
id2,2017-04-27 03:36:00,cotton,3.5,A,03:15:00,1.0
id2,2017-04-27 03:36:30,cotton,3.5,A,03:14:00,1.0
id2,2017-04-27 03:37:00,cotton,3.5,B,03:13:00,2.0
id2,2017-04-27 03:37:30,cotton,3.5,B,03:13:00,2.0
id2,2017-04-27 03:38:00,cotton,3.5,B,03:13:00,3.0
id2,2017-04-27 03:38:30,cotton,3.5,C,03:13:00,98.0
id2,2017-04-27 03:39:00,cotton,3.5,C,00:02:00,99.0
id2,2017-04-27 03:39:30,cotton,3.5,C,00:01:00,100.0
id2,2017-04-27 03:40:00,cotton,3.5,Finish,00:01:00,100.0
id_B, ts_B,course,weight,Phase,remainingTime,progressPercentage
id1,2017-05-27 01:35:30,cotton,3.5,A,03:15:00,1.0
id1,2017-05-27 01:37:30,cotton,3.5,B,03:13:00,2.0
id1,2017-05-27 01:38:00,cotton,3.5,B,03:13:00,3.0
id1,2017-05-27 01:38:30,cotton,3.5,C,03:13:00,4.0
id1,2017-05-27 01:39:00,cotton,3.5,C,00:02:00,99.0
id1,2017-05-27 01:39:30,cotton,3.5,C,00:01:00,100.0
id1,2017-05-27 01:40:00,cotton,3.5,Finish,00:01:00,100.0
id_B, ts_B,course,weight,Phase,remainingTime,progressPercentage
id1,2017-05-27 02:35:30,cotton,3.5,A,01:15:00,1.0
id1,2017-05-27 02:36:00,cotton,3.5,A,01:14:00,2.0
id1,2017-05-27 02:36:30,cotton,3.5,A,01:13:00,2.0
id1,2017-05-27 02:37:00,cotton,3.5,B,01:12:00,3.0
id1,2017-05-27 02:37:30,cotton,3.5,B,01:11:00,4.0
id1,2017-05-27 02:38:00,cotton,3.5,B,01:10:00,5.0
id1,2017-05-27 02:38:30,cotton,3.5,C,01:09:00,98.0
id1,2017-05-27 02:39:00,cotton,3.5,C,00:08:00,99.0
id1,2017-05-27 02:39:00,cotton,3.5,C,00:01:00,100.0
I have been using the .shift() and groupby as in the following:
a = dfb['Operation.progressPercentage'].shift().eq(100)
grouping = dfb.groupby([dfb.wm_id,a])
but it did not provide the expected results. Please, any help on how I should change the code to get it done?
Many Thanks in advance. Best Regards, Carlo
Upvotes: 1
Views: 131
Reputation: 862471
If Finish
value sometimes missing and need use only progressPercentage
column use:
shifted = df['progressPercentage'].shift()
#compare difference for second 100 if together 100 (e.g. 15, 16 row)
m = shifted.diff(-1).ne(0) & shifted.eq(100)
a = m.cumsum()
aa = df.groupby([df.id_B,a])
for k, gp in aa:
print('key=' + str(k))
print(gp)
print('A NEW ONE...')
key=('id1', 0)
id_B ts_B course weight Phase remainingTime \
0 id1 2017-04-27 01:35:30 cotton 3.5 A 01:15:00
1 id1 2017-04-27 01:37:30 cotton 3.5 B 01:13:00
2 id1 2017-04-27 01:38:00 cotton 3.5 B 01:13:00
3 id1 2017-04-27 01:38:30 cotton 3.5 C 01:13:00
4 id1 2017-04-27 01:39:00 cotton 3.5 C 00:02:00
5 id1 2017-04-27 01:39:30 cotton 3.5 C 00:01:00
6 id1 2017-04-27 01:40:00 cotton 3.5 Finish 00:01:00
progressPercentage
0 23.0
1 24.0
2 24.0
3 24.0
4 99.0
5 100.0
6 100.0
A NEW ONE...
key=('id1', 1)
id_B ts_B course weight Phase remainingTime \
7 id1 2017-04-27 02:35:30 cotton 3.5 A 03:15:00
8 id1 2017-04-27 02:36:00 cotton 3.5 A 03:14:00
9 id1 2017-04-27 02:36:30 cotton 3.5 A 03:14:00
10 id1 2017-04-27 02:37:00 cotton 3.5 B 03:13:00
11 id1 2017-04-27 02:37:30 cotton 3.5 B 03:13:00
12 id1 2017-04-27 02:38:00 cotton 3.5 B 03:13:00
13 id1 2017-04-27 02:38:30 cotton 3.5 C 03:13:00
14 id1 2017-04-27 02:39:00 cotton 3.5 C 00:02:00
15 id1 2017-04-27 02:39:30 cotton 3.5 C 00:01:00
16 id1 2017-04-27 02:40:00 cotton 3.5 Finish 00:01:00
progressPercentage
7 1.0
8 2.0
9 2.0
10 3.0
11 4.0
12 5.0
13 98.0
14 99.0
15 100.0
16 100.0
A NEW ONE...
key=('id2', 2)
...
Upvotes: 2
Reputation: 30605
You can divide the dataframe by progressPercentage which is equal 100. Remove the earlier index if they are consecutive.Then slice and append the dataframe to an array. Hope this helps
import numpy as np
df = pd.read_csv('input.csv',delimiter=',') # The input csv provided
df1 = df[(df["progressPercentage "]==100)]
x = (np.array(df1.index) + 1).tolist()
x.insert(0,0)
#Remove the consecutive elements so that they can be treated under one dataframe.
x = [ begin for begin, end in zip(x, x[1:]) if (begin != end-1)]
x.insert(len(x),df.shape[0])
frames = [df.iloc[begin:end] for begin, end in zip(x, x[1:])]
You can print the dataframes using a for loop, i.e
for df in frames:
print(df)
Output of the dataframes:
id_B ts_B course weight Phase remainingTime \ 0 id1 2017-04-27 01:35:30 cotton 3.5 A 01:15:00 1 id1 2017-04-27 01:37:30 cotton 3.5 B 01:13:00 2 id1 2017-04-27 01:38:00 cotton 3.5 B 01:13:00 3 id1 2017-04-27 01:38:30 cotton 3.5 C 01:13:00 4 id1 2017-04-27 01:39:00 cotton 3.5 C 00:02:00 5 id1 2017-04-27 01:39:30 cotton 3.5 C 00:01:00 6 id1 2017-04-27 01:40:00 cotton 3.5 Finish 00:01:00 progressPercentage 0 23.0 1 24.0 2 24.0 3 24.0 4 99.0 5 100.0 6 100.0 id_B ts_B course weight Phase remainingTime \ 7 id1 2017-04-27 02:35:30 cotton 3.5 A 03:15:00 8 id1 2017-04-27 02:36:00 cotton 3.5 A 03:14:00 9 id1 2017-04-27 02:36:30 cotton 3.5 A 03:14:00 10 id1 2017-04-27 02:37:00 cotton 3.5 B 03:13:00 11 id1 2017-04-27 02:37:30 cotton 3.5 B 03:13:00 12 id1 2017-04-27 02:38:00 cotton 3.5 B 03:13:00 13 id1 2017-04-27 02:38:30 cotton 3.5 C 03:13:00 14 id1 2017-04-27 02:39:00 cotton 3.5 C 00:02:00 15 id1 2017-04-27 02:39:30 cotton 3.5 C 00:01:00 16 id1 2017-04-27 02:40:00 cotton 3.5 Finish 00:01:00 progressPercentage 7 1.0 8 2.0 9 2.0 10 3.0 11 4.0 12 5.0 13 98.0 14 99.0 15 100.0 16 100.0 id_B ts_B course weight Phase remainingTime \ 17 id2 2017-04-27 03:36:00 cotton 3.5 A 03:15:00 18 id2 2017-04-27 03:36:30 cotton 3.5 A 03:14:00 19 id2 2017-04-27 03:37:00 cotton 3.5 B 03:13:00 20 id2 2017-04-27 03:37:30 cotton 3.5 B 03:13:00 21 id2 2017-04-27 03:38:00 cotton 3.5 B 03:13:00 22 id2 2017-04-27 03:38:30 cotton 3.5 C 03:13:00 23 id2 2017-04-27 03:39:00 cotton 3.5 C 00:02:00 24 id2 2017-04-27 03:39:30 cotton 3.5 C 00:01:00 25 id2 2017-04-27 03:40:00 cotton 3.5 Finish 00:01:00 progressPercentage 17 1.0 18 1.0 19 2.0 20 2.0 21 3.0 22 98.0 23 99.0 24 100.0 25 100.0 id_B ts_B course weight Phase remainingTime \ 26 id1 2017-05-27 01:35:30 cotton 3.5 A 03:15:00 27 id1 2017-05-27 01:37:30 cotton 3.5 B 03:13:00 28 id1 2017-05-27 01:38:00 cotton 3.5 B 03:13:00 29 id1 2017-05-27 01:38:30 cotton 3.5 C 03:13:00 30 id1 2017-05-27 01:39:00 cotton 3.5 C 00:02:00 31 id1 2017-05-27 01:39:30 cotton 3.5 C 00:01:00 32 id1 2017-05-27 01:40:00 cotton 3.5 Finish 00:01:00 progressPercentage 26 23.0 27 24.0 28 24.0 29 24.0 30 99.0 31 100.0 32 100.0 id_B ts_B course weight Phase remainingTime \ 33 id1 2017-05-27 02:35:30 cotton 3.5 A 01:15:00 34 id1 2017-05-27 02:36:00 cotton 3.5 A 01:14:00 35 id1 2017-05-27 02:36:30 cotton 3.5 A 01:13:00 36 id1 2017-05-27 02:37:00 cotton 3.5 B 01:12:00 37 id1 2017-05-27 02:37:30 cotton 3.5 B 01:11:00 38 id1 2017-05-27 02:38:00 cotton 3.5 B 01:10:00 39 id1 2017-05-27 02:38:30 cotton 3.5 C 01:09:00 40 id1 2017-05-27 02:39:00 cotton 3.5 C 00:08:00 41 id1 2017-05-27 02:39:00 cotton 3.5 C 00:08:00 progressPercentage 33 1.0 34 2.0 35 2.0 36 3.0 37 4.0 38 5.0 39 98.0 40 99.0 41 100.0 id_B ts_B course weight Phase remainingTime \ 42 id2 2017-04-27 03:36:00 cotton 3.5 A 03:15:00 43 id2 2017-04-27 03:36:30 cotton 3.5 A 03:14:00 44 id2 2017-04-27 03:37:00 cotton 3.5 B 03:13:00 45 id2 2017-04-27 03:37:30 cotton 3.5 B 03:13:00 46 id2 2017-04-27 03:38:00 cotton 3.5 B 03:13:00 47 id2 2017-04-27 03:38:30 cotton 3.5 C 03:13:00 48 id2 2017-04-27 03:39:00 cotton 3.5 C 00:02:00 49 id2 2017-04-27 03:39:30 cotton 3.5 C 00:01:00 50 id2 2017-04-27 03:40:00 cotton 3.5 Finish 00:01:00 progressPercentage 42 1.0 43 1.0 44 2.0 45 2.0 46 3.0 47 98.0 48 99.0 49 100.0 50 100.0
Upvotes: 2
Reputation: 641
the best way I found is the following:
a = dfb['progressPercentage'].shift().eq(100).cumsum()
df_output = dfb.groupby([dfb.id_B,a])
for k, gp in aa:
print('key=' + str(k))
print(gp.sort_values(['eventTime', 'wm_id'], ascending=[1, 0]).to_string())
print('A NEW ONE...')
Upvotes: 0