Reputation: 641

splitting a pandas Dataframe

I would like to filter and split my original dataframe into a number of dataframes using the condition that progressPercentage goes from 1.0 to 100 as in the following example:

Input:

id_B, ts_B,course,weight,Phase,remainingTime,progressPercentage
id1,2017-04-27 01:35:30,cotton,3.5,A,01:15:00,23.0
id1,2017-04-27 01:37:30,cotton,3.5,B,01:13:00,24.0
id1,2017-04-27 01:38:00,cotton,3.5,B,01:13:00,24.0
id1,2017-04-27 01:38:30,cotton,3.5,C,01:13:00,24.0
id1,2017-04-27 01:39:00,cotton,3.5,C,00:02:00,99.0
id1,2017-04-27 01:39:30,cotton,3.5,C,00:01:00,100.0
id1,2017-04-27 01:40:00,cotton,3.5,Finish,00:01:00,100.0
id1,2017-04-27 02:35:30,cotton,3.5,A,03:15:00,1.0
id1,2017-04-27 02:36:00,cotton,3.5,A,03:14:00,2.0  
id1,2017-04-27 02:36:30,cotton,3.5,A,03:14:00,2.0 
id1,2017-04-27 02:37:00,cotton,3.5,B,03:13:00,3.0
id1,2017-04-27 02:37:30,cotton,3.5,B,03:13:00,4.0
id1,2017-04-27 02:38:00,cotton,3.5,B,03:13:00,5.0
id1,2017-04-27 02:38:30,cotton,3.5,C,03:13:00,98.0
id1,2017-04-27 02:39:00,cotton,3.5,C,00:02:00,99.0
id1,2017-04-27 02:39:30,cotton,3.5,C,00:01:00,100.0
id1,2017-04-27 02:40:00,cotton,3.5,Finish,00:01:00,100.0
id2,2017-04-27 03:36:00,cotton,3.5,A,03:15:00,1.0
id2,2017-04-27 03:36:30,cotton,3.5,A,03:14:00,1.0 
id2,2017-04-27 03:37:00,cotton,3.5,B,03:13:00,2.0
id2,2017-04-27 03:37:30,cotton,3.5,B,03:13:00,2.0
id2,2017-04-27 03:38:00,cotton,3.5,B,03:13:00,3.0
id2,2017-04-27 03:38:30,cotton,3.5,C,03:13:00,98.0
id2,2017-04-27 03:39:00,cotton,3.5,C,00:02:00,99.0
id2,2017-04-27 03:39:30,cotton,3.5,C,00:01:00,100.0
id2,2017-04-27 03:40:00,cotton,3.5,Finish,00:01:00,100.0
id1,2017-05-27 01:35:30,cotton,3.5,A,03:15:00,23.0
id1,2017-05-27 01:37:30,cotton,3.5,B,03:13:00,24.0
id1,2017-05-27 01:38:00,cotton,3.5,B,03:13:00,24.0
id1,2017-05-27 01:38:30,cotton,3.5,C,03:13:00,24.0
id1,2017-05-27 01:39:00,cotton,3.5,C,00:02:00,99.0
id1,2017-05-27 01:39:30,cotton,3.5,C,00:01:00,100.0
id1,2017-05-27 01:40:00,cotton,3.5,Finish,00:01:00,100.0
id1,2017-05-27 02:35:30,cotton,3.5,A,01:15:00,1.0
id1,2017-05-27 02:36:00,cotton,3.5,A,01:14:00,2.0  
id1,2017-05-27 02:36:30,cotton,3.5,A,01:13:00,2.0 
id1,2017-05-27 02:37:00,cotton,3.5,B,01:12:00,3.0
id1,2017-05-27 02:37:30,cotton,3.5,B,01:11:00,4.0
id1,2017-05-27 02:38:00,cotton,3.5,B,01:10:00,5.0
id1,2017-05-27 02:38:30,cotton,3.5,C,01:09:00,98.0
id1,2017-05-27 02:39:00,cotton,3.5,C,00:08:00,99.0

Outputs:

id_B, ts_B,course,weight,Phase,remainingTime,progressPercentage
id1,2017-04-27 01:35:30,cotton,3.5,A,01:15:00,23.0
id1,2017-04-27 01:37:30,cotton,3.5,B,01:13:00,24.0
id1,2017-04-27 01:38:00,cotton,3.5,B,01:13:00,24.0
id1,2017-04-27 01:38:30,cotton,3.5,C,01:13:00,24.0
id1,2017-04-27 01:39:00,cotton,3.5,C,00:02:00,99.0
id1,2017-04-27 01:39:30,cotton,3.5,C,00:01:00,100.0
id1,2017-04-27 01:40:00,cotton,3.5,Finish,00:01:00,100.0

id_B, ts_B,course,weight,Phase,remainingTime,progressPercentage
id1,2017-04-27 02:35:30,cotton,3.5,A,03:15:00,1.0
id1,2017-04-27 02:36:00,cotton,3.5,A,03:14:00,2.0  
id1,2017-04-27 02:36:30,cotton,3.5,A,03:14:00,2.0 
id1,2017-04-27 02:37:00,cotton,3.5,B,03:13:00,3.0
id1,2017-04-27 02:37:30,cotton,3.5,B,03:13:00,4.0
id1,2017-04-27 02:38:00,cotton,3.5,B,03:13:00,5.0
id1,2017-04-27 02:38:30,cotton,3.5,C,03:13:00,98.0
id1,2017-04-27 02:39:00,cotton,3.5,C,00:02:00,99.0
id1,2017-04-27 02:39:30,cotton,3.5,C,00:01:00,100.0
id1,2017-04-27 02:40:00,cotton,3.5,Finish,00:01:00,100.0

id_B, ts_B,course,weight,Phase,remainingTime,progressPercentage
id2,2017-04-27 03:36:00,cotton,3.5,A,03:15:00,1.0
id2,2017-04-27 03:36:30,cotton,3.5,A,03:14:00,1.0 
id2,2017-04-27 03:37:00,cotton,3.5,B,03:13:00,2.0
id2,2017-04-27 03:37:30,cotton,3.5,B,03:13:00,2.0
id2,2017-04-27 03:38:00,cotton,3.5,B,03:13:00,3.0
id2,2017-04-27 03:38:30,cotton,3.5,C,03:13:00,98.0
id2,2017-04-27 03:39:00,cotton,3.5,C,00:02:00,99.0
id2,2017-04-27 03:39:30,cotton,3.5,C,00:01:00,100.0
id2,2017-04-27 03:40:00,cotton,3.5,Finish,00:01:00,100.0

id_B, ts_B,course,weight,Phase,remainingTime,progressPercentage
id1,2017-05-27 01:35:30,cotton,3.5,A,03:15:00,1.0
id1,2017-05-27 01:37:30,cotton,3.5,B,03:13:00,2.0
id1,2017-05-27 01:38:00,cotton,3.5,B,03:13:00,3.0
id1,2017-05-27 01:38:30,cotton,3.5,C,03:13:00,4.0
id1,2017-05-27 01:39:00,cotton,3.5,C,00:02:00,99.0
id1,2017-05-27 01:39:30,cotton,3.5,C,00:01:00,100.0
id1,2017-05-27 01:40:00,cotton,3.5,Finish,00:01:00,100.0

id_B, ts_B,course,weight,Phase,remainingTime,progressPercentage
id1,2017-05-27 02:35:30,cotton,3.5,A,01:15:00,1.0
id1,2017-05-27 02:36:00,cotton,3.5,A,01:14:00,2.0  
id1,2017-05-27 02:36:30,cotton,3.5,A,01:13:00,2.0 
id1,2017-05-27 02:37:00,cotton,3.5,B,01:12:00,3.0
id1,2017-05-27 02:37:30,cotton,3.5,B,01:11:00,4.0
id1,2017-05-27 02:38:00,cotton,3.5,B,01:10:00,5.0
id1,2017-05-27 02:38:30,cotton,3.5,C,01:09:00,98.0
id1,2017-05-27 02:39:00,cotton,3.5,C,00:08:00,99.0
id1,2017-05-27 02:39:00,cotton,3.5,C,00:01:00,100.0

I have been using the .shift() and groupby as in the following:

 a = dfb['Operation.progressPercentage'].shift().eq(100)
 grouping = dfb.groupby([dfb.wm_id,a])

but it did not provide the expected results. Please, any help on how I should change the code to get it done?

Many Thanks in advance. Best Regards, Carlo

Upvotes: 1

Answers (3)

jezrael

Reputation: 862471

If Finish value sometimes missing and need use only progressPercentage column use:

shifted = df['progressPercentage'].shift()
#compare difference for second 100 if together 100 (e.g. 15, 16 row)   
m = shifted.diff(-1).ne(0) & shifted.eq(100)
a = m.cumsum()

aa = df.groupby([df.id_B,a])

for k, gp in aa: 
    print('key=' + str(k))
    print(gp)
    print('A NEW ONE...')

key=('id1', 0)
  id_B                 ts_B  course  weight   Phase remainingTime  \
0  id1  2017-04-27 01:35:30  cotton     3.5       A      01:15:00   
1  id1  2017-04-27 01:37:30  cotton     3.5       B      01:13:00   
2  id1  2017-04-27 01:38:00  cotton     3.5       B      01:13:00   
3  id1  2017-04-27 01:38:30  cotton     3.5       C      01:13:00   
4  id1  2017-04-27 01:39:00  cotton     3.5       C      00:02:00   
5  id1  2017-04-27 01:39:30  cotton     3.5       C      00:01:00   
6  id1  2017-04-27 01:40:00  cotton     3.5  Finish      00:01:00   

   progressPercentage  
0                23.0  
1                24.0  
2                24.0  
3                24.0  
4                99.0  
5               100.0  
6               100.0  
A NEW ONE...
key=('id1', 1)
   id_B                 ts_B  course  weight   Phase remainingTime  \
7   id1  2017-04-27 02:35:30  cotton     3.5       A      03:15:00   
8   id1  2017-04-27 02:36:00  cotton     3.5       A      03:14:00   
9   id1  2017-04-27 02:36:30  cotton     3.5       A      03:14:00   
10  id1  2017-04-27 02:37:00  cotton     3.5       B      03:13:00   
11  id1  2017-04-27 02:37:30  cotton     3.5       B      03:13:00   
12  id1  2017-04-27 02:38:00  cotton     3.5       B      03:13:00   
13  id1  2017-04-27 02:38:30  cotton     3.5       C      03:13:00   
14  id1  2017-04-27 02:39:00  cotton     3.5       C      00:02:00   
15  id1  2017-04-27 02:39:30  cotton     3.5       C      00:01:00   
16  id1  2017-04-27 02:40:00  cotton     3.5  Finish      00:01:00   

    progressPercentage  
7                  1.0  
8                  2.0  
9                  2.0  
10                 3.0  
11                 4.0  
12                 5.0  
13                98.0  
14                99.0  
15               100.0  
16               100.0  
A NEW ONE...
key=('id2', 2)

...

Upvotes: 2

Bharath M Shetty

Reputation: 30605

You can divide the dataframe by progressPercentage which is equal 100. Remove the earlier index if they are consecutive.Then slice and append the dataframe to an array. Hope this helps

import numpy as np
df = pd.read_csv('input.csv',delimiter=',') # The input csv provided
df1 = df[(df["progressPercentage "]==100)]
x = (np.array(df1.index) + 1).tolist()
x.insert(0,0)
#Remove the consecutive elements so that they can be treated under one dataframe. 
x = [ begin for begin, end in zip(x, x[1:]) if (begin != end-1)]
x.insert(len(x),df.shape[0])
frames = [df.iloc[begin:end] for begin, end in zip(x, x[1:])]

You can print the dataframes using a for loop, i.e

for df in frames:
    print(df)

Output of the dataframes:

  id_B                 ts_B  course  weight   Phase remainingTime  \
0  id1  2017-04-27 01:35:30  cotton     3.5       A      01:15:00   
1  id1  2017-04-27 01:37:30  cotton     3.5       B      01:13:00   
2  id1  2017-04-27 01:38:00  cotton     3.5       B      01:13:00   
3  id1  2017-04-27 01:38:30  cotton     3.5       C      01:13:00   
4  id1  2017-04-27 01:39:00  cotton     3.5       C      00:02:00   
5  id1  2017-04-27 01:39:30  cotton     3.5       C      00:01:00   
6  id1  2017-04-27 01:40:00  cotton     3.5  Finish      00:01:00   

   progressPercentage   
0                 23.0  
1                 24.0  
2                 24.0  
3                 24.0  
4                 99.0  
5                100.0  
6                100.0  
   id_B                 ts_B  course  weight   Phase remainingTime  \
7   id1  2017-04-27 02:35:30  cotton     3.5       A      03:15:00   
8   id1  2017-04-27 02:36:00  cotton     3.5       A      03:14:00   
9   id1  2017-04-27 02:36:30  cotton     3.5       A      03:14:00   
10  id1  2017-04-27 02:37:00  cotton     3.5       B      03:13:00   
11  id1  2017-04-27 02:37:30  cotton     3.5       B      03:13:00   
12  id1  2017-04-27 02:38:00  cotton     3.5       B      03:13:00   
13  id1  2017-04-27 02:38:30  cotton     3.5       C      03:13:00   
14  id1  2017-04-27 02:39:00  cotton     3.5       C      00:02:00   
15  id1  2017-04-27 02:39:30  cotton     3.5       C      00:01:00   
16  id1  2017-04-27 02:40:00  cotton     3.5  Finish      00:01:00   

    progressPercentage   
7                   1.0  
8                   2.0  
9                   2.0  
10                  3.0  
11                  4.0  
12                  5.0  
13                 98.0  
14                 99.0  
15                100.0  
16                100.0  
   id_B                 ts_B  course  weight   Phase remainingTime  \
17  id2  2017-04-27 03:36:00  cotton     3.5       A      03:15:00   
18  id2  2017-04-27 03:36:30  cotton     3.5       A      03:14:00   
19  id2  2017-04-27 03:37:00  cotton     3.5       B      03:13:00   
20  id2  2017-04-27 03:37:30  cotton     3.5       B      03:13:00   
21  id2  2017-04-27 03:38:00  cotton     3.5       B      03:13:00   
22  id2  2017-04-27 03:38:30  cotton     3.5       C      03:13:00   
23  id2  2017-04-27 03:39:00  cotton     3.5       C      00:02:00   
24  id2  2017-04-27 03:39:30  cotton     3.5       C      00:01:00   
25  id2  2017-04-27 03:40:00  cotton     3.5  Finish      00:01:00   

    progressPercentage   
17                  1.0  
18                  1.0  
19                  2.0  
20                  2.0  
21                  3.0  
22                 98.0  
23                 99.0  
24                100.0  
25                100.0  
   id_B                 ts_B  course  weight   Phase remainingTime  \
26  id1  2017-05-27 01:35:30  cotton     3.5       A      03:15:00   
27  id1  2017-05-27 01:37:30  cotton     3.5       B      03:13:00   
28  id1  2017-05-27 01:38:00  cotton     3.5       B      03:13:00   
29  id1  2017-05-27 01:38:30  cotton     3.5       C      03:13:00   
30  id1  2017-05-27 01:39:00  cotton     3.5       C      00:02:00   
31  id1  2017-05-27 01:39:30  cotton     3.5       C      00:01:00   
32  id1  2017-05-27 01:40:00  cotton     3.5  Finish      00:01:00   

    progressPercentage   
26                 23.0  
27                 24.0  
28                 24.0  
29                 24.0  
30                 99.0  
31                100.0  
32                100.0  
   id_B                 ts_B  course  weight Phase remainingTime  \
33  id1  2017-05-27 02:35:30  cotton     3.5     A      01:15:00   
34  id1  2017-05-27 02:36:00  cotton     3.5     A      01:14:00   
35  id1  2017-05-27 02:36:30  cotton     3.5     A      01:13:00   
36  id1  2017-05-27 02:37:00  cotton     3.5     B      01:12:00   
37  id1  2017-05-27 02:37:30  cotton     3.5     B      01:11:00   
38  id1  2017-05-27 02:38:00  cotton     3.5     B      01:10:00   
39  id1  2017-05-27 02:38:30  cotton     3.5     C      01:09:00   
40  id1  2017-05-27 02:39:00  cotton     3.5     C      00:08:00   
41  id1  2017-05-27 02:39:00  cotton     3.5     C      00:08:00   

    progressPercentage   
33                  1.0  
34                  2.0  
35                  2.0  
36                  3.0  
37                  4.0  
38                  5.0  
39                 98.0  
40                 99.0  
41                100.0  
   id_B                 ts_B  course  weight   Phase remainingTime  \
42  id2  2017-04-27 03:36:00  cotton     3.5       A      03:15:00   
43  id2  2017-04-27 03:36:30  cotton     3.5       A      03:14:00   
44  id2  2017-04-27 03:37:00  cotton     3.5       B      03:13:00   
45  id2  2017-04-27 03:37:30  cotton     3.5       B      03:13:00   
46  id2  2017-04-27 03:38:00  cotton     3.5       B      03:13:00   
47  id2  2017-04-27 03:38:30  cotton     3.5       C      03:13:00   
48  id2  2017-04-27 03:39:00  cotton     3.5       C      00:02:00   
49  id2  2017-04-27 03:39:30  cotton     3.5       C      00:01:00   
50  id2  2017-04-27 03:40:00  cotton     3.5  Finish      00:01:00   

    progressPercentage   
42                  1.0  
43                  1.0  
44                  2.0  
45                  2.0  
46                  3.0  
47                 98.0  
48                 99.0  
49                100.0  
50                100.0

Upvotes: 2

Carlo Allocca

Reputation: 641

the best way I found is the following:

    a = dfb['progressPercentage'].shift().eq(100).cumsum()
    df_output = dfb.groupby([dfb.id_B,a])

    for k, gp in aa: 
        print('key=' + str(k))
        print(gp.sort_values(['eventTime', 'wm_id'], ascending=[1, 0]).to_string())
        print('A NEW ONE...')

Upvotes: 0

splitting a pandas Dataframe

Answers (3)

Related Questions