Reputation: 57
I have a pandas dataframe with 50 million records and what I am trying to do is backfill based on a condition. As we can see that the timestamps for name 800A and Barber align so I assume that the data belongs to same name and it is just an error while recording the data. The same goes with name Mia.
This is just the sample data.
my dataframe looks like this.
datetime name dischargeDate HR Sp x_inc vs_inc rec_num
01-05 18:04:50 Zawisza 14-01-05 18:05:00 119 98 FALSE TRUE 6458445
01-05 18:04:55 Zawisza 14-01-05 18:05:00 120 97 FALSE TRUE 6458445
01-05 18:05:00 Zawisza 14-01-05 18:05:00 FALSE FALSE
01-29 17:58:45 800A 14-01-29 17:59:10 FALSE FALSE
01-29 17:58:50 800A 14-01-29 17:59:10 139 FALSE TRUE
01-29 17:58:55 800A 14-01-29 17:59:10 138 FALSE TRUE
01-29 17:59:00 800A 14-01-29 17:59:10 138 96 FALSE TRUE
01-29 17:59:15 Barber 14-01-29 18:17:15 138 96 FALSE TRUE 7192783
01-29 17:59:20 Barber 14-01-29 18:17:15 138 96 FALSE TRUE 7192783
01-29 17:59:25 Barber 14-01-29 18:17:15 138 95 FALSE TRUE 7192783
03-04 21:19:45 800A 15-03-05 01:00:15 FALSE FALSE
03-05 00:53:10 800A 15-03-05 01:00:15 FALSE FALSE
03-05 00:55:50 800A 15-03-05 01:00:15 94 FALSE TRUE
03-05 00:55:55 800A 15-03-05 01:00:15 81 93 FALSE TRUE
03-05 00:56:00 800A 15-03-05 01:00:15 89 93 FALSE TRUE
03-05 01:00:20 Mia 15-03-05 04:13:15 70 93 FALSE TRUE 6728923
03-05 01:00:25 Mia 15-03-05 04:13:15 70 93 FALSE TRUE 6728923
03-05 01:00:30 Mia 15-03-05 04:13:15 70 94 FALSE TRUE 6728923
Now I am trying to backfill the record numbers(rec_num) column until it maps the bool condition False False in both the x_inc and vs_inc columns.
Actual output:
datetime name dischargeDate HR Sp x_inc vs_inc rec_num
01-05 18:04:50 Zawisza 14-01-05 18:05:00 119 98 FALSE TRUE 6458445
01-05 18:04:55 Zawisza 14-01-05 18:05:00 120 97 FALSE TRUE 6458445
01-05 18:05:00 Zawisza 14-01-05 18:05:00 FALSE FALSE 7192783
01-29 17:58:45 800A 14-01-29 17:59:10 FALSE FALSE 7192783
01-29 17:58:50 800A 14-01-29 17:59:10 139 FALSE TRUE 7192783
01-29 17:58:55 800A 14-01-29 17:59:10 138 FALSE TRUE 7192783
01-29 17:59:00 800A 14-01-29 17:59:10 138 96 FALSE TRUE 7192783
01-29 17:59:15 Barber 14-01-29 18:17:15 138 96 FALSE TRUE 7192783
01-29 17:59:20 Barber 14-01-29 18:17:15 138 96 FALSE TRUE 7192783
01-29 17:59:25 Barber 14-01-29 18:17:15 138 95 FALSE TRUE 7192783
03-04 21:19:45 800A 15-03-05 01:00:15 FALSE FALSE 6728923
03-05 00:53:10 800A 15-03-05 01:00:15 FALSE FALSE 6728923
03-05 00:55:50 800A 15-03-05 01:00:15 94 FALSE TRUE 6728923
03-05 00:55:55 800A 15-03-05 01:00:15 81 93 FALSE TRUE 6728923
03-05 00:56:00 800A 15-03-05 01:00:15 89 93 FALSE TRUE 6728923
03-05 01:00:20 Mia 15-03-05 04:13:15 70 93 FALSE TRUE 6728923
03-05 01:00:25 Mia 15-03-05 04:13:15 70 93 FALSE TRUE 6728923
03-05 01:00:30 Mia 15-03-05 04:13:15 70 94 FALSE TRUE 6728923
Expected output:
datetime name dischargeDate HR Sp x_inc vs_inc rec_num
01-05 18:04:50 Zawisza 14-01-05 18:05:00 119 98 FALSE TRUE 6458445
01-05 18:04:55 Zawisza 14-01-05 18:05:00 120 97 FALSE TRUE 6458445
01-05 18:05:00 Zawisza 14-01-05 18:05:00 FALSE FALSE
01-29 17:58:45 800A 14-01-29 17:59:10 FALSE FALSE
01-29 17:58:50 800A 14-01-29 17:59:10 139 FALSE TRUE 7192783
01-29 17:58:55 800A 14-01-29 17:59:10 138 FALSE TRUE 7192783
01-29 17:59:00 800A 14-01-29 17:59:10 138 96 FALSE TRUE 7192783
01-29 17:59:15 Barber 14-01-29 18:17:15 138 96 FALSE TRUE 7192783
01-29 17:59:20 Barber 14-01-29 18:17:15 138 96 FALSE TRUE 7192783
01-29 17:59:25 Barber 14-01-29 18:17:15 138 95 FALSE TRUE 7192783
03-04 21:19:45 800A 15-03-05 01:00:15 FALSE FALSE
03-05 00:53:10 800A 15-03-05 01:00:15 FALSE FALSE
03-05 00:55:50 800A 15-03-05 01:00:15 94 FALSE TRUE 6728923
03-05 00:55:55 800A 15-03-05 01:00:15 81 93 FALSE TRUE 6728923
03-05 00:56:00 800A 15-03-05 01:00:15 89 93 FALSE TRUE 6728923
03-05 01:00:20 Mia 15-03-05 04:13:15 70 93 FALSE TRUE 6728923
03-05 01:00:25 Mia 15-03-05 04:13:15 70 93 FALSE TRUE 6728923
03-05 01:00:30 Mia 15-03-05 04:13:15 70 94 FALSE TRUE 6728923
I am using df['rec_num'].fillna(method='bfill')
but it fills completely which is not my ideal solution. I would appreciate if I can get any suggestions to this problem(or if there is any better approach). Thanks in advance.
Upvotes: 4
Views: 1559
Reputation: 75080
Using a boolean mask and np.where()
you can use this:
cond=(df.x_inc == False) & (df.vs_inc == False) #creates a boolean mask where both columns are false
df['new_rec']=np.where(~cond,df.rec_num.bfill(),df.rec_num) #does a backfill on where condition is not met
print(df)
Note : you can reassign the values to the old column named rec_num
instead of creating a new column. I added that so you could compare. Also this should be the fastest method since vectorized
datetime name dischargeDate HR Sp x_inc vs_inc rec_num new_rec
0 2019-05-01 18:04:50 Zawisza 2005-01-14 18:05:00 119.0 98.0 False True 6458445.0 6458445.0
1 2019-05-01 18:04:55 Zawisza 2005-01-14 18:05:00 120.0 97.0 False True 6458445.0 6458445.0
2 2019-05-01 18:05:00 Zawisza 2005-01-14 18:05:00 NaN NaN False False NaN NaN
3 2029-01-01 17:58:45 800A 2029-01-14 17:59:10 NaN NaN False False NaN NaN
4 2029-01-01 17:58:50 800A 2029-01-14 17:59:10 139.0 NaN False True NaN 7192783.0
5 2029-01-01 17:58:55 800A 2029-01-14 17:59:10 138.0 NaN False True NaN 7192783.0
...........................................................
...........................................................
....................................................
.....................................
Upvotes: 4
Reputation: 2400
You can use apply.
Create the function:
def foo(x):
if not bool(x['epic_include']) and not bool(x['vs_include']):
return None
else:
if not pd.isna(x['twist_mrn']):
return x['twist_mrn']
else:
return df['twist_mrn'].iloc[df.iloc[x.name:]['twist_mrn'].first_valid_index()]
So, apply:
df['twist_mrn'] = df.apply(foo, axis=1)
Output:
datetime patient_name dischargeDate HR SpO2 epic_include vs_include twist_mrn
0 2014-01-05 18:04:50 Zawisza 2014-01-05 18:05:00 119.0 98.0 False True 4654843.0
1 2014-01-05 18:04:55 Zawisza 2014-01-05 18:05:00 120.0 97.0 False True 4654843.0
2 2014-01-05 18:05:00 Zawisza 2014-01-05 18:05:00 NaN NaN False False NaN
3 2014-01-29 17:58:45 800A 2014-01-29 17:59:10 NaN NaN False False NaN
4 2014-01-29 17:58:50 800A 2014-01-29 17:59:10 139.0 NaN False True 4719848.0
5 2014-01-29 17:58:55 800A 2014-01-29 17:59:10 138.0 NaN False True 4719848.0
6 2014-01-29 17:59:00 800A 2014-01-29 17:59:10 138.0 96.0 False True 4719848.0
7 2014-01-29 17:59:05 800A 2014-01-29 17:59:10 138.0 96.0 False True 4719848.0
8 2014-01-29 17:59:10 800A 2014-01-29 17:59:10 138.0 96.0 False True 4719848.0
9 2014-01-29 17:59:15 Barber 2014-01-29 18:17:15 138.0 96.0 False True 4719848.0
10 2014-01-29 17:59:20 Barber 2014-01-29 18:17:15 138.0 96.0 False True 4719848.0
11 2014-01-29 17:59:25 Barber 2014-01-29 18:17:15 138.0 95.0 False True 4719848.0
12 2015-03-04 21:19:45 800A 2015-03-05 01:00:15 NaN NaN False False NaN
13 2015-03-05 00:53:10 800A 2015-03-05 01:00:15 NaN NaN False False NaN
14 2015-03-05 00:55:40 800A 2015-03-05 01:00:15 NaN 95.0 False True 4163407.0
15 2015-03-05 00:55:45 800A 2015-03-05 01:00:15 NaN 95.0 False True 4163407.0
16 2015-03-05 00:55:50 800A 2015-03-05 01:00:15 NaN 94.0 False True 4163407.0
17 2015-03-05 00:55:55 800A 2015-03-05 01:00:15 81.0 93.0 False True 4163407.0
Upvotes: 1