Abalan Musk
Abalan Musk

Reputation: 57

backfill pandas dataframe column using a condition

I have a pandas dataframe with 50 million records and what I am trying to do is backfill based on a condition. As we can see that the timestamps for name 800A and Barber align so I assume that the data belongs to same name and it is just an error while recording the data. The same goes with name Mia.

This is just the sample data.

my dataframe looks like this.

datetime name dischargeDate HR Sp x_inc vs_inc rec_num 01-05 18:04:50 Zawisza 14-01-05 18:05:00 119 98 FALSE TRUE 6458445 01-05 18:04:55 Zawisza 14-01-05 18:05:00 120 97 FALSE TRUE 6458445 01-05 18:05:00 Zawisza 14-01-05 18:05:00 FALSE FALSE
01-29 17:58:45 800A 14-01-29 17:59:10 FALSE FALSE
01-29 17:58:50 800A 14-01-29 17:59:10 139 FALSE TRUE
01-29 17:58:55 800A 14-01-29 17:59:10 138 FALSE TRUE
01-29 17:59:00 800A 14-01-29 17:59:10 138 96 FALSE TRUE
01-29 17:59:15 Barber 14-01-29 18:17:15 138 96 FALSE TRUE 7192783 01-29 17:59:20 Barber 14-01-29 18:17:15 138 96 FALSE TRUE 7192783 01-29 17:59:25 Barber 14-01-29 18:17:15 138 95 FALSE TRUE 7192783 03-04 21:19:45 800A 15-03-05 01:00:15 FALSE FALSE
03-05 00:53:10 800A 15-03-05 01:00:15 FALSE FALSE
03-05 00:55:50 800A 15-03-05 01:00:15 94 FALSE TRUE
03-05 00:55:55 800A 15-03-05 01:00:15 81 93 FALSE TRUE
03-05 00:56:00 800A 15-03-05 01:00:15 89 93 FALSE TRUE
03-05 01:00:20 Mia 15-03-05 04:13:15 70 93 FALSE TRUE 6728923 03-05 01:00:25 Mia 15-03-05 04:13:15 70 93 FALSE TRUE 6728923 03-05 01:00:30 Mia 15-03-05 04:13:15 70 94 FALSE TRUE 6728923

Now I am trying to backfill the record numbers(rec_num) column until it maps the bool condition False False in both the x_inc and vs_inc columns.

Actual output:

datetime name dischargeDate HR Sp x_inc vs_inc rec_num 01-05 18:04:50 Zawisza 14-01-05 18:05:00 119 98 FALSE TRUE 6458445 01-05 18:04:55 Zawisza 14-01-05 18:05:00 120 97 FALSE TRUE 6458445 01-05 18:05:00 Zawisza 14-01-05 18:05:00 FALSE FALSE 7192783 01-29 17:58:45 800A 14-01-29 17:59:10 FALSE FALSE 7192783 01-29 17:58:50 800A 14-01-29 17:59:10 139 FALSE TRUE 7192783 01-29 17:58:55 800A 14-01-29 17:59:10 138 FALSE TRUE 7192783 01-29 17:59:00 800A 14-01-29 17:59:10 138 96 FALSE TRUE 7192783 01-29 17:59:15 Barber 14-01-29 18:17:15 138 96 FALSE TRUE 7192783 01-29 17:59:20 Barber 14-01-29 18:17:15 138 96 FALSE TRUE 7192783 01-29 17:59:25 Barber 14-01-29 18:17:15 138 95 FALSE TRUE 7192783 03-04 21:19:45 800A 15-03-05 01:00:15 FALSE FALSE 6728923 03-05 00:53:10 800A 15-03-05 01:00:15 FALSE FALSE 6728923 03-05 00:55:50 800A 15-03-05 01:00:15 94 FALSE TRUE 6728923 03-05 00:55:55 800A 15-03-05 01:00:15 81 93 FALSE TRUE 6728923 03-05 00:56:00 800A 15-03-05 01:00:15 89 93 FALSE TRUE 6728923 03-05 01:00:20 Mia 15-03-05 04:13:15 70 93 FALSE TRUE 6728923 03-05 01:00:25 Mia 15-03-05 04:13:15 70 93 FALSE TRUE 6728923 03-05 01:00:30 Mia 15-03-05 04:13:15 70 94 FALSE TRUE 6728923

Expected output:

datetime name dischargeDate HR Sp x_inc vs_inc rec_num 01-05 18:04:50 Zawisza 14-01-05 18:05:00 119 98 FALSE TRUE 6458445 01-05 18:04:55 Zawisza 14-01-05 18:05:00 120 97 FALSE TRUE 6458445 01-05 18:05:00 Zawisza 14-01-05 18:05:00 FALSE FALSE
01-29 17:58:45 800A 14-01-29 17:59:10 FALSE FALSE
01-29 17:58:50 800A 14-01-29 17:59:10 139 FALSE TRUE 7192783 01-29 17:58:55 800A 14-01-29 17:59:10 138 FALSE TRUE 7192783 01-29 17:59:00 800A 14-01-29 17:59:10 138 96 FALSE TRUE 7192783 01-29 17:59:15 Barber 14-01-29 18:17:15 138 96 FALSE TRUE 7192783 01-29 17:59:20 Barber 14-01-29 18:17:15 138 96 FALSE TRUE 7192783 01-29 17:59:25 Barber 14-01-29 18:17:15 138 95 FALSE TRUE 7192783 03-04 21:19:45 800A 15-03-05 01:00:15 FALSE FALSE
03-05 00:53:10 800A 15-03-05 01:00:15 FALSE FALSE
03-05 00:55:50 800A 15-03-05 01:00:15 94 FALSE TRUE 6728923 03-05 00:55:55 800A 15-03-05 01:00:15 81 93 FALSE TRUE 6728923 03-05 00:56:00 800A 15-03-05 01:00:15 89 93 FALSE TRUE 6728923 03-05 01:00:20 Mia 15-03-05 04:13:15 70 93 FALSE TRUE 6728923 03-05 01:00:25 Mia 15-03-05 04:13:15 70 93 FALSE TRUE 6728923 03-05 01:00:30 Mia 15-03-05 04:13:15 70 94 FALSE TRUE 6728923

I am using df['rec_num'].fillna(method='bfill') but it fills completely which is not my ideal solution. I would appreciate if I can get any suggestions to this problem(or if there is any better approach). Thanks in advance.

Upvotes: 4

Views: 1559

Answers (2)

anky
anky

Reputation: 75080

Using a boolean mask and np.where() you can use this:

cond=(df.x_inc == False) & (df.vs_inc == False) #creates a boolean mask where both columns are false
df['new_rec']=np.where(~cond,df.rec_num.bfill(),df.rec_num) #does a backfill on where condition is not met
print(df)

Note : you can reassign the values to the old column named rec_num instead of creating a new column. I added that so you could compare. Also this should be the fastest method since vectorized

    datetime            name    dischargeDate       HR      Sp      x_inc   vs_inc  rec_num     new_rec
0   2019-05-01 18:04:50 Zawisza 2005-01-14 18:05:00 119.0   98.0    False   True    6458445.0   6458445.0
1   2019-05-01 18:04:55 Zawisza 2005-01-14 18:05:00 120.0   97.0    False   True    6458445.0   6458445.0
2   2019-05-01 18:05:00 Zawisza 2005-01-14 18:05:00 NaN     NaN     False   False   NaN         NaN
3   2029-01-01 17:58:45 800A    2029-01-14 17:59:10 NaN     NaN     False   False   NaN         NaN
4   2029-01-01 17:58:50 800A    2029-01-14 17:59:10 139.0   NaN     False   True    NaN         7192783.0
5   2029-01-01 17:58:55 800A    2029-01-14 17:59:10 138.0   NaN     False   True    NaN         7192783.0
...........................................................
...........................................................
....................................................
.....................................

Upvotes: 4

Andre Araujo
Andre Araujo

Reputation: 2400

You can use apply.

Create the function:

def foo(x):
    if not bool(x['epic_include']) and not bool(x['vs_include']):
        return None
    else:
        if not pd.isna(x['twist_mrn']):
            return x['twist_mrn']
        else:
            return df['twist_mrn'].iloc[df.iloc[x.name:]['twist_mrn'].first_valid_index()]

So, apply:

df['twist_mrn'] = df.apply(foo, axis=1)

Output:

    datetime    patient_name    dischargeDate   HR  SpO2    epic_include    vs_include  twist_mrn
0   2014-01-05 18:04:50     Zawisza     2014-01-05 18:05:00     119.0   98.0    False   True    4654843.0
1   2014-01-05 18:04:55     Zawisza     2014-01-05 18:05:00     120.0   97.0    False   True    4654843.0
2   2014-01-05 18:05:00     Zawisza     2014-01-05 18:05:00     NaN     NaN     False   False   NaN
3   2014-01-29 17:58:45     800A    2014-01-29 17:59:10     NaN     NaN     False   False   NaN
4   2014-01-29 17:58:50     800A    2014-01-29 17:59:10     139.0   NaN     False   True    4719848.0
5   2014-01-29 17:58:55     800A    2014-01-29 17:59:10     138.0   NaN     False   True    4719848.0
6   2014-01-29 17:59:00     800A    2014-01-29 17:59:10     138.0   96.0    False   True    4719848.0
7   2014-01-29 17:59:05     800A    2014-01-29 17:59:10     138.0   96.0    False   True    4719848.0
8   2014-01-29 17:59:10     800A    2014-01-29 17:59:10     138.0   96.0    False   True    4719848.0
9   2014-01-29 17:59:15     Barber  2014-01-29 18:17:15     138.0   96.0    False   True    4719848.0
10  2014-01-29 17:59:20     Barber  2014-01-29 18:17:15     138.0   96.0    False   True    4719848.0
11  2014-01-29 17:59:25     Barber  2014-01-29 18:17:15     138.0   95.0    False   True    4719848.0
12  2015-03-04 21:19:45     800A    2015-03-05 01:00:15     NaN     NaN     False   False   NaN
13  2015-03-05 00:53:10     800A    2015-03-05 01:00:15     NaN     NaN     False   False   NaN
14  2015-03-05 00:55:40     800A    2015-03-05 01:00:15     NaN     95.0    False   True    4163407.0
15  2015-03-05 00:55:45     800A    2015-03-05 01:00:15     NaN     95.0    False   True    4163407.0
16  2015-03-05 00:55:50     800A    2015-03-05 01:00:15     NaN     94.0    False   True    4163407.0
17  2015-03-05 00:55:55     800A    2015-03-05 01:00:15     81.0    93.0    False   True    4163407.0

Upvotes: 1

Related Questions