thedatasleuth

Reputation: 549

How to leave NaN behind after shifting over

I have a function that shifts the values of one column (Col_5) into another column (Col_6) if that column (Col_6) is blank, like this:

def shift(row):
    return row['Col_6'] if not pd.isnull(row['Col_6']) else row['Col_5']

I then apply this function to my columns like this:

df[['Col_6', 'Col_5']].apply(shift, axis=1)

This works fine, but instead of leaving the original value in Col_5, I need it to move to Col_6 and leave np.nan in its place (so I can apply the same function to the preceding column). Thoughts?

Upvotes: 2

Views: 107

Answers (3)

user3483203

Reputation: 51165

Setup (using the setup from @cosmic_inquiry)

df = pd.DataFrame({'Col_5':[1, np.nan, 3, 4, np.nan],
                   'Col_6':[np.nan, 8, np.nan, 6, np.nan]})

You can treat this problem as a basic swap operation with a mask: wherever Col_6 is NaN, swap the values of Col_5 and Col_6.

numpy.flip + numpy.isnan:

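# Copy the two columns into a writable NumPy array.
a = df[['Col_5', 'Col_6']].to_numpy(copy=True)
# Rows where Col_6 is NaN are the ones to swap.
m = np.isnan(a[:, 1])
# Reverse the column order within those rows only.
a[m] = np.flip(a[m], axis=1)
df[['Col_5', 'Col_6']] = a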

np.isnan + loc:

m = np.isnan(df['Col_6'])
df.loc[m, ['Col_5', 'Col_6']] = df.loc[m, ['Col_6', 'Col_5']].values

   Col_5  Col_6
0    NaN    1.0
1    NaN    8.0
2    NaN    3.0
3    4.0    6.0
4    NaN    NaN

Performance

test_df = \
    pd.DataFrame(np.random.choice([1, np.nan], (1_000_000, 2)), columns=['Col_5', 'Col_6'])

In [167]: %timeit chris(test_df)
68.3 ms ± 291 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [191]: %timeit chris2(test_df)
43.9 ms ± 296 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [168]: %timeit jpp(test_df)
86.7 ms ± 394 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [169]: %timeit cosmic(test_df)
130 ms ± 1.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
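
The functions being timed are not shown above. As a rough sketch only, a comparison like this could be set up with the standard `timeit` module. The wrapper names (`swap_flip`, `swap_loc`, `fill_mask`) are hypothetical and simply wrap approaches from this thread; this is not the code that produced the timings above, and absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def swap_flip(df):
    # numpy.flip + numpy.isnan approach, applied to a copy of the frame.
    out = df.copy()
    a = out[['Col_5', 'Col_6']].to_numpy(copy=True)
    m = np.isnan(a[:, 1])
    a[m] = np.flip(a[m], axis=1)
    out[['Col_5', 'Col_6']] = a
    return out


def swap_loc(df):
    # Mask + loc approach: swap the two columns on the masked rows.
    out = df.copy()
    m = out['Col_6'].isnull()
    out.loc[m, ['Col_5', 'Col_6']] = out.loc[m, ['Col_6', 'Col_5']].to_numpy()
    return out


def fill_mask(df):
    # fillna + mask approach.
    out = df.copy()
    m = out['Col_6'].isnull()
    out['Col_6'] = out['Col_6'].fillna(out['Col_5'])
    out['Col_5'] = out['Col_5'].mask(m)
    return out


# Smaller than the 1,000,000-row frame above so the sketch runs quickly.
rng = np.random.default_rng(0)
test_df = pd.DataFrame(rng.choice([1, np.nan], (100_000, 2)),
                       columns=['Col_5', 'Col_6'])

expected = fill_mask(test_df)
for fn in (swap_flip, swap_loc, fill_mask):
    # Check that every approach gives the same result before timing it.
    assert fn(test_df).equals(expected)
    seconds = timeit.timeit(lambda: fn(test_df), number=5) / 5
    print(f"{fn.__name__}: {seconds * 1e3:.1f} ms per call")
```

Each wrapper copies its input so that repeated timing runs start from the same data; that copy is included in the measured time.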

Upvotes: 0

cosmic_inquiry

Reputation: 2684

import pandas as pd
import numpy as np
df = pd.DataFrame({'Col_5':[1, np.nan, 3, 4, np.nan],
                   'Col_6':[np.nan, 8, np.nan, 6, np.nan]})
col_5 = df['Col_5'].copy()
df.loc[pd.isnull(df['Col_6']), 'Col_5'] = np.nan
df.loc[pd.isnull(df['Col_6']), 'Col_6'] = col_5

Output:

# Original Dataframe:
   Col_5  Col_6
0    1.0    NaN
1    NaN    8.0
2    3.0    NaN
3    4.0    6.0
4    NaN    NaN
# Fill Col_5 with NaN where Col_6 is NaN:
   Col_5  Col_6
0    NaN    NaN
1    NaN    8.0
2    NaN    NaN
3    4.0    6.0
4    NaN    NaN
# Assign the original col_5 values to Col_6:
   Col_5  Col_6
0    NaN    1.0
1    NaN    8.0
2    NaN    3.0
3    4.0    6.0
4    NaN    NaN

Upvotes: 1

jpp

Reputation: 164753

fillna + mask: vectorise, not row-wise

With Pandas, you should try to avoid row-wise operations via apply, as these are processed via Python-level loops. In this case, you can use:

null_mask = df['Col_6'].isnull()
df['Col_6'] = df['Col_6'].fillna(df['Col_5'])
df['Col_5'] = df['Col_5'].mask(null_mask)

Notice that we first calculate and store a Boolean series marking where Col_6 is null, then use it afterwards to null out the Col_5 values that were moved across via fillna.
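
Since the question mentions repeating the shift on the preceding column, here is a minimal sketch of the same fillna + mask steps wrapped in a reusable function. The helper name `shift_into` and the extra `Col_4` column are assumptions made up for illustration; they are not part of the original question.

```python
import numpy as np
import pandas as pd


def shift_into(df, src, dst):
    """Move values from `src` into `dst` where `dst` is null, leaving NaN in `src`."""
    out = df.copy()
    # Record where dst is empty *before* filling it.
    null_mask = out[dst].isnull()
    out[dst] = out[dst].fillna(out[src])
    # Blank out the source cells on the rows where dst was empty.
    out[src] = out[src].mask(null_mask)
    return out


df = pd.DataFrame({'Col_4': [7, 9, np.nan, 2, 5],  # hypothetical extra column
                   'Col_5': [1, np.nan, 3, 4, np.nan],
                   'Col_6': [np.nan, 8, np.nan, 6, np.nan]})

# Work from the rightmost pair leftwards.
result = shift_into(df, 'Col_5', 'Col_6')
result = shift_into(result, 'Col_4', 'Col_5')
print(result)
```

The order of the calls matters. In this sketch row 4 ends with 5.0 in Col_5 and NaN in Col_6, because Col_6 was filled before Col_5 received a value; processing the pairs left to right instead would carry that value through to Col_6.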

Upvotes: 2
