Auren Ferguson

Reputation: 489

Pandas: shifting columns depending on if NaN or not

I have a dataframe like so:

phone_number_1_clean    phone_number_2_clean    phone_number_3_clean
                 NaN                     NaN                 8546987
             8316589                 8751369                     NaN
             4569874                     NaN                 2645981

I would like phone_number_1_clean to be as populated as possible. This will require shifting values from phone_number_2_clean or phone_number_3_clean into phone_number_1_clean, and likewise getting phone_number_2_clean as populated as possible once phone_number_1_clean is filled, and so on.

The output should look something like:

phone_number_1_clean    phone_number_2_clean    phone_number_3_clean
             8546987                     NaN                     NaN
             8316589                 8751369                     NaN
             4569874                 2645981                     NaN

I might be able to do it with np.where statements, but it could get messy.

The approach would preferably be vectorised, as it will be applied to large-ish dataframes.
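For reference, a minimal frame matching the sample above could be built like this (the numbers come from the table; treating them as numeric rather than string values is an assumption):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'phone_number_1_clean': [np.nan, 8316589, 4569874],
    'phone_number_2_clean': [np.nan, 8751369, np.nan],
    'phone_number_3_clean': [8546987, np.nan, 2645981],
})
#columns containing NaN are upcast to float; the real data may well hold strings
print (df)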

Upvotes: 7

Views: 2095

Answers (1)

jezrael

Reputation: 863056

Use:

#for each row remove NaNs and create new Series - rows in final df 
df1 = df.apply(lambda x: pd.Series(x.dropna().values), axis=1)
#if the result can have fewer columns than the original df, reindex is necessary
df1 = df1.reindex(columns=range(len(df.columns)))
#assign original columns names
df1.columns = df.columns
print (df1)
  phone_number_1_clean phone_number_2_clean  phone_number_3_clean
0              8546987                  NaN                   NaN
1              8316589              8751369                   NaN
2              4569874              2645981                   NaN
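The apply solution works row-wise: each row is reduced to its non-NaN values and rebuilt as a Series indexed from 0, so the values land left-justified. A quick check of what the lambda returns for one row (not part of the original answer):

row = df.iloc[1]
#NaNs dropped, remaining values re-indexed from 0
print (pd.Series(row.dropna().values))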

Or:

s = df.stack()
s.index = [s.index.get_level_values(0), s.groupby(level=0).cumcount()]

df1 = s.unstack().reindex(columns=range(len(df.columns)))
df1.columns = df.columns
print (df1)
  phone_number_1_clean phone_number_2_clean  phone_number_3_clean
0              8546987                  NaN                   NaN
1              8316589              8751369                   NaN
2              4569874              2645981                   NaN
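The stack approach relies on stack() dropping NaN by default and cumcount renumbering the surviving values within each row from 0; unstack then spreads them back out left-justified. Inspecting the intermediates on the sample frame, roughly:

s = df.stack()                        #long Series with NaNs dropped
print (s.groupby(level=0).cumcount()) #new column position for each surviving value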

Or use a slightly modified justify function:

def justify(a, invalid_val=0, axis=1, side='left'):    
    """
    Justifies a 2D array

    Parameters
    ----------
    a : ndarray
        Input array to be justified
    invalid_val : scalar
        Value marking empty cells (here np.nan)
    axis : int
        Axis along which justification is to be made
    side : str
        Direction of justification. It could be 'left', 'right', 'up', 'down'
        It should be 'left' or 'right' for axis=1 and 'up' or 'down' for axis=0.

    """

    if invalid_val is np.nan:
        mask = pd.notnull(a) #changed to pandas notnull
    else:
        mask = a!=invalid_val
    #sorting the boolean mask pushes True (valid) positions to the end of each row/column
    justified_mask = np.sort(mask,axis=axis)
    if (side=='up') | (side=='left'):
        justified_mask = np.flip(justified_mask,axis=axis)
    #object dtype keeps mixed content (values and NaN) intact
    out = np.full(a.shape, invalid_val, dtype=object) 
    if axis==1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out

df = pd.DataFrame(justify(df.values, invalid_val=np.nan),  
                  index=df.index, columns=df.columns)
print (df)
  phone_number_1_clean phone_number_2_clean phone_number_3_clean
0              8546987                  NaN                  NaN
1              8316589              8751369                  NaN
2              4569874              2645981                  NaN
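The heart of justify is sorting the boolean validity mask: np.sort pushes the True (valid) positions to the end of each row, the flip moves them to the front for left-justification, and boolean indexing scatters the valid values into those slots in row-major order. A minimal NumPy-only sketch of that idea (an illustration, not the answer's code):

a = np.array([[np.nan, np.nan, 8546987.],
              [8316589., 8751369., np.nan]])
mask = ~np.isnan(a)                           #True where a value is present
left = np.flip(np.sort(mask, axis=1), axis=1) #valid slots pushed to the left
out = np.full(a.shape, np.nan)
out[left] = a[mask]                           #row-major assignment keeps rows aligned
print (out)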

Performance:

#3k rows
df = pd.concat([df] * 1000, ignore_index=True)

In [442]: %%timeit
     ...: df1 = df.apply(lambda x: pd.Series(x.dropna().values), axis=1)
     ...: #if the result can have fewer columns than the original df, reindex is necessary
     ...: df1 = df1.reindex(columns=range(len(df.columns)))
     ...: #assign original columns names
     ...: df1.columns = df.columns
     ...: 
1.17 s ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [443]: %%timeit
     ...: s = df.stack()
     ...: s.index = [s.index.get_level_values(0), s.groupby(level=0).cumcount()]
     ...: 
     ...: df1 = s.unstack().reindex(columns=range(len(df.columns)))
     ...: df1.columns = df.columns
     ...: 
     ...: 
5.88 ms ± 74.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [444]: %%timeit
     ...: pd.DataFrame(justify(df.values, invalid_val=np.nan),
     ...:               index=df.index, columns=df.columns)
     ...: 
941 µs ± 131 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Upvotes: 6
