Apply Across Dynamic Number of Columns

Question

I have a pandas DataFrame and I want to set the last N columns of each row to null, where N depends on the value in another column.

Here is an example:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(4, 5))
df['lookup_key'] = df.index     # (actual data does not use the index here)
lkup_dict = {0: 1, 1: 2, 2: 2, 3: 3}

In this DataFrame, I want to use the value in the 'lookup_key' column to determine which columns to set to null.

Row 0 -> df.iloc[0, lkup_dict[0]:5] = np.nan     # key = 0, value = 1
Row 1 -> df.iloc[1, lkup_dict[1]:5] = np.nan     # key = 1, value = 2
Row 2 -> df.iloc[2, lkup_dict[2]:5] = np.nan     # key = 2, value = 2
Row 3 -> df.iloc[3, lkup_dict[3]:5] = np.nan     # key = 3, value = 3

The end result should look like this:

      0         1         2   3   4  lookup_key
0 -0.882864       NaN       NaN NaN NaN           0
1  1.358663 -0.024898       NaN NaN NaN           1
2  0.885058  0.673621       NaN NaN NaN           2
3 -1.487506  0.031021 -1.313646 NaN NaN           3

In this example I have to type out the assignment by hand for each row. I need something that will do this for all rows of my DataFrame.

McMath · Accepted Answer

You can do this with a for loop. To demonstrate, I generate a DataFrame with some random values. I then insert a lookup_key column at the front with some random integers. Finally, I generate a lkup_dict dictionary with some random values.

>>> import pandas as pd
>>> import numpy as np
>>> 
>>> df = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))
>>> df.insert(0, 'lookup_key', np.random.randint(0, 5, 10))
>>> print(df)

   lookup_key         A         B         C         D
0           0  0.048738  0.773304 -0.912366 -0.832459
1           3 -0.573221 -1.381395 -0.644223  1.888484
2           0  0.198043 -0.751243  0.138277  2.006188
3           2 -1.692605 -1.586282 -0.656690  0.647510
4           3 -0.847591 -0.368447  0.510250 -0.172055
5           1  0.927243 -0.447478  0.796221  0.372763
6           3  0.027285  0.177276  1.087456 -0.420614
7           4 -1.147004 -0.172367 -0.767347 -0.855318
8           1 -0.649695 -0.572409 -0.664149  0.863050
9           4 -0.820982 -0.499889 -0.624889  1.397271

>>> lkup_dict = {i: np.random.randint(0, 5) for i in range(5)}
>>> print(lkup_dict)

{0: 3, 1: 0, 2: 0, 3: 4, 4: 1}

Now I iterate over the rows in the DataFrame. key gets the value under the lookup_key column for that row. nNulls uses the key to get the number of null values from lkup_dict. startIndex gets the position of the first column to set to null in that row. The final line replaces the values from that column onward with null values.

>>> for i, row in df.iterrows():
...     key = int(row['lookup_key'])   # iterrows upcasts the row to float
...     nNulls = lkup_dict[key]
...     startIndex = df.shape[1] - nNulls
...     df.loc[i, df.columns[startIndex:]] = np.nan   # select trailing columns by label
...
>>> print(df)

   lookup_key         A         B         C         D
0           0  0.048738       NaN       NaN       NaN
1           3       NaN       NaN       NaN       NaN
2           0  0.198043       NaN       NaN       NaN
3           2 -1.692605 -1.586282 -0.656690  0.647510
4           3       NaN       NaN       NaN       NaN
5           1  0.927243 -0.447478  0.796221  0.372763
6           3       NaN       NaN       NaN       NaN
7           4 -1.147004 -0.172367 -0.767347       NaN
8           1 -0.649695 -0.572409 -0.664149  0.863050
9           4 -0.820982 -0.499889 -0.624889       NaN
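
Note that the loop above treats the dictionary value as the number of trailing columns to null, with lookup_key at the front. In the question's layout the value is instead the position of the first column to null, and lookup_key comes last. A minimal sketch of the same loop adapted to that layout might look like this. The names df_q, lkup_dict_q and n_data_cols are introduced here for illustration only, and the sketch assumes the data columns occupy positions 0 to 4.

```python
import numpy as np
import pandas as pd

# Same layout as the question: five data columns labelled 0..4, then lookup_key.
df_q = pd.DataFrame(np.random.randn(4, 5))
df_q['lookup_key'] = df_q.index
lkup_dict_q = {0: 1, 1: 2, 2: 2, 3: 3}

# Assumption: only the first five columns are data columns that may be nulled.
n_data_cols = 5

for pos in range(len(df_q)):
    key = int(df_q['lookup_key'].iloc[pos])
    start = lkup_dict_q[key]              # first column position to null
    # Positional slice: end is exclusive, so lookup_key is never touched.
    df_q.iloc[pos, start:n_data_cols] = np.nan
```

Using iloc here sidesteps the question of whether integer slices are read as labels or positions, since the columns in this layout mix integer labels with a string label.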

That's it. Hopefully that's what you're looking for.
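
For larger frames the row-by-row loop can be slow. One possible alternative is to build a boolean mask with NumPy broadcasting and pass it to DataFrame.mask. This is only a sketch: the helper name null_trailing_columns and its key_col parameter are hypothetical, and it assumes every key appears in lkup_dict and that the null count never reaches the lookup_key column itself.

```python
import numpy as np
import pandas as pd


def null_trailing_columns(df, lkup_dict, key_col='lookup_key'):
    """Return a copy of df with the last lkup_dict[key] columns of each row set to NaN."""
    # Number of trailing columns to null, one entry per row.
    n_nulls = df[key_col].map(lkup_dict).to_numpy()
    # Position of the first column to null in each row.
    start = df.shape[1] - n_nulls
    col_positions = np.arange(df.shape[1])
    # mask[r, c] is True where column c lies at or beyond the start position of row r.
    mask = col_positions[None, :] >= start[:, None]
    return df.mask(mask)


# Small example with the same layout as the answer (lookup_key in front).
example = pd.DataFrame(np.random.randn(5, 4), columns=list('ABCD'))
example.insert(0, 'lookup_key', [0, 3, 2, 4, 1])
result = null_trailing_columns(example, {0: 3, 1: 0, 2: 0, 3: 4, 4: 1})
print(result)
```

Unlike the loop, this version returns a new DataFrame and leaves the original unchanged.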
