Make42

Reputation: 13108

Randomly insert NA's values in a pandas dataframe - with no rows completely missing

How can I randomly set some values to missing in a pandas DataFrame, as in "Randomly insert NA's values in a pandas dataframe", while making sure that no row ends up with all of its values missing?

Edit: Sorry for not stating this explicitly again (it was in the question I referenced, though): I need to be able to specify what percentage of the cells, for example 10%, is supposed to be NaN (or rather, as close to 10% as can be achieved with the existing data frame's size), as opposed to, say, clearing cells independently with a marginal per-cell probability of 10%.
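To make the distinction in the edit concrete, here is a minimal sketch that contrasts the two ways of choosing cells. The shape, the seed and the variable names are illustrative assumptions, and neither mask protects rows from being blanked completely; it only shows that the first approach fixes the number of missing cells while the second leaves it random.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
shape = (20, 5)                 # hypothetical frame size: 100 cells
frac = 0.10

# Exact count: pick round(frac * n_cells) distinct cells.
n_cells = shape[0] * shape[1]
n_missing = int(round(frac * n_cells))
exact = np.zeros(n_cells, dtype=bool)
exact[rng.choice(n_cells, size=n_missing, replace=False)] = True
exact = exact.reshape(shape)

# Independent per-cell probability: the number of True cells is random.
bernoulli = rng.random(shape) < frac

print(exact.sum(), bernoulli.sum())
```

The exact mask always contains `n_missing` True cells, whereas the count in `bernoulli` varies from run to run around `frac * n_cells`.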

Upvotes: 6

Views: 6494

Answers (3)

jezrael

Reputation: 863166

You can use DataFrame.mask with a NumPy boolean mask; the way the mask is built comes from an answer to a question of mine:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A':[1,2,3],
                   'B':[4,5,6],
                   'C':[7,8,9]})

print (df)
   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9

np.random.seed(100)
mask = np.random.choice([True, False], size=df.shape)
print (mask)
[[ True  True False]
 [False False False]
 [ True  True  True]] -> problematic values - all True

# In every row that is all True, unmask the last column so one value survives
mask[mask.all(axis=1), -1] = False
print (mask)
[[ True  True False]
 [False False False]
 [ True  True False]]

print (df.mask(mask))
     A    B  C
0  NaN  NaN  7
1  2.0  5.0  8
2  NaN  NaN  9
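The mask above marks each cell with probability 0.5, so it does not hit a chosen percentage. One possible way to get an exact count with the same DataFrame.mask call is sketched below. The function name `exact_fraction_mask` and its parameters are hypothetical, and protecting one randomly chosen cell per row is an assumption of this sketch, not part of the answer; it guarantees that no row is fully masked but does not sample uniformly from all valid masks.

```python
import numpy as np
import pandas as pd


def exact_fraction_mask(shape, frac, rng):
    """Boolean mask with round(frac * size) True cells and no all-True row."""
    n_rows, n_cols = shape
    n_missing = int(round(frac * n_rows * n_cols))
    # One randomly chosen cell per row is protected and never masked.
    protected = rng.integers(0, n_cols, size=n_rows)
    candidates = np.ones(shape, dtype=bool)
    candidates[np.arange(n_rows), protected] = False
    flat = np.flatnonzero(candidates)  # flat indices of maskable cells
    if n_missing > flat.size:
        raise ValueError("frac is too large to keep one value per row")
    chosen = rng.choice(flat, size=n_missing, replace=False)
    mask = np.zeros(n_rows * n_cols, dtype=bool)
    mask[chosen] = True
    return mask.reshape(shape)


df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
rng = np.random.default_rng(0)
mask = exact_fraction_mask(df.shape, 0.3, rng)
result = df.mask(mask)
print(result)
```

With 9 cells and `frac=0.3`, the target is `round(2.7) = 3` missing cells, which is the closest achievable count for this frame size.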

Upvotes: 5

AndreyF

Reputation: 1838

Here is an answer based on Randomly insert NA's values in a pandas dataframe:

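import collections
import random

import numpy as np

replaced = collections.defaultdict(set)
# All (row, col) positions, visited in random order
ix = [(row, col) for row in range(df.shape[0]) for col in range(df.shape[1])]
random.shuffle(ix)
to_replace = int(round(.1*len(ix)))  # target: 10% of the cells
for row, col in ix:
    # Stop once the target is reached (also covers a target of 0)
    if to_replace <= 0:
        break
    # Skip the cell if it is the last remaining value in its row
    if len(replaced[row]) < df.shape[1] - 1:
        df.iloc[row, col] = np.nan
        to_replace -= 1
        replaced[row].add(col)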

Shuffling puts the cell positions in random order, and the if clause skips any cell whose replacement would leave its row entirely NaN.
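Both properties can be checked on the result. The helper below is a small sketch for that purpose; the name `missing_summary` and the sample frame are made up for illustration and are not part of the answer.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return (fraction of NaN cells, number of rows that are entirely NaN)."""
    na = df.isna()
    return na.to_numpy().mean(), int(na.all(axis=1).sum())


# Hypothetical frame with 2 of its 9 cells missing and no empty row
sample = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                       'B': [np.nan, 5.0, 6.0],
                       'C': [7.0, 8.0, 9.0]})
frac, full_rows = missing_summary(sample)
print(frac, full_rows)
```

A second value of 0 confirms that no row was blanked completely, and the first value can be compared against the requested percentage.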

Upvotes: 1

AndreyF

Reputation: 1838

How about applying a function that replaces the values of random columns in each row? To avoid blanking an entire row, draw the number of values to replace from between 0 and n-1, where n is the number of columns.

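import random

import numpy as np

def add_random_na(row):
    # Work on a float copy so NaN can be stored (assumes numeric columns)
    row = row.astype(float)
    # Blank between 0 and n-1 distinct cells, so at least one value survives
    n_missing = random.randint(0, len(row) - 1)
    positions = random.sample(range(len(row)), n_missing)
    row.iloc[positions] = np.nan
    # Return a Series so that apply rebuilds a DataFrame with the same columns
    return row

df = df.apply(add_random_na, axis=1)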

Upvotes: 1
