How to randomly select some pandas dataframe rows?

Question

I have a pandas dataframe df which contains a column amount. For many rows, the amount is zero. I want to randomly remove 50% of the rows where the amount is zero, keeping all rows where amount is nonzero. How can I do this?

piRSquared · Accepted Answer

`pandas`

Using query + sample

df.drop(df.query('amount == 0').sample(frac=.5).index)

Consider the dataframe df

import pandas as pd

df = pd.DataFrame(dict(amount=[0, 1] * 10))

# drop a random half of the zero-amount rows; every nonzero row is kept
df.drop(df.query('amount == 0').sample(frac=.5).index)
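To check that the one-liner does what the question asks, here is a minimal sketch on the same example frame. The `random_state=0` seed is my addition so the draw is repeatable (the value is arbitrary), and the names `to_drop` and `result` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(dict(amount=[0, 1] * 10))

# random_state is an assumption added for repeatability; any seed works
to_drop = df.query('amount == 0').sample(frac=.5, random_state=0).index
result = df.drop(to_drop)

n_zero_before = (df.amount == 0).sum()
n_zero_after = (result.amount == 0).sum()
n_nonzero_after = (result.amount != 0).sum()
print(n_zero_before, n_zero_after, n_nonzero_after)  # 10 5 10
```

Which zero rows are dropped depends on the seed, but the counts do not: half of the zero rows go and all ten nonzero rows stay.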

`numpy`

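import numpy as np

iszero = df.amount.values == 0        # boolean mask of the zero-amount rows
count_zeros = iszero.sum()
idx = np.arange(iszero.shape[0])      # positional index of every row
# positions of the zero rows to keep: a random half, drawn without replacement
keep_these = np.random.choice(idx[iszero], int(count_zeros * .5), replace=False)

# all nonzero rows plus the kept zero rows, back in their original order
df.iloc[np.sort(np.concatenate([idx[~iszero], keep_these]))]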
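A variant of the same idea, not the answer's own code: instead of collecting the positions to keep and sorting them, one could build a boolean mask and switch off the positions to drop, which preserves row order without a sort. This sketch assumes NumPy's `default_rng` generator with an arbitrary seed; `zero_pos`, `drop_pos`, and `keep` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(amount=[0, 1] * 10))

rng = np.random.default_rng(0)  # seed chosen arbitrarily, for repeatability
iszero = df.amount.to_numpy() == 0
zero_pos = np.flatnonzero(iszero)  # positions of the zero-amount rows

# pick half of the zero positions to drop, without replacement
drop_pos = rng.choice(zero_pos, size=len(zero_pos) // 2, replace=False)

keep = np.ones(len(df), dtype=bool)  # start by keeping every row
keep[drop_pos] = False
result = df[keep]
print(len(result))  # 15
```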

time test
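One way to compare the two approaches yourself is a rough `timeit` sketch like the one below. The frame size, the repeat count, and the wrapper names `with_pandas` and `with_numpy` are my assumptions; the numbers will vary by machine and library version, so no results are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

# a larger frame than the example so the timings are measurable
df = pd.DataFrame(dict(amount=[0, 1] * 5000))

def with_pandas(df):
    return df.drop(df.query('amount == 0').sample(frac=.5).index)

def with_numpy(df):
    iszero = df.amount.values == 0
    idx = np.arange(iszero.shape[0])
    keep_these = np.random.choice(idx[iszero], int(iszero.sum() * .5), replace=False)
    return df.iloc[np.sort(np.concatenate([idx[~iszero], keep_these]))]

# total seconds for 20 calls of each approach
t_pandas = timeit.timeit(lambda: with_pandas(df), number=20)
t_numpy = timeit.timeit(lambda: with_numpy(df), number=20)
print(f"pandas: {t_pandas:.4f}s  numpy: {t_numpy:.4f}s")
```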

Per the comment from @tomcy, you can pass inplace=True to remove the rows from df directly, without having to reassign df.

df.drop(df.query('amount == 0').sample(frac=.5).index, inplace=True)
df

    amount
1        1
2        0
3        1
5        1
6        0
7        1
8        0
9        1
10       0
11       1
12       0
13       1
15       1
17       1
19       1
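One thing to watch with inplace=True is that `drop` then returns `None`, so its result should not be assigned back to `df`. A short sketch on the same example frame (the names `to_drop` and `returned` are illustrative):

```python
import pandas as pd

df = pd.DataFrame(dict(amount=[0, 1] * 10))
to_drop = df.query('amount == 0').sample(frac=.5).index

returned = df.drop(to_drop, inplace=True)  # modifies df itself
print(returned)  # None
print(len(df))   # 15
```

Writing `df = df.drop(..., inplace=True)` would therefore replace the frame with `None`.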
