royco

Reputation: 5529

How to randomly select some pandas dataframe rows?

I have a pandas dataframe df which contains a column amount. For many rows, the amount is zero. I want to randomly remove 50% of the rows where the amount is zero, keeping all rows where amount is nonzero. How can I do this?

Upvotes: 3

Views: 1481

Answers (2)

JohnE

Reputation: 30424

A minor tweak on @piRSquared's answer (using a boolean selection instead of query):

df.drop(df[df.amount == 0].sample(frac=.5).index)

It's about twice as fast as using query, but 3x slower than the numpy way.
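To check that the one-liner does what the question asks, here is a minimal sketch that wraps it in a function and counts the surviving rows. The helper name drop_half_zeros and the seed argument are my own additions for illustration, not part of the answer. It also assumes the index labels are unique, since df.drop removes every row that carries a dropped label.

```python
import pandas as pd


def drop_half_zeros(df, column="amount", seed=None):
    """Drop a random half of the rows where `column` is zero (hypothetical helper)."""
    zero_rows = df[df[column] == 0]  # boolean selection, as in the answer
    # random_state is optional; it only makes the sample repeatable
    to_drop = zero_rows.sample(frac=0.5, random_state=seed).index
    return df.drop(to_drop)


df = pd.DataFrame({"amount": [0, 1] * 10})  # 10 zero rows, 10 nonzero rows
result = drop_half_zeros(df, seed=0)

n_zero = int((result["amount"] == 0).sum())
n_nonzero = int((result["amount"] != 0).sum())
print(n_zero, n_nonzero)  # 5 10
```

All ten nonzero rows survive and five of the ten zero rows are removed; which five depends on the seed.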

Upvotes: 2

piRSquared

Reputation: 294228

pandas

Using query + sample

df.drop(df.query('amount == 0').sample(frac=.5).index)

Consider the dataframe df

import pandas as pd

df = pd.DataFrame(dict(amount=[0, 1] * 10))

df.drop(df.query('amount == 0').sample(frac=.5).index)
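Because sample is random, each run drops a different set of rows. If you need a repeatable result, sample accepts a random_state argument. A small sketch, assuming a seed of 42 (an arbitrary choice on my part):

```python
import pandas as pd

df = pd.DataFrame(dict(amount=[0, 1] * 10))

# The same random_state selects the same rows, so both results match
first = df.drop(df.query('amount == 0').sample(frac=.5, random_state=42).index)
second = df.drop(df.query('amount == 0').sample(frac=.5, random_state=42).index)

print(first.equals(second))  # True
```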

numpy

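import numpy as np

iszero = df.amount.values == 0        # boolean mask of the zero rows
count_zeros = iszero.sum()
idx = np.arange(iszero.shape[0])      # positional index of every row
# positions of the zero rows to keep: half of them, sampled without replacement
keep_these = np.random.choice(idx[iszero], int(count_zeros * .5), replace=False)

# all nonzero rows plus the sampled zero rows, back in original order
df.iloc[np.sort(np.concatenate([idx[~iszero], keep_these]))]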

    amount
1        1
2        0
3        1
5        1
6        0
7        1
8        0
9        1
10       0
11       1
12       0
13       1
15       1
17       1
19       1
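The same idea can be written with a boolean keep-mask, which avoids the concatenate and sort steps and works by position, so duplicate index labels are not a problem. This is a sketch of a variant, not the code that produced the output above. The function name drop_half_zeros_np and the seed parameter are hypothetical, and it uses np.random.default_rng in place of np.random.choice.

```python
import numpy as np
import pandas as pd


def drop_half_zeros_np(df, column="amount", seed=None):
    """Keep all nonzero rows and a random half of the zero rows (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    iszero = df[column].to_numpy() == 0
    zero_pos = np.flatnonzero(iszero)  # positions of the zero rows
    # keep floor(half) of the zero rows, sampled without replacement
    keep_zero = rng.choice(zero_pos, size=zero_pos.size // 2, replace=False)
    keep = ~iszero                     # start with every nonzero row
    keep[keep_zero] = True             # add back the sampled zero rows
    return df[keep]                    # boolean mask preserves row order


df = pd.DataFrame(dict(amount=[0, 1] * 10))
result = drop_half_zeros_np(df, seed=0)
print(len(result))  # 15
```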

time test

(timing plot comparing the approaches; image not preserved)
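To reproduce a comparison like this on your own machine, here is a rough timing sketch using timeit. The frame size (10,000 rows, half of them zero) and the repeat counts are assumptions of mine. The numbers will vary with hardware, pandas version and data size, so treat them as indicative only.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({"amount": np.tile([0, 1], 5000)})  # 10,000 rows, half zeros


def with_query():
    return df.drop(df.query("amount == 0").sample(frac=.5).index)


def with_mask():
    return df.drop(df[df.amount == 0].sample(frac=.5).index)


def with_numpy():
    iszero = df.amount.values == 0
    idx = np.arange(iszero.shape[0])
    keep_these = np.random.choice(idx[iszero], int(iszero.sum() * .5), replace=False)
    return df.iloc[np.sort(np.concatenate([idx[~iszero], keep_these]))]


timings = {}
for func in (with_query, with_mask, with_numpy):
    # best of 3 runs of 20 calls each, reported as seconds per call
    timings[func.__name__] = min(timeit.repeat(func, number=20, repeat=3)) / 20

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.3f} ms per call")
```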

Per the comment from @tomcy, you can pass inplace=True to remove the rows from df directly, without having to reassign df:

df.drop(df.query('amount == 0').sample(frac=.5).index, inplace=True)
df

    amount
1        1
2        0
3        1
5        1
6        0
7        1
8        0
9        1
10       0
11       1
12       0
13       1
15       1
17       1
19       1
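One thing to watch with inplace=True: drop then returns None and modifies df itself, so do not assign the result back to df. A short sketch (the variable names to_drop and returned are mine):

```python
import pandas as pd

df = pd.DataFrame(dict(amount=[0, 1] * 10))
to_drop = df.query('amount == 0').sample(frac=.5).index

returned = df.drop(to_drop, inplace=True)  # mutates df, returns None

print(returned)  # None
print(len(df))   # 15
```

Writing df = df.drop(..., inplace=True) would therefore replace df with None.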

Upvotes: 3
