Reputation: 5529
I have a pandas dataframe df which contains a column amount. For many rows, the amount is zero. I want to randomly remove 50% of the rows where the amount is zero, keeping all rows where amount is nonzero. How can I do this?
Upvotes: 3
Views: 1481
Reputation: 30424
A minor tweak on @piRSquared's answer (using a boolean selection instead of query):
df.drop( df[df.amount == 0].sample(frac=.5).index )
It's about twice as fast as using query, but 3x slower than the numpy way.
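To check the relative timings on your own data, something like the following sketch could be used. The 100,000-row frame, the function names, and the repeat count are assumptions for illustration, not the setup behind the figures quoted above. The actual ratios will likely vary with frame size and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical test frame: half of the rows have amount == 0.
df = pd.DataFrame(dict(amount=[0, 1] * 50_000))


def with_boolean_mask(df):
    # Boolean selection, then drop a random half of the zero rows.
    return df.drop(df[df.amount == 0].sample(frac=.5).index)


def with_query(df):
    # Same idea, selecting the zero rows with query instead.
    return df.drop(df.query('amount == 0').sample(frac=.5).index)


def with_numpy(df):
    # Work on positions: keep all nonzero rows plus a random half of the zero rows.
    iszero = df.amount.values == 0
    idx = np.arange(iszero.shape[0])
    keep = np.random.choice(idx[iszero], int(iszero.sum() * .5), replace=False)
    return df.iloc[np.sort(np.concatenate([idx[~iszero], keep]))]


runs = 20  # arbitrary repeat count
for fn in (with_boolean_mask, with_query, with_numpy):
    seconds = timeit.timeit(lambda: fn(df), number=runs)
    print(f"{fn.__name__}: {seconds / runs * 1e3:.2f} ms per call")
```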
Upvotes: 2
Reputation: 294228
pandas
Using query + sample
df.drop(df.query('amount == 0').sample(frac=.5).index)
Consider the dataframe df
import pandas as pd
df = pd.DataFrame(dict(amount=[0, 1] * 10))
df.drop(df.query('amount == 0').sample(frac=.5).index)
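Since the sample is random, the rows removed differ from run to run. If a reproducible result is wanted, one option is to pass random_state to sample. The sketch below does that on the same example frame and checks that only zero rows were removed. The seed value 0 and the names to_drop and result are arbitrary choices made for illustration.

```python
import pandas as pd

df = pd.DataFrame(dict(amount=[0, 1] * 10))

# Fixing random_state makes the sampled rows repeatable; the seed 0 is arbitrary.
to_drop = df.query('amount == 0').sample(frac=.5, random_state=0).index
result = df.drop(to_drop)

# Every nonzero row survives, and half of the ten zero rows are gone.
print((result.amount != 0).sum())  # 10
print((result.amount == 0).sum())  # 5
```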
numpy
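import numpy as np
iszero = df.amount.values == 0   # boolean mask marking the zero rows
count_zeros = iszero.sum()
idx = np.arange(iszero.shape[0])  # positional index of every row
# randomly choose half of the zero rows to keep (the other half is dropped)
keep_these = np.random.choice(idx[iszero], int(count_zeros * .5), replace=False)
# keep all nonzero rows plus the sampled zero rows, in their original order
df.iloc[np.sort(np.concatenate([idx[~iszero], keep_these]))]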
amount
1 1
2 0
3 1
5 1
6 0
7 1
8 0
9 1
10 0
11 1
12 0
13 1
15 1
17 1
19 1
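The same positional approach can also be written with NumPy's newer Generator interface, which makes seeding straightforward. This is a variant sketch rather than the code timed above: np.random.default_rng and np.flatnonzero are standard NumPy functions, while the seed 0 and the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(amount=[0, 1] * 10))

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility

iszero = df.amount.values == 0
zero_positions = np.flatnonzero(iszero)  # positions of the zero rows

# Randomly pick half of the zero rows to keep, without replacement.
keep_zeros = rng.choice(zero_positions, size=len(zero_positions) // 2, replace=False)

# Combine with all nonzero positions and restore the original row order.
keep = np.sort(np.concatenate([np.flatnonzero(~iszero), keep_zeros]))
result = df.iloc[keep]
```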
time test
Per the comment from @tomcy, you can use the parameter inplace=True to remove the rows from df without having to reassign df:
df.drop(df.query('amount == 0').sample(frac=.5).index, inplace=True)
df
amount
1 1
2 0
3 1
5 1
6 0
7 1
8 0
9 1
10 0
11 1
12 0
13 1
15 1
17 1
19 1
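One thing to watch with inplace=True is that drop then returns None and modifies df directly, so the result should not be assigned back to df. A small sketch on the same example frame (the names to_drop and returned are illustrative):

```python
import pandas as pd

df = pd.DataFrame(dict(amount=[0, 1] * 10))

to_drop = df.query('amount == 0').sample(frac=.5).index
returned = df.drop(to_drop, inplace=True)  # mutates df, returns None

print(returned)  # None
print(len(df))   # 15
```

Writing df = df.drop(..., inplace=True) would therefore replace df with None.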
Upvotes: 3