Reputation: 73
I have a pandas dataframe, something like :
df = pd.read_csv('fruit.csv')
print(df)
fruitname quant
0 apple 10
1 apple 11
2 apple 13
3 banana 10
4 banana 20
5 banana 30
6 banana 40
7 pear 10
8 pear 102
9 pear 1033
10 pear 1012
11 pear 101
12 pear 100
13 pear 1044
14 orange 10
I want to remove the last entry, PER FRUIT, if that fruit has an odd (uneven) number of entries (%2 == 1). Without looping through the dataframe. So the end result of the above would be:
-- remove the last apple, since apple occurs 3 times -- remove the last pear -- remove the last (only) orange
resulting in:
fruitname quant
0 apple 10
1 apple 11
2 banana 10
3 banana 20
4 banana 30
5 banana 40
6 pear 10
7 pear 102
8 pear 1033
9 pear 1012
10 pear 101
11 pear 100
Is this possible? Or do I have to loop through the DF? I've been googling for 4 days, and just can't figure out how to do this.
Upvotes: 7
Views: 2823
Reputation: 1467
You could use the apply function:
def remove_last_odd_row(fr):
nrow = fr.shape[0]
if nrow % 2 > 0:
return fr[:(nrow - 1)]
else:
return fr
fr = fr.groupby("fruitname").apply(remove_last_odd_row).reset_index(drop=True)
Upvotes: 1
Reputation: 1528
I'm not very familiar with pandas but here is an answer.
for fruit in pd.unique(df.fruitname):
df1=df[df.fruitname==fruit]
if len(df1)%2 == 1:
df=df.drop(df1.last_valid_index())
Upvotes: 0
Reputation: 394159
Determine the number of items per fruit using value_counts
and build a list of them based on whether there are an odd number.
We can achieve this by just using %
modulus operator to generate either a 1
or 0
, cast this using astype
to create a boolean mask.
Use the boolean mask to mask the index of value_counts
.
Now you have a list of fruit, iterate over each fruit by filtering the df and get the last index label using iloc[-1]
and .name
attribute and append this to a list.
Now drop
these labels in the list:
In [393]:
fruits = df['fruitname'].value_counts().index[(df['fruitname'].value_counts() % 2).astype(bool)]
idx = []
for fruit in fruits:
idx.append(df[df['fruitname']==fruit].iloc[-1].name)
df.drop(idx)
Out[393]:
fruitname quant
0 apple 10
1 apple 11
3 banana 10
4 banana 20
5 banana 30
6 banana 40
7 pear 10
8 pear 102
9 pear 1033
10 pear 1012
11 pear 101
12 pear 100
Breaking the above down:
In [394]:
df['fruitname'].value_counts()
Out[394]:
pear 7
banana 4
apple 3
orange 1
Name: fruitname, dtype: int64
In [398]:
df['fruitname'].value_counts() % 2
Out[398]:
pear 1
banana 0
apple 1
orange 1
Name: fruitname, dtype: int64
In [399]:
fruits = df['fruitname'].value_counts().index[(df['fruitname'].value_counts() % 2).astype(bool)]
fruits
Out[399]:
Index(['pear', 'apple', 'orange'], dtype='object')
In [401]:
for fruit in fruits:
print(df[df['fruitname']==fruit].iloc[-1].name)
13
2
14
Actually you can use last_valid_index
instead of iloc[-1].name
so the following would work:
fruits = df['fruitname'].value_counts().index[(df['fruitname'].value_counts() % 2).astype(bool)]
idx = []
for fruit in fruits:
idx.append(df[df['fruitname']==fruit].last_valid_index())
df.drop(idx)
Upvotes: 6
Reputation: 353359
An alternative approach to EdChum's, which uses groupby
:
>>> grouped = df.groupby("fruitname")["fruitname"]
>>> lengths = grouped.transform(len)
>>> df.loc[~((lengths % 2 == 1) & (grouped.cumcount() == lengths-1))]
fruitname quant
0 apple 10
1 apple 11
3 banana 10
4 banana 20
5 banana 30
6 banana 40
7 pear 10
8 pear 102
9 pear 1033
10 pear 1012
11 pear 101
12 pear 100
This works by using transform
(and cumcount
, which kind of behaves like a type of transform because it broadcasts up to the original index) to give us a frame-length series we can work with:
>>> lengths
0 3
1 3
2 3
3 4
4 4
5 4
6 4
7 7
8 7
9 7
10 7
11 7
12 7
13 7
14 1
Name: fruitname, dtype: object
>>> grouped.cumcount()
0 0
1 1
2 2
3 0
4 1
5 2
6 3
7 0
8 1
9 2
10 3
11 4
12 5
13 6
14 0
dtype: int64
Upvotes: 5