Reputation: 85663
I can not find a way of getting this done.
I have a DataFrame with n rows .
Some indexes id are duplicated (repeated twice or more).
This is a simplified example:
>>> import pandas as p
>>> df = p.DataFrame({'id':['a','a','b', 'b'], 'test':['e','y','g', 's'], 'data':['re','rf','rg', 'rn']})
>>> df.set_index('id')
test data
id
a e re
a y rf
b g rg
b s rn
I need to remove duplicate indexes to keep just one row. Dont mind if it is the first, the last or the middle one, except when the value in the test column is included in a list of preferred values.
preferred_values = ['x', 'y', 'z']
If this is the case, and considering my example, the row with the preferred value y must be kept.
Thus, a possible outcome would be:
test data
id
a y rf
b s rn
I have considered groupby
but I can not get the way to aggregate taking into account the condition of preferred values.
I also tried working with the dataframes created in the groupby process, like:
gb = df.groupby('id')
for name, group in gb:
if len(group) > 1:
print name
print group
# modify group here
Upvotes: 1
Views: 86
Reputation: 60756
I like your groupby approach. There may be an easier or more Pandas-esqe way, but this brute seems to get it done:
df2 = p.DataFrame()
for name, group in gb:
if len(group) > 1:
group['keeper'] = group.test.isin(preferred_values)
newindex = group[ group.keeper==True ].head(1).copy()
if len(newindex) < 1:
newindex = group.head(1).copy()
df2 = df2.append(newindex[['data','id','test']])
There's some implicit rules in there. If more than one index is in the preferred_values list, it takes the first one. If no index is in the preferred list, it takes the first one. I think that lines up with your logic.
Thanks for providing example data, it makes things so much easier.
Upvotes: 1