Reputation: 5381
I have a dataframe and I need to drop rows based on a counter.
The dataframe looks like:
column1 column2
id
1 0.974600 0.400304
2 0.499050 0.546998
3 0.245399 0.675422
4 0.109111 0.664372
4 0.715271 0.169065
4 0.274887 0.072359
4 0.331148 0.317341
5 0.404076 0.347777
5 0.717883 0.763131
The counter for this example has keys equal to the index values and values equal to the number of rows that need to be dropped for that index.
Counter({1: 1, 2: 1, 3: 1, 4: 2, 5: 1})
I've tried to drop the rows using a loop, but I'm getting an error.
for k,v in count.iteritems():
    del t.ix[k][:v]
This is the error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-102-33c0a6ba6f58> in <module>()
----> 1 del t.ix[k][:v]
2
C:\Anaconda2\lib\site-packages\pandas\core\generic.pyc in __delitem__(self, key)
1788 # there was no match, this call should raise the appropriate
1789 # exception:
-> 1790 self._data.delete(key)
1791
1792 # delete from the caches
C:\Anaconda2\lib\site-packages\pandas\core\internals.pyc in delete(self, item)
3647 Delete selected item (items if non-unique) in-place.
3648 """
-> 3649 indexer = self.items.get_loc(item)
3650
3651 is_deleted = np.zeros(self.shape[0], dtype=np.bool_)
C:\Anaconda2\lib\site-packages\pandas\core\indexes\base.pyc in get_loc(self, key, method, tolerance)
2391 key = _values_from_object(key)
2392 try:
-> 2393 return self._engine.get_loc(key)
2394 except KeyError:
2395 return self._engine.get_loc(self._maybe_cast_indexer(key))
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:5239)()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:4792)()
TypeError: 'slice(None, 2, None)' is an invalid key
How can I accomplish this task so that the final df looks like:
column1 column2
id
4 0.274887 0.072359
4 0.331148 0.317341
5 0.717883 0.763131
Upvotes: 1
Views: 105
Reputation: 2110
If you want to avoid looping over the dataframe you can use merge to find the rows to drop:
df = df.reset_index()
df['grp_counter'] = df.groupby('id').cumcount()+1
id column1 column2 grp_counter
0 1 0.974600 0.400304 1
1 2 0.499050 0.546998 1
2 3 0.245399 0.675422 1
3 4 0.109111 0.664372 1
4 4 0.715271 0.169065 2
5 4 0.274887 0.072359 3
6 4 0.331148 0.317341 4
7 5 0.404076 0.347777 1
8 5 0.717883 0.763131 2
selector = pd.Series({1: 1, 2: 1, 3: 1, 4: 2, 5: 1}).rename('count_select').reset_index()
selector['keep'] = False
df = df[df.merge(selector, left_on=['id','grp_counter'], right_on=['index','count_select'], how='outer')['keep'].fillna(True)]
df = df.drop('grp_counter', axis=1).set_index('id')
column1 column2
id
4 0.109111 0.664372
4 0.274887 0.072359
4 0.331148 0.317341
5 0.717883 0.763131
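Note that the merge above only flags the row whose grp_counter equals the count, which is why the first row for id 4 survives in the output. If the goal is to drop the first N rows per id, as in the question's expected result, one possible sketch is to compare each row's position within its group against the count for its id. The names position, n_drop and result are illustrative, and the snippet assumes the index is named id as in the question.

```python
from collections import Counter

import pandas as pd

# Sample data copied from the question; the index is named "id".
df = pd.DataFrame(
    {
        "column1": [0.974600, 0.499050, 0.245399, 0.109111, 0.715271,
                    0.274887, 0.331148, 0.404076, 0.717883],
        "column2": [0.400304, 0.546998, 0.675422, 0.664372, 0.169065,
                    0.072359, 0.317341, 0.347777, 0.763131],
    },
    index=pd.Index([1, 2, 3, 4, 4, 4, 4, 5, 5], name="id"),
)
count = Counter({1: 1, 2: 1, 3: 1, 4: 2, 5: 1})

# 0-based position of each row within its id group.
position = df.groupby(level="id").cumcount().to_numpy()

# Number of rows to drop for each row's id (0 if the id is not in count).
n_drop = df.index.map(lambda key: count.get(key, 0)).to_numpy()

# Keep a row only once its position is past the rows to be dropped.
result = df[position >= n_drop]
print(result)
```

This avoids both the merge and the helper column, so there is nothing to clean up afterwards.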
Upvotes: 1
Reputation: 16079
Using del on a DataFrame feels odd to me, so I would like to avoid it if possible. To get around that, I would recommend finding all rows for a given key and keeping the last rows.shape[0] - v entries, dropping the rest.
df
col1 col2
1 0.974600 0.400304
2 0.499050 0.546998
3 0.245399 0.675422
4 0.109111 0.664372
4 0.715271 0.169065
4 0.274887 0.072359
4 0.331148 0.317341
5 0.404076 0.347777
5 0.717883 0.763131
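import pandas as pd

df2 = df.copy()
for k, v in c.items():
    rows = df2.loc[df2.index == k]        # all rows for this key
    df2.drop(k, inplace=True)             # remove every row for this key
    if rows.shape[0] - v > 0:
        retain = rows.iloc[v:]            # skip the first v rows, keep the last rows.shape[0] - v
        # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
        df2 = pd.concat([df2, retain])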
df2
col1 col2
4 0.274887 0.072359
4 0.331148 0.317341
5 0.717883 0.763131
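The same idea of keeping only the rows after the first v for each key can also be sketched without mutating the frame inside the loop, by collecting the retained slices and concatenating once at the end. drop_first_n is a hypothetical helper name, not a pandas function, and the sketch assumes the keys live in the index as in the example above.

```python
from collections import Counter

import pandas as pd


def drop_first_n(df, counts):
    """Return a copy of df without the first counts[key] rows of each index value."""
    kept = []
    # sort=False keeps the groups in their order of first appearance.
    for key, rows in df.groupby(level=0, sort=False):
        n = counts.get(key, 0)  # keys missing from counts lose no rows
        if len(rows) > n:
            kept.append(rows.iloc[n:])  # skip the first n rows, keep the rest
    # If every group was dropped entirely, return an empty frame with the same columns.
    return pd.concat(kept) if kept else df.iloc[0:0]


df = pd.DataFrame(
    {
        "col1": [0.974600, 0.499050, 0.245399, 0.109111, 0.715271,
                 0.274887, 0.331148, 0.404076, 0.717883],
        "col2": [0.400304, 0.546998, 0.675422, 0.664372, 0.169065,
                 0.072359, 0.317341, 0.347777, 0.763131],
    },
    index=[1, 2, 3, 4, 4, 4, 4, 5, 5],
)
c = Counter({1: 1, 2: 1, 3: 1, 4: 2, 5: 1})

df2 = drop_first_n(df, c)
print(df2)
```

Because the original df is never modified, no copy is needed up front.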
Upvotes: 0