Daniel

Reputation: 5381

delete rows by duplicate indexes

I have a dataframe and I need to drop rows based on a counter.

The dataframe looks like:

    column1     column2
id      
1   0.974600    0.400304
2   0.499050    0.546998
3   0.245399    0.675422
4   0.109111    0.664372
4   0.715271    0.169065
4   0.274887    0.072359
4   0.331148    0.317341
5   0.404076    0.347777
5   0.717883    0.763131

The counter for this example has keys equal to the index values and values equal to the number of rows that need to be dropped for that index.

Counter({1: 1, 2: 1, 3: 1, 4: 2, 5: 1})

I've tried to drop the rows using a loop, but I'm getting an error.

for k,v in count.iteritems():
    del t.ix[k][:v]

This is the error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-102-33c0a6ba6f58> in <module>()
----> 1 del t.ix[k][:v]
      2 

C:\Anaconda2\lib\site-packages\pandas\core\generic.pyc in __delitem__(self, key)
   1788             # there was no match, this call should raise the appropriate
   1789             # exception:
-> 1790             self._data.delete(key)
   1791 
   1792         # delete from the caches

C:\Anaconda2\lib\site-packages\pandas\core\internals.pyc in delete(self, item)
   3647         Delete selected item (items if non-unique) in-place.
   3648         """
-> 3649         indexer = self.items.get_loc(item)
   3650 
   3651         is_deleted = np.zeros(self.shape[0], dtype=np.bool_)

C:\Anaconda2\lib\site-packages\pandas\core\indexes\base.pyc in get_loc(self, key, method, tolerance)
   2391             key = _values_from_object(key)
   2392             try:
-> 2393                 return self._engine.get_loc(key)
   2394             except KeyError:
   2395                 return self._engine.get_loc(self._maybe_cast_indexer(key))

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:5239)()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:4792)()

TypeError: 'slice(None, 2, None)' is an invalid key

How can I accomplish this task and end up with a final df that looks like:

    column1     column2
id      
4   0.274887    0.072359
4   0.331148    0.317341
5   0.717883    0.763131

Upvotes: 1

Views: 105

Answers (2)

P.Tillmann

Reputation: 2110

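If you want to avoid looping over the dataframe, you can use merge to find the rows to drop. Note that this drops only the row whose position within its id group equals the count, so for id 4 the first row is kept: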

df = df.reset_index()
df['grp_counter'] = df.groupby('id').cumcount()+1

   id   column1   column2  grp_counter
0   1  0.974600  0.400304            1
1   2  0.499050  0.546998            1
2   3  0.245399  0.675422            1
3   4  0.109111  0.664372            1
4   4  0.715271  0.169065            2
5   4  0.274887  0.072359            3
6   4  0.331148  0.317341            4
7   5  0.404076  0.347777            1
8   5  0.717883  0.763131            2

selector = pd.Series({1: 1, 2: 1, 3: 1, 4: 2, 5: 1}).rename('count_select').reset_index()
selector['keep'] = False 
df = df[df.merge(selector, left_on=['id','grp_counter'], right_on=['index','count_select'], how='outer')['keep'].fillna(True)]
df = df.drop('grp_counter', axis=1).set_index('id')

     column1   column2
id                    
4   0.109111  0.664372
4   0.274887  0.072359
4   0.331148  0.317341
5   0.717883  0.763131
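
If the goal is to drop the first v rows for each id, as in the question's expected output, a boolean mask built from each row's position within its group may be enough. The following is a minimal sketch, not the answerer's method; the names `count`, `pos`, `n_drop` and `result` are chosen here for illustration, and the data is rebuilt from the question so the snippet runs on its own.

```python
from collections import Counter

import numpy as np
import pandas as pd

# Rebuild the example frame from the question (duplicate values in the "id" index).
df = pd.DataFrame(
    {
        "column1": [0.974600, 0.499050, 0.245399, 0.109111, 0.715271,
                    0.274887, 0.331148, 0.404076, 0.717883],
        "column2": [0.400304, 0.546998, 0.675422, 0.664372, 0.169065,
                    0.072359, 0.317341, 0.347777, 0.763131],
    },
    index=pd.Index([1, 2, 3, 4, 4, 4, 4, 5, 5], name="id"),
)
count = Counter({1: 1, 2: 1, 3: 1, 4: 2, 5: 1})

# 0-based position of every row within its id group.
pos = df.groupby(level="id").cumcount().to_numpy()

# Number of rows to drop for each row's id; a Counter returns 0 for missing keys.
n_drop = np.array([count[k] for k in df.index])

# Keep a row only once its position is past the rows to be dropped.
result = df[pos >= n_drop]
print(result)
```

Comparing plain NumPy arrays sidesteps any label alignment on the duplicated index, and ids that are missing from the counter are left untouched.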

Upvotes: 1

Grr

Reputation: 16079

Using del on a DataFrame feels odd to me so I would like to avoid it if possible. To get around that I would recommend finding all rows of a given key and keeping the last rows.shape[0] - v entries, dropping the rest.

df
       col1      col2
1  0.974600  0.400304
2  0.499050  0.546998
3  0.245399  0.675422
4  0.109111  0.664372
4  0.715271  0.169065
4  0.274887  0.072359
4  0.331148  0.317341
5  0.404076  0.347777
5  0.717883  0.763131

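import pandas as pd

df2 = df.copy()
for k, v in c.items():  # c is the Counter from the question
    rows = df2.loc[df2.index == k]
    df2 = df2.drop(k)
    if rows.shape[0] - v > 0:
        # skip the first v rows and keep the last rows.shape[0] - v
        retain = rows.iloc[v:]
        # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
        df2 = pd.concat([df2, retain])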

df2
       col1      col2
4  0.274887  0.072359
4  0.331148  0.317341
5  0.717883  0.763131

Upvotes: 0
