Parsyk
Parsyk

Reputation: 341

remove rows with duplicated values of two columns that are shifted in pandas data frame?

I have a following dataframe:

import pandas as pd

data = [['1044', '1924'], ['1044', '1926'], ['1044', '1927'], ['1044', '1928'], ['1048', '1924'], ['1048', '1926'], ['1048', '1927'], ['1048', '1928'], ['1051', '1924'], ['1051', '1926'], ['1051', '1927'], ['1051', '1928'], ['1058', '']]  
df = pd.DataFrame(data, columns = ['Col1', 'Col2']) 

enter image description here

I would like to reduce the dataframe as shown here:

enter image description here

I have no clue how to do that, but it should be working as follows:

Hopefully it does exist a easier way how to perform it.

Upvotes: 3

Views: 45

Answers (3)

mozway
mozway

Reputation: 260825

Here is another method that should work irrespective of the order, with an arbitrary number of columns, and which should be easy to customize if more complex selections are needed:

(df.assign(x=df.groupby('Col1').ngroup(),
           y=df.groupby('Col1').cumcount())
  .query('x==y') # this can be updated in case more complex selections are required
  .drop(['x','y'], axis=1)
)

Output:

   Col1  Col2
0  1044  1924
4  1048  1926
8  1051  1927

edit: method for non "product-like" data

df.assign(x=df.groupby('Col1', dropna=False).ngroup(),
          y=df.groupby('Col1', dropna=False).cumcount())
  .query('x>=y')
  .drop_duplicates('Col1', keep='last')
  .drop(['x','y'], axis=1)
)

Output:

    Col1  Col2
0   1044  1924
5   1048  1926
10  1051  1927
12  1058      

NB. I saved x and y as intermediate columns here to be able to see how it works and to debug, but you can also use those operations directly for slicing

(df[df.groupby('Col1', dropna=False).ngroup().ge(
    df.groupby('Col1', dropna=False).cumcount())]
  .drop_duplicates('Col1', keep='last')
)

Upvotes: 2

wwnde
wwnde

Reputation: 26676

I think, cumcount each column and check similarities between the outputs, Code below;

df =df.assign(Date_cumcount=df.groupby('Date').cumcount(),Value_cumcount = df.groupby('Value').cumcount())
out =df[df['Date_cumcount']==df['Value_cumcount']].drop(['Date_cumcount','Value_cumcount'],1)



     Date Value
0   1044  1924
5   1048  1926
10  1051  1927
12  1058  

Upvotes: 1

sammywemmy
sammywemmy

Reputation: 28689

You can build a new dataframe instead, with the unique values:

pd.DataFrame({key:pd.unique(value) for key, value in df.items()})
Out[252]: 
   Col1  Col2
0  1044  1924
1  1048  1926
2  1051  1927

Upvotes: 2

Related Questions