xzk
xzk

Reputation: 877

Panda Dataframe Find rows which does not have equivalent value in the DataFrame

DataFrame:

    column1     column2
0   some_data   string1
1   some_data   string1
2   some_data   string2
3   some_data   string3
4   some_data   string2
5   some_data   string4
5   some_data   string4
...
20k+ rows in total

Explanation: For most rows, column2 data appear in pairs. I want to find out rows that do not have paired data (e.g. string3)

Expected Output:

   column1    column2
0   some_data  string3

Any solutions to find out such rows? thanks!

Upvotes: 0

Views: 44

Answers (2)

mozway
mozway

Reputation: 261000

Here is an alternative to take only a pair number of rows, working as a pipeline:

(df.groupby('column2', as_index=False)
   .apply(lambda d: d.iloc[:2*(len(d)//2)]) # transform any odd to previous even
   .droplevel(0)
)

output:

     column1  column2
0  some_data  string1
1  some_data  string1
2  some_data  string2
4  some_data  string2
5  some_data  string4
5  some_data  string4

Upvotes: 0

jezrael
jezrael

Reputation: 862901

If possible simplify problem for found all rows without dupes by column2 use:

df1 = df[~df['column2'].duplicated(keep=False)]

If need test counts and filter all rows without pairs (2):

df2 = df[df.groupby('column2')['column2'].transform('size').ne(2)]

Also if need test all pairs, it means 2, 4, 6, 8... use:

df3 = df[df.groupby('column2')['column2'].transform('size') % 2 == 1]

Upvotes: 1

Related Questions