Lee

Reputation: 21

Pandas dataframe: remove duplicates in column A when column B is NaN

So the situation is: I have a dataframe with columns [A, B, C, D]. When both A and B in a row are not NaN (A can never be NaN), don't remove anything. But if a value of A appears in some row together with a non-NaN B, and there is another row with the same A where B is NaN, then that NaN row needs to be removed. The other situation is when there is no such [A, B] combination and we only have one row where A is not null and B is null; in that case the row cannot be removed as a duplicate.

e.g.

  A     B      C        D
[Tom,  Jane,   cat,     dog],
[Tom,  Zack,   monkey,  sheep],
[Tom,  NaN,    fish,    dolphin]

So in this case the first and second rows shouldn't be removed, but the third row should be, because Tom already exists in column A and its value in column B is NaN.

Another situation is

  A      B     C     D
[Jack,  NaN,  fish,  dolphin]

In the whole dataframe we only have one row where the value in column A is Jack, so no matter whether B is NaN or not, we don't remove this row.
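For reference, here is a minimal reproducible version of the two examples combined into one dataframe (my own sketch of the data described above):

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': ['Tom', 'Tom', 'Tom', 'Jack'],
    'B': ['Jane', 'Zack', np.nan, np.nan],
    'C': ['cat', 'monkey', 'fish', 'fish'],
    'D': ['dog', 'sheep', 'dolphin', 'dolphin'],
})

# Expected result: drop row 2 (Tom also appears with a non-NaN B),
# keep row 3 (Jack appears only once, so its NaN row stays)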

Upvotes: 1

Views: 1019

Answers (3)

ffi23

Reputation: 121

Just a slight edit to @Brennan's excellent answer. For some reason, this

(row[nan_col] is np.nan)

did not work for me; it was not picking up the NaN values.

I had to use

(pd.isnull(row[nan_col]))

so the solution became

df = df[df.apply(lambda row: 
         not((pd.isnull(row[nan_col])) 
             & 
             (len(df[df[dup_col] == row[dup_col]]) > 1)
            ), axis=1)]
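In case it helps to see why the identity check misses some NaNs (just my own illustration, not part of either answer): NaN values that come from parsing or arithmetic are usually not the same object as np.nan, so checking with is fails even though the value is NaN, while pd.isnull() still catches it:

import numpy as np
import pandas as pd

print(np.nan is np.nan)          # True  -- the same singleton object
print(float('nan') is np.nan)    # False -- a different NaN object
print(pd.isnull(float('nan')))   # True  -- isnull tests the value, not object identity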

Upvotes: 0

Brennan

Reputation: 514

You can achieve your desired result using one line:

df = df[df.apply(lambda row: not((row['B'] is np.nan) & (len(df[df['A'] == row['A']]) > 1)), axis=1)]

Details

The solution here is to use df.apply() in conjunction with Python's lambda function.

Setup

import pandas as pd
import numpy as np

data = {
    'A':['Tom', 'Tom', 'Tom','Jack'],
    'B':['Jane', 'Zack', np.nan,np.nan],
    'C':['Jane', 'Bear' , 'Cat','Bear'],
    'D':['Jane', 'Bear' , 'Cat','Bear'],
    }

# Create the data frame
df = pd.DataFrame(data)

# Set columns to check for duplicate and np.nan
dup_col = 'A'
nan_col = 'B'

# Print df before filter
print(df.head())

      A     B     C     D
0   Tom  Jane  Jane  Jane
1   Tom  Zack  Bear  Bear
2   Tom   NaN   Cat   Cat
3  Jack   NaN  Bear  Bear

Use df.apply to apply a function along an axis, specifying axis=1 to apply the function to each row.

  • the lambda function lets us utilize the row variable
  • conditions inside are what you define as duplicate
  • i.e. the 'B' column is NaN and the 'A' value is a duplicate

I broke this into multiple lines to make it easier to follow, but it can really be done on one line.


df = df[
    df.apply(lambda row: 
    not(
        (row[nan_col] is np.nan) & (len(df[df[dup_col] == row[dup_col]]) > 1)
    ), axis=1)
    ]

# Print after filter
print(df.head())
      A     B     C     D
0   Tom  Jane  Jane  Jane
1   Tom  Zack  Bear  Bear
3  Jack   NaN  Bear  Bear
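As a side note (not part of the original answer, just a sketch that reuses df, dup_col and nan_col from the Setup above), the same filter can also be written without a per-row apply:

# Rows to drop: nan_col is NaN and the value in dup_col occurs more than once
dup_a = df.groupby(dup_col)[dup_col].transform('size') > 1
df = df[~(df[nan_col].isna() & dup_a)]

This avoids re-scanning the whole frame inside the lambda for every row, which matters once the dataframe gets large.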

Upvotes: 1

Joao Donasolo

Reputation: 379

This is the solution I found:

is_na = df['B'].isna() #This transformation (NaN -> True/False) is necessary to count

new_df = df[is_na].filter(['A'])
new_df['B'] = is_na #new_df has column A and column B with Trues and Falses
counting_nans = new_df.groupby('A')['B'].count() 

counting_nans has the number of NaNs grouped by values of column A:

>>> df

     A      B       C        D
0  Tom   Jane     Cat     Bear
1  Tom  Jenny  Monkey   Tortue
2  Tom    NaN    Fish      Cow
3  Zac    NaN     Dog  Penguin
>>> counting_nans

A
Tom    1
Zac    1
Name: B, dtype: int64 

In uniques, we will store how many times each value appears in column 'A'; these are the values that must be evaluated.

uniques = df['A'].value_counts()
>>> uniques

Tom    3
Zac    1
Name: A, dtype: int64 

Now, let's filter this out. If the number of times a value appears in column 'A' equals its number of NaNs in column 'B', those rows should not be deleted, so that value is dropped from the evaluation set. Values that appear only once in 'A' are dropped from the set too (it doesn't matter whether df['B'] is NaN at that row or not).

uniques.sort_index(inplace=True)
counting_nans.sort_index(inplace=True)

uniques = uniques[ uniques != counting_nans]  # drop values whose rows are all NaN in 'B'
uniques = uniques[ uniques > 1 ]              # drop values that appear only once in 'A'

condition = df['A'].isin(uniques.index) & df['B'].isna() 
#This is an array with Trues when df['A'] is in values to be evaluated and df['B'] is NaN
index_condition = condition.loc[condition == True].index #These are the indexes

df.drop(index_condition, inplace=True) #This eliminates the rows
>>> df

     A      B       C        D
0  Tom   Jane     Cat     Bear
1  Tom  Jenny  Monkey   Tortue
3  Zac    NaN     Dog  Penguin

Hope it helps! Let me know if my code is not clear. Also, I'm pretty sure there is an easier way, I'm kinda new to programming xD
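One caveat from my side (an edge case the example above does not hit): the comparison uniques != counting_nans only works when both Series end up with the same index, i.e. every value in column 'A' has at least one NaN row in column 'B'. If some value never has a NaN, pandas raises a ValueError because it can only compare identically-labeled Series. Reindexing the NaN counts before the comparison keeps it valid in that case, roughly:

# Align the NaN counts to every value seen in 'A',
# filling 0 for values that never have a NaN in 'B'
counting_nans = counting_nans.reindex(uniques.index, fill_value=0)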

Upvotes: 0
