Reputation: 21
So the situation is: I have a dataframe with columns [A, B, C, D]. When both A and B are not NaN (A can never be NaN), don't remove anything. But when some [A, B] combination exists and there is still another row where A is not null and B is null, then that row needs to be removed. The other situation is when there is no [A, B] combination and we only have one row where A is not null but B is null; in that case the row must not be removed.
e.g.
A B C D
[Tom, Jane, cat, dog],
[Tom, Zack, monkey, sheep],
[Tom, Nan, fish, dolphine]
So in this case the first and second rows shouldn't be removed, but the third row should be, because Tom already exists in column A and the value in column B is NaN.
Another situation is
A B C D
[Jack, Nan, fish, dolphine]
In the whole dataframe we only have one row where the value in column A is Jack, so no matter whether B is NaN or not, we don't remove this row.
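For reference, a reproducible version of the two examples combined might look like this (column names and values taken from the description above):
import pandas as pd
import numpy as np
# Combined example data from the question
df = pd.DataFrame({
    'A': ['Tom', 'Tom', 'Tom', 'Jack'],
    'B': ['Jane', 'Zack', np.nan, np.nan],
    'C': ['cat', 'monkey', 'fish', 'fish'],
    'D': ['dog', 'sheep', 'dolphine', 'dolphine'],
})
# Desired result: drop row 2 (Tom / NaN, because other Tom rows exist),
# but keep row 3 (Jack / NaN, because it is the only Jack row).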
Upvotes: 1
Views: 1019
Reputation: 121
Just a slight edit to @Brennan's excellent answer: for some reason,
(row[nan_col] is np.nan)
did not work for me, since it was not picking up NaN values.
I had to use
(pandas.isnull(row[nan_col]))
so the solution became
df = df[df.apply(lambda row:
    not(
        pandas.isnull(row[nan_col])
        & (len(df[df[dup_col] == row[dup_col]]) > 1)
    ), axis=1)]
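For anyone wondering why the identity check can miss NaNs: row[nan_col] is np.nan only matches the exact np.nan object, while NaNs produced elsewhere (e.g. from float operations or file parsing) are different float objects. pandas.isnull checks by value instead. A minimal sketch:
import numpy as np
import pandas
x = float('nan')              # a NaN that is not the np.nan object itself
print(x is np.nan)            # False: identity check misses it
print(pandas.isnull(x))       # True: value-based check catches it
print(pandas.isnull(np.nan))  # True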
Upvotes: 0
Reputation: 514
You can achieve your desired result using one line:
df = df[df.apply(lambda row: not((row[nan_col] is np.nan) & (len(df[df[dup_col] == row[dup_col]]) > 1)), axis=1)]
The solution here is to use df.apply() in conjunction with Python's lambda function.
import pandas as pd
import numpy as np
data = {
'A':['Tom', 'Tom', 'Tom','Jack'],
'B':['Jane', 'Zack', np.nan,np.nan],
'C':['Jane', 'Bear' , 'Cat','Bear'],
'D':['Jane', 'Bear' , 'Cat','Bear'],
}
# Create the data frame
df = pd.DataFrame(data)
# Set columns to check for duplicate and np.nan
dup_col = 'A'
nan_col = 'B'
# Print df before filter
print(df.head())
A B C D
0 Tom Jane Jane Jane
1 Tom Zack Bear Bear
2 Tom NaN Cat Cat
3 Jack NaN Bear Bear
Use df.apply to apply a function along an axis, specifying axis=1 to apply the function to each row.
I broke this into multiple lines to make it easier to follow, but it can really be done on one line.
df = df[
df.apply(lambda row:
not(
(row[nan_col] is np.nan) & (len(df[df[dup_col] == row[dup_col]]) > 1)
), axis=1)
]
# Print after filter
print(df.head())
A B C D
0 Tom Jane Jane Jane
1 Tom Zack Bear Bear
3 Jack NaN Bear Bear
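For larger frames, a vectorized sketch of the same condition (not part of the original answer) avoids the per-row apply: a row is kept unless its nan_col is NaN and its dup_col value occurs more than once.
# Vectorized equivalent (sketch): group sizes replace the per-row len() lookup
mask = ~(df[nan_col].isna() & df.groupby(dup_col)[dup_col].transform('size').gt(1))
df = df[mask]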
Upvotes: 1
Reputation: 379
This is the solution I found:
is_na = df['B'].isna() #This transformation (NaN -> True/False) is necessary to count
new_df = df[is_na].filter(['A'])
new_df['B'] = is_na #new_df has column A and column B with Trues and Falses
counting_nans = new_df.groupby('A')['B'].count()
counting_nans has the number of NaNs grouped by values of column A:
>>> df
A B C D
0 Tom Jane Cat Bear
1 Tom Jenny Monkey Tortue
2 Tom NaN Fish Cow
3 Zac NaN Dog Penguin
>>> counting_nans
A
Tom 1
Zac 1
Name: B, dtype: int64
In uniques, we will store all the values that must be evaluated.
uniques = df['A'].value_counts()
>>> uniques
Tom 3
Zac 1
Name: A, dtype: int64
Now, let's filter this out. If a value appears in column 'A' exactly as many times as it has NaNs in column 'B' (i.e. every one of its rows has NaN in 'B'), those rows should not be deleted. Values that appear only once in 'A' are excluded as well, no matter whether df['B'] is NaN at that specific row or not.
uniques.sort_index(inplace=True)
counting_nans.sort_index(inplace=True)
uniques = uniques[ uniques != counting_nans]
uniques = uniques[ uniques > 1 ]
condition = df['A'].isin(uniques.index) & df['B'].isna()
#This is an array with Trues when df['A'] is in values to be evaluated and df['B'] is NaN
index_condition = condition.loc[condition == True].index #These are the indexes
df.drop(index_condition, inplace=True) #This eliminates the rows
>>> df
A B C D
0 Tom Jane Cat Bear
1 Tom Jenny Monkey Tortue
3 Zac NaN Dog Penguin
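Putting the steps above together, a condensed sketch of the same approach (the function name is just for illustration, and the reindex is added so that 'A' values with no NaNs at all don't break the comparison):
def drop_nan_duplicates(df, dup_col='A', nan_col='B'):
    # Count NaNs in nan_col per value of dup_col
    nan_counts = df[df[nan_col].isna()].groupby(dup_col)[dup_col].count()
    # Count total occurrences of each dup_col value
    totals = df[dup_col].value_counts()
    nan_counts = nan_counts.reindex(totals.index, fill_value=0)
    # Values to evaluate: appear more than once and are not entirely NaN in nan_col
    to_check = totals[(totals > 1) & (totals != nan_counts)]
    # Drop rows whose dup_col value is in to_check and whose nan_col is NaN
    condition = df[dup_col].isin(to_check.index) & df[nan_col].isna()
    return df[~condition]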
Hope it helps! Let me know if my code is not clear. Also, I'm pretty sure there is an easier way; I'm kinda new to programming xD
Upvotes: 0