Reputation: 1088
Good people, still learning Python. I have a very large dataframe of around 1 million rows, and I want to delete rows that have missing values in certain columns. That is, after locating rows with missing values in those columns, I should be able to delete the entire row. Look at the sample dataframe:
import pandas as pd
data = {'A':['NW', 'NB', 'UK', 'CAN'],'B':['Tom', 'nick', 'krish', 'jack'], 'C':[20, '', 19, ''],'D':[20, '', 19, ''],'E':[20, '', 19, ''],'F':[20, '', 19, '']}
df = pd.DataFrame(data)
print(df)
In this dataframe, I want to identify rows with missing values anywhere from column C to column F and then delete the entire row, which means I have to delete row 1 and row 3. I have tried this, but it isn't working:
df.dropna(subset=['C','D','E','F'], how='all', inplace = True)
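A quick check (a sketch on the sample data) shows why that call does nothing: pandas does not treat the empty string '' as missing, so dropna finds no nulls at all.

```python
import pandas as pd

data = {'A': ['NW', 'NB', 'UK', 'CAN'], 'B': ['Tom', 'nick', 'krish', 'jack'],
        'C': [20, '', 19, ''], 'D': [20, '', 19, ''],
        'E': [20, '', 19, ''], 'F': [20, '', 19, '']}
df = pd.DataFrame(data)

# '' is a valid string, so pandas sees no missing values in C-F
print(df[['C', 'D', 'E', 'F']].isna().sum().sum())  # 0

# hence dropna(how='all') keeps every row
print(len(df.dropna(subset=['C', 'D', 'E', 'F'], how='all')))  # 4
```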
The EXPECTED OUTPUT should be a table like this:
import pandas as pd
data = {'A':['NW', 'UK'],'B':['Tom', 'krish'], 'C':[20, 19],'D':[20, 19 ],'E':[20, 19 ],'F':[20, 19]}
df = pd.DataFrame(data)
print(df)
Upvotes: 2
Views: 1939
Reputation: 294576
Use pandas.DataFrame.mask to make the dataframe null anywhere a condition is True, then dropna with the subset argument. This is instead of using replace as mentioned by zipa and jezrael. There isn't any advantage to using this over the other methods; I just like using mask, and it also serves the community to show another method for accomplishing the task.
df.mask(df.eq('')).dropna(subset=['C', 'D', 'E', 'F'])
A B C D E F
0 NW Tom 20 20 20 20
2 UK krish 19 19 19 19
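Run end to end on the question's sample data, the intermediate result makes the mechanics clear (a sketch; mask turns every '' into NaN, then dropna filters on the subset):

```python
import pandas as pd

data = {'A': ['NW', 'NB', 'UK', 'CAN'], 'B': ['Tom', 'nick', 'krish', 'jack'],
        'C': [20, '', 19, ''], 'D': [20, '', 19, ''],
        'E': [20, '', 19, ''], 'F': [20, '', 19, '']}
df = pd.DataFrame(data)

# mask replaces cells where the condition is True with NaN;
# rows 1 and 3 now hold NaN in columns C-F
masked = df.mask(df.eq(''))
print(masked)

# dropna with how='any' (the default) on the subset drops those rows
result = masked.dropna(subset=['C', 'D', 'E', 'F'])
print(result)
```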
Upvotes: 4
Reputation: 27899
This should do it:
import numpy as np
df.replace(r'^\s*$', np.nan, regex=True).dropna()
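Note that the regex r'^\s*$' also catches whitespace-only cells, not just '', and that dropna here has no subset, so a NaN anywhere in the row drops it. A self-contained sketch on the question's data:

```python
import numpy as np
import pandas as pd

data = {'A': ['NW', 'NB', 'UK', 'CAN'], 'B': ['Tom', 'nick', 'krish', 'jack'],
        'C': [20, '', 19, ''], 'D': [20, '', 19, ''],
        'E': [20, '', 19, ''], 'F': [20, '', 19, '']}
df = pd.DataFrame(data)

# empty (or whitespace-only) strings become NaN, then NaN rows are dropped
out = df.replace(r'^\s*$', np.nan, regex=True).dropna()
print(out)
```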
Upvotes: 3
Reputation: 863791
One solution is to compare the list of columns against empty strings and keep the rows that have at least one value that is not '':
df = df[df[['C','D','E','F']].ne('').any(axis=1)]
Details:
print(df[['C','D','E','F']].ne(''))
C D E F
0 True True True True
1 False False False False
2 True True True True
3 False False False False
print(df[['C','D','E','F']].ne('').any(axis=1))
0 True
1 False
2 True
3 False
dtype: bool
An alternative is to invert the condition with ~:
df = df[~df[['C','D','E','F']].eq('').all(axis=1)]
Details:
print(df[['C','D','E','F']].eq(''))
C D E F
0 False False False False
1 True True True True
2 False False False False
3 True True True True
print(df[['C','D','E','F']].eq('').all(axis=1))
0 False
1 True
2 False
3 True
dtype: bool
print(~df[['C','D','E','F']].eq('').all(axis=1))
0 True
1 False
2 True
3 False
dtype: bool
print(df)
A B C D E F
0 NW Tom 20 20 20 20
2 UK krish 19 19 19 19
For your solution to work, it is necessary to replace the empty strings with missing values first:
import numpy as np
df = df.replace('', np.nan)
df.dropna(subset=['C','D','E','F'], how='all', inplace=True)
print(df)
A B C D E F
0 NW Tom 20.0 20.0 20.0 20.0
2 UK krish 19.0 19.0 19.0 19.0
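Notice the 20.0 in the output: replacing '' with NaN pushes the numeric columns to float. Once the NaN rows are gone you can cast back to integers; a sketch of the whole flow:

```python
import numpy as np
import pandas as pd

data = {'A': ['NW', 'NB', 'UK', 'CAN'], 'B': ['Tom', 'nick', 'krish', 'jack'],
        'C': [20, '', 19, ''], 'D': [20, '', 19, ''],
        'E': [20, '', 19, ''], 'F': [20, '', 19, '']}
df = pd.DataFrame(data)

df = df.replace('', np.nan)
df = df.dropna(subset=['C', 'D', 'E', 'F'], how='all')

# with the NaN rows dropped, C-F can be cast back to int
df[['C', 'D', 'E', 'F']] = df[['C', 'D', 'E', 'F']].astype(int)
print(df)
```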
Upvotes: 2