Reputation: 8638
df =
   Col1  Col2  Col3
      1   NaN     4
      2     5     4
      3     3   NaN
Given the dataframe df, I want to obtain a new dataframe df2 that does not contain NaN in the column Col2. This is the expected result:
df2 =
   Col1  Col2  Col3
      2     5     4
      3     3   NaN
I know that it's possible to use pandas.isnull and dropna; however, how do I specify the particular column that the filtering should be applied to?
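For reference, the example frame above can be rebuilt like this (values taken from the tables in the question), together with one way of producing the expected df2 — a sketch, as the answers below discuss in detail:

```python
import numpy as np
import pandas as pd

# Rebuild the question's example frame (NaN in Col2 and Col3)
df = pd.DataFrame({'Col1': [1, 2, 3],
                   'Col2': [np.nan, 5, 3],
                   'Col3': [4, 4, np.nan]})

# Keep only the rows where Col2 is not NaN
df2 = df.dropna(subset=['Col2'])
print(df2)
```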
Upvotes: 35
Views: 67581
Reputation: 5075
The simple implementation below follows on from the answers above, but shows how to filter out NaN rows in a specific column in place, and, for large dataframes, how to count the rows with NaN by column name before and after.
import pandas as pd
import numpy as np
df = pd.DataFrame([[1, np.nan, 'A100'],
                   [4, 5, 'A213'],
                   [7, 8, np.nan],
                   [10, np.nan, 'GA23']],
                  columns=['areaCode', 'Distance', 'accountCode'])
The dataframe:
   areaCode  Distance accountCode
0         1       NaN        A100
1         4       5.0        A213
2         7       8.0         NaN
3        10       NaN        GA23
Before: count rows with nan (for each column):
df.isnull().sum()
count by column:
areaCode 0
Distance 2
accountCode 1
dtype: int64
Remove the unwanted rows in place:
df.dropna(subset=['Distance'], inplace=True)
After: count rows with nan (for each column):
df.isnull().sum()
count by column:
areaCode 0
Distance 0
accountCode 1
dtype: int64
The resulting dataframe:
   areaCode  Distance accountCode
1         4       5.0        A213
2         7       8.0         NaN
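As a side note, the in-place call above can equally be written as a reassignment, which leaves the original frame untouched; a minimal sketch recreating the same data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 'A100'],
                   [4, 5, 'A213'],
                   [7, 8, np.nan],
                   [10, np.nan, 'GA23']],
                  columns=['areaCode', 'Distance', 'accountCode'])

# Equivalent to df.dropna(subset=['Distance'], inplace=True),
# but returns a new frame instead of mutating df
df2 = df.dropna(subset=['Distance'])
```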
Upvotes: 0
Reputation: 83
If you want to count and graph the number of NaNs before dropping your column(s):
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

cols = df.columns
nans = [df[col].isna().sum() for col in cols]

sns.set(font_scale=1.1)
# seaborn >= 0.12 requires x and y as keyword arguments
ax = sns.barplot(x=cols, y=nans, palette='hls')
ax.set(xlabel='Feature', ylabel='Number of NaNs', title='Number of NaNs per feature')

# Annotate each bar with its NaN count
for p, count in zip(ax.patches, nans):
    height = p.get_height()
    ax.text(p.get_x() + p.get_width() / 2., height + 10, count, ha="center")

ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
plt.show()
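If only the counts are needed, pandas alone can produce (and plot) them without seaborn; a minimal sketch, assuming a small example frame like the one in the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': [1, 2, 3],
                   'Col2': [np.nan, 5, 3],
                   'Col3': [4, 4, np.nan]})

# Series of NaN counts, indexed by column name
nan_counts = df.isna().sum()
print(nan_counts)
# nan_counts.plot.bar() would draw a comparable chart without seaborn
```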
Upvotes: 0
Reputation: 210972
You can use the DataFrame.dropna() method:
In [202]: df.dropna(subset=['Col2'])
Out[202]:
Col1 Col2 Col3
1 2 5.0 4.0
2 3 3.0 NaN
or the (less idiomatic in this case) Series.notnull():
In [204]: df.loc[df.Col2.notnull()]
Out[204]:
Col1 Col2 Col3
1 2 5.0 4.0
2 3 3.0 NaN
or using DataFrame.query() method:
In [205]: df.query("Col2 == Col2")
Out[205]:
Col1 Col2 Col3
1 2 5.0 4.0
2 3 3.0 NaN
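The query trick above works because NaN is the only value that does not compare equal to itself, so `Col2 == Col2` is False exactly on the NaN rows; a quick check:

```python
import numpy as np

# IEEE 754: NaN compares unequal to everything, including itself
print(np.nan == np.nan)   # False
print(1.0 == 1.0)         # True
```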
A numexpr solution:
In [241]: import numexpr as ne
In [242]: col = df.Col2
In [243]: df[ne.evaluate("col == col")]
Out[243]:
Col1 Col2 Col3
1 2 5.0 4.0
2 3 3.0 NaN
Upvotes: 57
Reputation: 294516
Using numpy's isnan to mask and construct a new dataframe:
m = ~np.isnan(df.Col2.values)
pd.DataFrame(df.values[m], df.index[m], df.columns)
Col1 Col2 Col3
1 2.0 5.0 4.0
2 3.0 3.0 NaN
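One caveat worth noting about this approach: np.isnan only accepts numeric data, so on an object-dtype column (e.g. strings mixed with NaN, as in another answer here) it raises a TypeError, whereas pd.isna handles both. A small illustration:

```python
import numpy as np
import pandas as pd

s_num = pd.Series([2.0, np.nan, 3.0])                 # float dtype
s_obj = pd.Series(['A100', np.nan, 'GA23'])           # object dtype

print(np.isnan(s_num.values))   # fine on numeric data
print(pd.isna(s_obj.values))    # fine on object data too
# np.isnan(s_obj.values) would raise TypeError on the object column
```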
Timing
Bigger Data
np.random.seed([3,1415])
df = pd.DataFrame(np.random.choice([np.nan, 1], size=(10000, 10))).add_prefix('Col')
%%timeit
m = ~np.isnan(df.Col2.values)
pd.DataFrame(df.values[m], df.index[m], df.columns)
1000 loops, best of 3: 326 µs per loop
%timeit df.query("Col2 == Col2")
1000 loops, best of 3: 1.48 ms per loop
%timeit df.loc[df.Col2.notnull()]
1000 loops, best of 3: 417 µs per loop
%timeit df[~df['Col2'].isnull()]
1000 loops, best of 3: 385 µs per loop
%timeit df.dropna(subset=['Col2'])
1000 loops, best of 3: 913 µs per loop
Upvotes: 6
Reputation: 863531
Use dropna:
df = df.dropna(subset=['Col2'])
print (df)
Col1 Col2 Col3
1 2 5.0 4.0
2 3 3.0 NaN
Another solution: boolean indexing with notnull:
df = df[df['Col2'].notnull()]
print (df)
Col1 Col2 Col3
1 2 5.0 4.0
2 3 3.0 NaN
Which is the same as:
df = df[~df['Col2'].isnull()]
print (df)
Col1 Col2 Col3
1 2 5.0 4.0
2 3 3.0 NaN
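Note that in recent pandas versions, notnull and isnull are aliases of notna and isna, so either spelling of the boolean mask gives an identical result; a brief check on the question's frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': [1, 2, 3],
                   'Col2': [np.nan, 5, 3],
                   'Col3': [4, 4, np.nan]})

# notnull/notna (and isnull/isna) produce identical masks
mask_a = df['Col2'].notnull()
mask_b = df['Col2'].notna()
df2 = df[mask_a]
```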
Upvotes: 15