Reputation: 1713
I have a dataframe df
as shown below. The column col2
has null values, blank values, integers and even float values. I want to derive a new dataframe new_df
from df
where the column col2
has only integer values.
import pandas as pd
import numpy as np
col1 = ["a", "b", "c", "d", "e", "f", "g", "h"]
col2 = ["25.45", "", "200", np.nan, "N/A", "null", "35", "5,300"]
df = pd.DataFrame({"col1": col1, "col2": col2})
This is how df
looks:
col1 col2
0 a 25.45
1 b
2 c 200
3 d NaN
4 e N/A
5 f null
6 g 35
7 h 5,300
Below is my desired output for new_df
where the column col2
values are only integers:
col1 col2
2 c 200
6 g 35
I have tried using pd.to_numeric() and even isdigit() functions but they are expecting a series as input. Is there an easy way to get the desired output?
Upvotes: 1
Views: 3620
Reputation: 998
To answer the same question but with slightly different data; imagine that we have the same dataframe, but now with a third column that has a column with strings, integers and floats (including np.nan).
import pandas as pd
import numpy as np
col1 = ["a", "b", "c", "d", "e", "f", "g", "h"]
col2 = ["25.45", "", "200", np.nan, "N/A", "null", "35", "5,300"]
col3 = [25.45, "", 200, np.nan, "N/A", "null", 35, "5,300"] # new column with mixed types
df = pd.DataFrame({"col1": col1, "col2": col2, "col3": col3})
print(df)
col1 col2 col3
0 a 25.45 25.45
1 b
2 c 200 200
3 d NaN NaN
4 e N/A N/A
5 f null null
6 g 35 35
7 h 5,300 5,300
To select just integers:
df2 = df.loc[df.col3.apply(lambda x : isinstance(x, int))]
print(df2)
col1 col2 col3
2 c 200 200
6 g 35 35
And just floats:
df3 = df.loc[df.col3.apply(lambda x : isinstance(x, float))]
print(df3)
col1 col2 col3
0 a 25.45 25.45
3 d NaN NaN
(note that np.nan is a float)
Upvotes: 1
Reputation: 402423
str.isdigit
Filter out digits and select by boolean indexing:
df2 = df[df.col2.astype(str).str.isdigit()]
print(df2)
col1 col2
2 c 200
6 g 35
P.S., to convert "col2" to integer, use
df2['col2'] = df2['col2'].astype(int)
str.contains
You could also use str.contains
, albeit slower, since it uses regex.
df[df.col2.astype(str).str.contains(r'^\d+$')]
col1 col2
2 c 200
6 g 35
pd.to_numeric
A third solution is somewhat hacky, but uses pd.to_numeric
. We need one pre-replace step to filter out floats.
v = df.col2.astype(str).str.replace('.', '|', regex=False)
df[pd.to_numeric(v, errors='coerce').notna()]
col1 col2
2 c 200
6 g 35
Upvotes: 5