Code_Sipra
Code_Sipra

Reputation: 1713

Select only integers from a column of mixed data types in pandas

I have a dataframe df as shown below. The column col2 has null values, blank values, integers and even float values. I want to derive a new dataframe new_df from df where the column col2 has only integer values.

import pandas as pd
import numpy as np

col1 = ["a", "b", "c", "d", "e", "f", "g", "h"]
col2 = ["25.45", "", "200", np.nan, "N/A", "null", "35", "5,300"]

df = pd.DataFrame({"col1": col1, "col2": col2})

This is how df looks:

  col1   col2
0    a  25.45
1    b       
2    c    200
3    d    NaN
4    e    N/A
5    f   null
6    g     35
7    h  5,300

Below is my desired output for new_df where the column col2 values are only integers:

  col1   col2  
2    c    200
6    g     35

I have tried using pd.to_numeric() and even isdigit() functions but they are expecting a series as input. Is there an easy way to get the desired output?

Upvotes: 1

Views: 3620

Answers (2)

Ben
Ben

Reputation: 998

To answer the same question but with slightly different data; imagine that we have the same dataframe, but now with a third column that has a column with strings, integers and floats (including np.nan).

import pandas as pd
import numpy as np

col1 = ["a", "b", "c", "d", "e", "f", "g", "h"]
col2 = ["25.45", "", "200", np.nan, "N/A", "null", "35", "5,300"]
col3 = [25.45, "", 200, np.nan, "N/A", "null", 35, "5,300"] # new column with mixed types

df = pd.DataFrame({"col1": col1, "col2": col2, "col3": col3})
print(df)


    col1    col2    col3
0   a       25.45   25.45
1   b       
2   c       200     200
3   d       NaN     NaN
4   e       N/A     N/A
5   f       null    null
6   g       35      35
7   h       5,300   5,300

To select just integers:

df2 = df.loc[df.col3.apply(lambda x : isinstance(x, int))]
print(df2)

    col1    col2    col3
2   c       200     200
6   g       35      35

And just floats:

df3 = df.loc[df.col3.apply(lambda x : isinstance(x, float))]
print(df3)


    col1    col2    col3
0   a       25.45   25.45
3   d       NaN     NaN

(note that np.nan is a float)

Upvotes: 1

cs95
cs95

Reputation: 402423

str.isdigit

Filter out digits and select by boolean indexing:

df2 = df[df.col2.astype(str).str.isdigit()]    
print(df2)
  col1 col2
2    c  200
6    g   35

P.S., to convert "col2" to integer, use

df2['col2'] = df2['col2'].astype(int)

str.contains

You could also use str.contains, albeit slower, since it uses regex.

df[df.col2.astype(str).str.contains(r'^\d+$')]

  col1 col2
2    c  200
6    g   35

pd.to_numeric

A third solution is somewhat hacky, but uses pd.to_numeric. We need one pre-replace step to filter out floats.

v = df.col2.astype(str).str.replace('.', '|', regex=False)
df[pd.to_numeric(v, errors='coerce').notna()]

  col1 col2
2    c  200
6    g   35

Upvotes: 5

Related Questions