Marisa
Marisa

Reputation: 1183

How to keep just the numbers on a column dataframe that also has strings?

I have the following column in my dataframe:

Column1   Column2    Column3     Column4
a            1           2           a     
1            2           a           c
b            3           c           d
3            2           1           b
4            2           1           a
c            1           d           a

The type of these columns is object, I would like to convert Column1, Column2 and Column3 to numeric type int8 while keeping Column4 as type object. To do so, I had tried to use pd.to_numeric(data.Column1) (I was planning to do the same after with Column2 and Column3) but I get the following error:

ValueError: Unable to parse string "a" at position 0

Which is obvious why it is happening. I was wondering if there is any way that would allow me to get rid of of that rows formed by strings in these 3 columns so after that, I would get:

Column1    Column 2    Column 3   Column 4
3            2           1           b
4            2           1           a

Is there a way to achieve so? Or is there another way that would allow me to?

Edit: I have checked the question in Remove non-numeric rows in one column with pandas, but it didn't solve my problem, as I have more columns in my dataset than just two and one of them I don't want to convert it to numeric.

Upvotes: 2

Views: 3363

Answers (1)

jezrael
jezrael

Reputation: 862481

Use apply with to_numeric for replace non numeric to missing values, then remove NaNs rows by dropna and last cast to integers:

df = df.apply(lambda x: pd.to_numeric(x, errors='coerce')).dropna().astype(int)
print (df)
   Column1  Column2  Column3
3        3        2        1
4        4        2        1

Detail:

print (df.apply(lambda x: pd.to_numeric(x,errors='coerce')))
   Column1  Column2  Column3
0      NaN      1.0      2.0
1      1.0      2.0      NaN
2      NaN      NaN      NaN
3      3.0      2.0      1.0
4      4.0      2.0      1.0
5      NaN      1.0      NaN

EDIT:

Another solution is check if not missing values with DataFrame.all with boolean indexing:

cols = ['Column1','Column2','Column3']
#define columns for check numeric
mask = df[cols].apply(lambda x: pd.to_numeric(x, errors='coerce')).notnull().all(axis=1)
#filtering
df = df[mask]
#converting to integers
df[cols] = df[cols].astype(int)
print (df)
   Column1  Column2  Column3 Column4
3        3        2        1       b
4        4        2        1       a

print (df.dtypes)
Column1     int32
Column2     int32
Column3     int32
Column4    object
dtype: object

Upvotes: 3

Related Questions