Dread
Dread

Reputation: 861

Compare dataframe column with strings and numbers

I have a dataframe with two columns that have strings and numbers. When I compare the two columns, they don't match even though they appear to be the same.

example data:

old_code     new_code
100000       100000

When I compare, the result is false:

df['old_code'] == df['new_code']
0    False
dtype: bool

The datatypes are the same:

df.dtypes
old_code    object
new_code    object
dtype: object

I tried stripping the columns in case there is whitespace but they still don't match:

df['old_code'].str.strip() == df['new_code'].str.strip() 
0    False
dtype: bool

I made the dataframe into a dictionary to see if I could spot anything amiss and I noticed that for some reason one of the columns is formatted as a string and the other an integer:

d = df.to_dict()
d
{'old_code': {0: '100000'}, 'new_code': {0: 100000}}

Any suggestions on how to compare these columns? I don't want to change the data type to integer since some of the rows are truly strings.

Upvotes: 3

Views: 13635

Answers (2)

Karn Kumar
Karn Kumar

Reputation: 8816

What about using numpy:

convert to int:

>>> import numpy as np
>>> df['old_code'].astype(dtype=np.int64)
0    100000
Name: old_code, dtype: int64

convert to float:

>>> df['old_code'].astype(dtype=np.float64)
0    100000.0
Name: old_code, dtype: float64

convert to string:

>>> df['old_code'].astype(dtype=np.str)
0    100000
Name: old_code, dtype: object

DataFrame.apply() can be used with pd.to_numeric

>>> df.apply(pd.to_numeric, errors='coerce')
   old_code  new_code
0    100000    100000

Upvotes: 1

jpp
jpp

Reputation: 164623

object dtype series can hold anything

The issue is object dtype series contain arbitrary Python objects. Here your series have a string in one and an integer in another:

df = pd.DataFrame({'old_code': ['100000'], 'new_code': [100000]}, dtype=object)

print(df.dtypes)
# new_code    object
# old_code    object
# dtype: object

print(type(df['old_code'].iat[0]))
# <class 'str'>

print(type(df['new_code'].iat[0]))
# <class 'int'>

Therefore, be consistent. Convert both to numeric or both to strings.

Convert to numeric

cols = ['old_code', 'new_code']
df[cols] = df[cols].apply(pd.to_numeric)

Convert to strings

df[cols] = df[cols].apply(lambda s: s.astype(str))

Without modifying your dataframe

Strictly speaking, you don't need to update your dataframe. You can caste each series for comparison purposes only:

# numeric comparison
pd.to_numeric(df['old_code']) == pd.to_numeric(df['new_code'])

# string comparison
df['old_code'].astype(str) == df['new_code'].astype(str)

Upvotes: 12

Related Questions