Reputation: 861
I have a dataframe with two columns that have strings and numbers. When I compare the two columns, they don't match even though they appear to be the same.
example data:
old_code new_code
100000 100000
When I compare, the result is false:
df['old_code'] == df['new_code']
0 False
dtype: bool
The datatypes are the same:
df.dtypes
old_code object
new_code object
dtype: object
I tried stripping the columns in case there is whitespace but they still don't match:
df['old_code'].str.strip() == df['new_code'].str.strip()
0 False
dtype: bool
I made the dataframe into a dictionary to see if I could spot anything amiss and I noticed that for some reason one of the columns is formatted as a string and the other an integer:
d = df.to_dict()
d
{'old_code': {0: '100000'}, 'new_code': {0: 100000}}
Any suggestions on how to compare these columns? I don't want to change the data type to integer since some of the rows are truly strings.
Upvotes: 3
Views: 13635
Reputation: 8816
What about using numpy:
convert to int:
>>> import numpy as np
>>> df['old_code'].astype(dtype=np.int64)
0 100000
Name: old_code, dtype: int64
convert to float:
>>> df['old_code'].astype(dtype=np.float64)
0 100000.0
Name: old_code, dtype: float64
convert to string:
>>> df['old_code'].astype(dtype=np.str)
0 100000
Name: old_code, dtype: object
DataFrame.apply()
can be used with pd.to_numeric
>>> df.apply(pd.to_numeric, errors='coerce')
old_code new_code
0 100000 100000
Upvotes: 1
Reputation: 164623
object
dtype series can hold anythingThe issue is object
dtype series contain arbitrary Python objects. Here your series have a string in one and an integer in another:
df = pd.DataFrame({'old_code': ['100000'], 'new_code': [100000]}, dtype=object)
print(df.dtypes)
# new_code object
# old_code object
# dtype: object
print(type(df['old_code'].iat[0]))
# <class 'str'>
print(type(df['new_code'].iat[0]))
# <class 'int'>
Therefore, be consistent. Convert both to numeric or both to strings.
cols = ['old_code', 'new_code']
df[cols] = df[cols].apply(pd.to_numeric)
df[cols] = df[cols].apply(lambda s: s.astype(str))
Strictly speaking, you don't need to update your dataframe. You can caste each series for comparison purposes only:
# numeric comparison
pd.to_numeric(df['old_code']) == pd.to_numeric(df['new_code'])
# string comparison
df['old_code'].astype(str) == df['new_code'].astype(str)
Upvotes: 12