Reputation: 3375
I want to check whether the columns in a dataframe consist of strings so I can label them with numbers for machine learning purposes. Some columns consist of numbers, and I don't want to change those. An example of the columns can be seen below:
TRAIN FEATURES
Age Level
32.0 Silver
61.0 Silver
66.0 Silver
36.0 Gold
20.0 Silver
29.0 Silver
46.0 Silver
27.0 Silver
Thank you=)
Upvotes: 38
Views: 92916
Reputation: 38179
Four years since this question was asked, and I believe there's still no definitive answer.
I don't think strings were ever considered a first-class citizen in Pandas (even in >= 1.0.0). As an example:
import pandas as pd
import datetime
df = pd.DataFrame({
    'str': ['a', 'b', 'c', None],
    'hete': [1, 2.0, datetime.datetime.utcnow(), None]
})
string_series = df['str']
print(string_series.dtype)
print(pd.api.types.is_string_dtype(string_series.dtype))
heterogenous_series = df['hete']
print(heterogenous_series.dtype)
print(pd.api.types.is_string_dtype(heterogenous_series.dtype))
prints
object
True
object
True
So although hete does not contain any explicit strings, it is considered a string series.
After reading the documentation, I think the only way to make sure a series contains only strings is:
def is_string_series(s: pd.Series):
    if isinstance(s.dtype, pd.StringDtype):
        # The series was explicitly created as a string series (Pandas >= 1.0.0)
        return True
    elif s.dtype == 'object':
        # Object series, check each value
        return all((v is None) or isinstance(v, str) for v in s)
    else:
        return False
print(is_string_series(string_series))
print(is_string_series(heterogenous_series))
prints
True
False
It seems like the recently released Pandas 2 behaves the same way (the test script above produces the same output with Python 3.11).
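As a usage sketch for the original question (the column names here are made up, and is_string_series is the function defined above), you could select only the string-typed columns like this:
import pandas as pd

df = pd.DataFrame({
    'Age': [32.0, 61.0, 66.0],
    'Level': ['Silver', 'Silver', 'Gold'],
})

# Keep only the columns where every value is a string (or None)
string_cols = [col for col in df.columns if is_string_series(df[col])]
print(string_cols)
# ['Level']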
Upvotes: 26
Reputation: 699
With Pandas 1.0, convert_dtypes was introduced. When a column was not explicitly created as StringDtype, it can easily be converted.
pd.StringDtype.is_dtype will then return True for string columns, even when they contain NA values.
For old- and new-style strings, the complete series of checks could look something like this:
def has_string_type(s: pd.Series) -> bool:
    if pd.StringDtype.is_dtype(s.dtype):
        # StringDtype extension type
        return True
    if s.dtype != "object":
        # Not an object column - definitely not a string column
        return False
    try:
        s.str
    except AttributeError:
        return False
    # The str accessor exists, so this must be a string column
    return True
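To illustrate the convert_dtypes path described above, here is a minimal sketch (the column contents are made up):
import pandas as pd

df = pd.DataFrame({'level': ['Silver', 'Gold', None]})
print(pd.StringDtype.is_dtype(df['level'].dtype))
# False, plain object column

converted = df.convert_dtypes()
print(converted['level'].dtype)
# string
print(pd.StringDtype.is_dtype(converted['level'].dtype))
# True, even with the NA value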
Upvotes: 3
Reputation: 195
This will return a list of column names whose dtype is string (object in this case):
#let df be your dataframe
df.columns[df.dtypes==object].tolist()
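For example, with a small made-up dataframe:
import pandas as pd

df = pd.DataFrame({'Age': [32.0, 61.0], 'Level': ['Silver', 'Gold']})
print(df.columns[df.dtypes == object].tolist())
# ['Level']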
Upvotes: 0
Reputation: 446
As far as I can tell, the only sure-fire way to know what types are there is to check the values; then you can do an assertion to see if it's what you expect.
The function below will get the type of each value in a column, drop duplicates, and then cast to a list so you can view/interact with it. This lets you deal with mixed types, objects, and NAs the way you wish (of course np.nan is of type float, but I leave such things to the interested reader).
import pandas as pd
df = pd.DataFrame({"col1": [1, 2, 3, 4],
"col2": ["a", "b", "c", "d"],
"col3": [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
})
print(df.dtypes.to_dict())
# {'col1': dtype('int64'), 'col2': dtype('O'), 'col3': dtype('O')}
def true_dtype(df):  # You could add a column filter here too
    return {col: df[col].apply(lambda x: type(x)).unique().tolist() for col in df.columns}
true_types = true_dtype(df)
print(true_types)
# {'col1': [<class 'int'>], 'col2': [<class 'str'>], 'col3': [<class 'list'>]}
print(true_types['col2'] == [str])
# True
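Building on that, one possible way to keep only the columns whose values are all plain str (a sketch using the true_types dict from above):
# Keep only the columns whose unique value types are exactly [str]
string_cols = [col for col, types in true_types.items() if types == [str]]
print(string_cols)
# ['col2']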
Upvotes: 0
Reputation: 4278
I use a two-step approach: first determine whether dtype == object, and if so, check the first row of data to see whether that column's data is a string or not.
c = 'my_column_name'
if df[c].dtype == object and isinstance(df.iloc[0][c], str):
    # do something
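A sketch of applying the same two-step check to every column to collect candidate string columns (df is assumed to be your dataframe and to have at least one row):
# Columns whose dtype is object and whose first value is a str
string_cols = [
    c for c in df.columns
    if df[c].dtype == object and isinstance(df[c].iloc[0], str)
]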
Upvotes: 15
Reputation: 19027
Notice that the above answers will include DateTime, TimeStamp, Category and other datatypes.
Using object is more restrictive (although I am not sure whether other dtypes would also be of object dtype):
Create the dataframe:
df = pd.DataFrame({
    'a': ['a', 'b', 'c', 'd'],
    'b': [1, 'b', 'c', 2],
    'c': [np.nan, 2, 3, 4],
    'd': ['A', 'B', 'B', 'A'],
    'e': pd.to_datetime('today')})
df['d'] = df['d'].astype('category')
That will look like this:
a b c d e
0 a 1 NaN A 2018-05-17
1 b b 2.0 B 2018-05-17
2 c c 3.0 B 2018-05-17
3 d 2 4.0 A 2018-05-17
You can check the types by calling dtypes:
df.dtypes
a object
b object
c float64
d category
e datetime64[ns]
dtype: object
You can list the string columns using the items() method and filtering by object:
> [ col for col, dt in df.dtypes.items() if dt == object]
['a', 'b']
Or you can use select_dtypes to display a dataframe with only the strings:
df.select_dtypes(include=[object])
a b
0 a 1
1 b b
2 c c
3 d 2
Upvotes: 16
Reputation: 12515
Expanding on Scratch'N'Purr's answer:
>>> df = pd.DataFrame({'a': ['a','b','c','d'], 'b': [1, 'b', 'c', 2], 'c': [np.nan, 2, 3, 4]})
>>> df
a b c
0 a 1 NaN
1 b b 2.0
2 c c 3.0
3 d 2 4.0
>>> dict(filter(lambda x: x[1] != np.number, list(zip(df.columns, df.dtypes))))
{'a': dtype('O'), 'b': dtype('O')}
So I've added some columns with mixed types. You can see that the filter + dict approach yields key: value mappings of which columns have dtypes outside of the bounds of np.number. This ought to work well at scale. You could also try coercing each column to a specific type (e.g. int) and then catch the ValueError exception when you can't convert a string column to int. Lots of ways to do this.
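A minimal sketch of that coercion idea, using pd.to_numeric instead of int so the whole column is attempted at once (pd.to_numeric raises ValueError when it hits a string it cannot parse):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'],
                   'b': [1, 'b', 'c', 2],
                   'c': [np.nan, 2, 3, 4]})

string_like_cols = []
for col in df.columns:
    try:
        pd.to_numeric(df[col])        # fails on values that can't be parsed as numbers
    except (ValueError, TypeError):
        string_like_cols.append(col)

print(string_like_cols)
# ['a', 'b']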
Upvotes: 1
Reputation: 10399
Yes, it's possible. You can use dtype:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': ['a','b','c','d']})
if df['a'].dtype != np.number:
    print('yes')
else:
    print('no')
You can also select your columns by dtype using select_dtypes:
df_subset = df.select_dtypes(exclude=[np.number])
# Now you can label encode your df_subset
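Tying this back to the original question, here is a possible sketch of label encoding those non-numeric columns with pd.factorize (the column names are illustrative):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [32.0, 61.0, 36.0], 'Level': ['Silver', 'Silver', 'Gold']})

df_subset = df.select_dtypes(exclude=[np.number])
for col in df_subset.columns:
    # factorize maps each unique string to an integer code
    df[col] = pd.factorize(df[col])[0]

print(df['Level'].tolist())
# [0, 0, 1]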
Upvotes: 12