s900n

Reputation: 3375

Python: Check if dataframe column contain string type

I want to check if the columns in a dataframe consist of strings so I can label them with numbers for machine learning purposes. Some columns consist of numbers; I don't want to change those. An example of the columns can be seen below:

TRAIN FEATURES
  Age   Level
  32.0  Silver
  61.0  Silver
  66.0  Silver
  36.0  Gold
  20.0  Silver
  29.0  Silver
  46.0  Silver
  27.0  Silver

Thank you=)

Upvotes: 38

Views: 92916

Answers (8)

vc 74

Reputation: 38179

Four years since this question was created, and I believe there's still no definitive answer.

I don't think strings were ever considered first-class citizens in Pandas (even >= 1.0.0). As an example:

import pandas as pd
import datetime

df = pd.DataFrame({
    'str': ['a', 'b', 'c', None],
    'hete': [1, 2.0, datetime.datetime.utcnow(), None]
})

string_series = df['str']
print(string_series.dtype)
print(pd.api.types.is_string_dtype(string_series.dtype))

heterogenous_series = df['hete']
print(heterogenous_series.dtype)
print(pd.api.types.is_string_dtype(heterogenous_series.dtype))

prints

object
True
object
True

So although hete does not contain any explicit strings, it is considered a string series.

After reading the documentation, I think the only way to make sure a series contains only strings is:

def is_string_series(s: pd.Series) -> bool:
    if isinstance(s.dtype, pd.StringDtype):
        # The series was explicitly created as a string series (Pandas>=1.0.0)
        return True
    elif s.dtype == 'object':
        # Object series, check each value
        return all((v is None) or isinstance(v, str) for v in s)
    else:
        return False


print(is_string_series(string_series))
print(is_string_series(heterogenous_series))

prints

True
False

April 2023 Update

It seems like the recently released Pandas 2 behaves the same way (the test script above produces the same output with Python 3.11).

Upvotes: 26

Yourstruly

Reputation: 699

With Pandas 1.0, convert_dtypes was introduced. When a column was not explicitly created as StringDtype, it can easily be converted.

pd.StringDtype.is_dtype will then return True for string columns, even when they contain NA values.
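
For example, a minimal sketch of that conversion:

import pandas as pd

s = pd.Series(['a', 'b', None])            # plain object column
print(pd.StringDtype.is_dtype(s.dtype))    # False: dtype is object

s = s.convert_dtypes()                     # infers the dedicated string dtype
print(s.dtype)                             # string
print(pd.StringDtype.is_dtype(s.dtype))    # True, even with the NA value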

For both old- and new-style strings, the complete series of checks could be something like this:

def has_string_type(s: pd.Series) -> bool:
    if pd.StringDtype.is_dtype(s.dtype):
        # StringDtype extension type
        return True

    if s.dtype != "object":
        # No object column - definitely no string
        return False

    try:
        s.str
    except AttributeError:
        return False

    # The str accessor exists, this must be a String column
    return True
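
A quick check of all three paths, using the helper above (note that exactly which object columns the .str accessor rejects can vary a little between pandas versions):

print(has_string_type(pd.Series(['a', 'b', None])))      # True: object column of strings
print(has_string_type(pd.Series([1, 2], dtype=object)))  # False: object column, but .str raises
print(has_string_type(pd.Series([1.0, 2.0])))            # False: not an object column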

Upvotes: 3

yohoo

Reputation: 195

This will return a list of the column names whose dtype is string (object in this case):

# let df be your dataframe
df.columns[df.dtypes==object].tolist()
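
For example, with columns like the ones in the question (made-up values):

import pandas as pd

df = pd.DataFrame({'Age': [32.0, 61.0], 'Level': ['Silver', 'Gold']})
print(df.columns[df.dtypes == object].tolist())
# ['Level']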

Upvotes: 0

DataMacGyver

Reputation: 446

As far as I can tell, the only surefire way to know what types are there is to check the values; then you can do an assertion to see if it's what you expect.

The function below gets the type of each value in a column, drops duplicates, and casts the result to a list so you can view and interact with it. This lets you deal with mixed types, objects, and NAs the way you wish (of course np.nan is of type float, but I leave such things to the interested reader).

import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3, 4],
                   "col2": ["a", "b", "c", "d"],
                   "col3": [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
                   })

print(df.dtypes.to_dict())
# {'col1': dtype('int64'), 'col2': dtype('O'), 'col3': dtype('O')}

def true_dtype(df): # You could add a column filter here too
    return {col: df[col].apply(type).unique().tolist() for col in df.columns}

true_types = true_dtype(df)
print(true_types)
# {'col1': [<class 'int'>], 'col2': [<class 'str'>], 'col3': [<class 'list'>]}

print(true_types['col2'] == [str])
# True
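
To tie this back to the question, you could then pick out the columns whose only value type is str:

string_cols = [col for col, types in true_types.items() if types == [str]]
print(string_cols)
# ['col2']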

Upvotes: 0

hamx0r

Reputation: 4278

I use a 2-step approach: first determine if dtype == object, and if so, check the first row of data to see whether that column's data is a string or not.

c = 'my_column_name'
if df[c].dtype == object and isinstance(df[c].iloc[0], str):
    # do something 
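
Applied to every column to collect the string columns (a sketch; it assumes the frame has at least one row):

string_cols = [c for c in df.columns
               if df[c].dtype == object and isinstance(df[c].iloc[0], str)]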

Upvotes: 15

toto_tico

Reputation: 19027

Notice that the above answers will also include DateTime, Timestamp, Category, and other datatypes.

Using object is more restrictive (although I am not sure whether other dtypes would also be of object dtype):

  1. Create the dataframe:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        'a': ['a','b','c','d'], 
        'b': [1, 'b', 'c', 2], 
        'c': [np.nan, 2, 3, 4], 
        'd': ['A', 'B', 'B', 'A'], 
        'e': pd.to_datetime('today')})
    df['d'] = df['d'].astype('category')
    

That will look like this:

   a  b    c  d          e
0  a  1  NaN  A 2018-05-17
1  b  b  2.0  B 2018-05-17
2  c  c  3.0  B 2018-05-17
3  d  2  4.0  A 2018-05-17
  2. You can check the types by calling dtypes:

    df.dtypes
    
    a            object
    b            object
    c           float64
    d          category
    e    datetime64[ns]
    dtype: object
    
  3. You can list the string columns using the items() method and filtering by object:

    > [col for col, dt in df.dtypes.items() if dt == object]
    ['a', 'b']
    
  4. Or you can use select_dtypes to display a dataframe with only the string columns:

    df.select_dtypes(include=[object])
       a  b
    0  a  1
    1  b  b
    2  c  c
    3  d  2
    

Upvotes: 16

boot-scootin

Reputation: 12515

Expanding on Scratch'N'Purr's answer:

>>> df = pd.DataFrame({'a': ['a','b','c','d'], 'b': [1, 'b', 'c', 2], 'c': [np.nan, 2, 3, 4]})
>>> df 
   a  b    c
0  a  1  NaN
1  b  b  2.0
2  c  c  3.0
3  d  2  4.0

>>> dict(filter(lambda x: x[1] != np.number, list(zip(df.columns, df.dtypes))))
{'a': dtype('O'), 'b': dtype('O')}

So I've added some columns with mixed types. You can see that the filter + dict approach yields key: value mappings of which columns have dtypes outside the bounds of np.number. This ought to work well at scale. You could also try coercing each column to a specific type (e.g. int) and then catching the ValueError when you can't convert a string column to int, as sketched below. Lots of ways to do this.
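
A minimal sketch of that coercion idea (fails_numeric_coercion is a made-up helper; pd.to_numeric is used instead of int so numeric columns containing NaN still pass):

def fails_numeric_coercion(s):
    # a column that cannot be coerced to numbers is treated as string-like
    try:
        pd.to_numeric(s)
        return False
    except (ValueError, TypeError):
        return True

print([c for c in df.columns if fails_numeric_coercion(df[c])])
# ['a', 'b']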

Upvotes: 1

Scratch'N'Purr

Reputation: 10399

Yes, it's possible. You can use dtype:

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': ['a','b','c','d']})
if df['a'].dtype != np.number:
    print('yes')
else:
    print('no')

You can also select your columns by dtype using select_dtypes:

df_subset = df.select_dtypes(exclude=[np.number])
# Now you can label encode your df_subset, e.g. as sketched below
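
For the label-encoding step itself, one option is pd.factorize (a sketch, not the only way):

for col in df_subset.columns:
    df[col] = pd.factorize(df[col])[0]  # each distinct value becomes an integer code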

Upvotes: 12
