WhoIsJack
WhoIsJack

Reputation: 1498

Selecting string columns in pandas df (equivalent to df.select_dtypes)

Given a Pandas df with different data types, df.select_dtypes can be very useful to keep only desired columns or to get rid of unwanted columns for a particular application.

However, there seems to be no way of addressing string dtypes using this method.

From the docs (emphasis mine):

ValueError
  If both of include and exclude are empty
  If include and exclude have overlapping elements
  If any kind of string dtype is passed in.

and

To select strings you must use the object dtype, but note that this will return all object dtype columns

Indeed, using df.select_dtypes(exclude=['str']) raises an error (although it is a TypeError and not a ValueError as the docs claim) and using df.select_dtypes(exclude=['object']) removes all object columns, not just string columns.


Given a df like this:

df = pd.DataFrame({'int_col':[0,1,2,3,4],
                   'dict_col':[dict() for i in range(5)],
                   'str_col':list('abcde')})

and considering that

df.dtypes

is object for both str_col and dict_col:


What is the best way of excluding or including all string columns?

Upvotes: 9

Views: 19340

Answers (2)

Kwame
Kwame

Reputation: 757

From the pandas source:

~/.local/share/virtualenvs/.../lib/python3.7/site-packages/pandas/core/dtypes/cast.py in invalidate_string_dtypes(dtype_set)
    851     non_string_dtypes = dtype_set - {np.dtype("S").type, np.dtype("<U").type}
    852     if non_string_dtypes != dtype_set:
--> 853         raise TypeError("string dtypes are not allowed, use 'object' instead")
    854 
    855 

best practice is to use object instead, apparently. So, just do

df.select_dtypes(exclude=['object'])

under the assumption that no actual Python objects are in those columns but str types.

Upvotes: 0

cs95
cs95

Reputation: 402553

Option 1

Using df.applymap and type, and equating to str:

In [377]: (df.applymap(type) == str).all(0)
Out[377]: 
dict_col    False
int_col     False
str_col      True
dtype: bool

Each element in each column converted to its type and then equated to str. After that, just call .all(0) or .min(0) to get a per-column verdict.


Option 2

Using df.applymap and isinstance:

In [342]: df.applymap(lambda x: isinstance(x, str)).all(0)
Out[342]: 
dict_col    False
int_col     False
str_col      True

To include these string columns, you can boolean index on the columns:

idx = ... # one of the two methods above
df_new = df[df.columns[idx]]

Exclusion would be

df_new = df[df.columns[~idx]]

Upvotes: 17

Related Questions