Donbeo

Reputation: 17617

find position of string elements in pandas dataframe

I have a pandas dataframe and I suspect that it contains some strings:

>>> d2
   1     2     3     4     5     6     7     8     9     10    ...   1771  \
0     0     0     0     0     0     0     0     0     0     0  ...      0   
1     0     0     0     0     0     0     0     0     0     0  ...      0   
2     0     0     0     0     0     0     0     0     0     0  ...      0   
3     0     0     0     0     0     0     0     0     0     0  ...      0   
4     0     0     0     0     0     0     0     0     0     0  ...      0   
5     0     0     0     0     0     0     0     0     0     0  ...      0   
6     0     0     0     0     0     0     0     0     0     0  ...      0   
7     0     0     0     0     0     0     0     0     0     0  ...      0   
8     0     0     0     0     0     0     0     0     0     0  ...      0   
9     0     0     0     0     0     0     0     0     0     0  ...      0   

   1772  1773  1774  1775  1776  1777  1778  1779  1780  
0     0     0     0     0     0     0     1   398     2  
1     0     0     0     0     0     0     1   398     2  
2     0     0     0     0     0     0     1   398     2  
3     0     0     0     0     0     0     1   398     2  
4     0     0     0     0     0     0     1   398     2  
5     0     0     0     0     0     0     1   398     2  
6     0     0     0     0     0     0     1   398     2  
7     0     0     0     0     0     0     1   398     2  
8     0     0     0     0     0     0     1   398     2  
9     0     0     0     0     0     0     1   398     2  

[10 rows x 1780 columns]
>>> any(d2.applymap(lambda x: type(x) == str))
True

I would like to find which elements are strings and, if there are any, remove the columns containing them.

How can I do that?

I get a strange result. All the columns seem to have dtype int or float, yet the check below suggests that some elements are strings. How is this possible?

>>> d2.dtypes.drop_duplicates()
1         int64
1755    float64
dtype: object
>>> any(d2.applymap(lambda x: type(x) == str))
True

Upvotes: 1

Views: 2580

Answers (2)

Primer

Reputation: 10302

I would say that you are getting false positives because of the method you use.

Here is what I would do:

To select all columns that might contain text, you could use this command:

df.select_dtypes(include=['object']).columns

Or alternatively:

df.select_dtypes(exclude=['number']).columns
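As a minimal sketch of the two selections above, assuming a small hypothetical frame with one mixed column `b` (the frame and the variable names are illustrative, not taken from the question):

```python
import pandas as pd

# Hypothetical frame: column 'b' mixes a string and an integer,
# so pandas stores it with dtype object.
df = pd.DataFrame({'a': [1, 2], 'b': ['a', 1], 'c': [2.5, 3.5]})

# Columns stored as generic Python objects (candidates for holding text).
object_cols = list(df.select_dtypes(include=['object']).columns)

# Columns that are not numeric at all.
non_numeric_cols = list(df.select_dtypes(exclude=['number']).columns)

# The complementary view: keep only the numeric columns.
numeric_only = df.select_dtypes(include=['number'])
```

Both selections work on column dtypes, so they flag candidate columns rather than individual cells.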

To check whether any cell in the dataframe is text, use this command:

df.applymap(lambda x: isinstance(x, str)).any().any()

Or drop the last .any() to see which columns contain text and which don't:

df.applymap(lambda x: isinstance(x, str)).any()
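The per-column result can also be used to locate the string cells and drop the affected columns, which is what the question asks for. The following is a sketch under assumptions: the frame is hypothetical, the variable names are illustrative, and the element-wise check goes through `apply` plus `Series.map` rather than `applymap`, since newer pandas releases deprecate `applymap`.

```python
import pandas as pd

# Hypothetical frame: only column 'b' holds a string.
df = pd.DataFrame({'a': [1, 2], 'b': ['a', 1], 'c': [2, 3]})

# Boolean frame: True where the cell is a Python str.
str_mask = df.apply(lambda col: col.map(lambda x: isinstance(x, str)))

# Per-column flag: does the column contain at least one string?
cols_with_str = str_mask.any()

# (row label, column label) of every string cell.
positions = [(row, col)
             for col in df.columns
             for row in df.index
             if str_mask.at[row, col]]

# Keep only the columns without any string.
cleaned = df.loc[:, ~cols_with_str]
```

The explicit loop over labels is simple but slow on wide frames; it is only meant to show where the strings sit.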

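Calling the builtin any(your_dataframe) with the dataframe as its argument gives you a false positive: iterating over a dataframe yields its column labels, not its cells, so any() only tests whether some label is truthy.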
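A short sketch of why the builtin `any()` misleads here, assuming a hypothetical all-integer frame whose column labels are non-zero integers like those in the question:

```python
import pandas as pd

# Hypothetical frame with no strings at all; labels mimic the question's 1, 2, 3.
df = pd.DataFrame({1: [0, 0], 2: [0, 0], 3: [0, 0]})

# Iterating a DataFrame yields its column labels, not its cells.
labels = list(df)

# So the builtin any() is True whenever some column label is truthy.
builtin_any = any(df)

# The element-wise check, reduced over both axes, inspects the actual cells.
is_str = df.apply(lambda col: col.map(lambda x: isinstance(x, str)))
has_str = bool(is_str.any().any())
```

Here `builtin_any` is `True` even though `has_str` is `False`, because the labels 1, 2 and 3 are truthy.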

Upvotes: 2

Alexander

Reputation: 109636

Check the dtype of each column using a list comprehension and then exclude the object columns:

df[[col for col in df if df[col].dtype != 'O']]  # 'O' is letter O (not zero)

I'm not sure I understand your comment below, so I will explain further with a simple example:

d2 = pd.DataFrame({'a': [1, 2], 'b': ['a', 1], 'c': [2, 3]})

>>> d2
   a  b  c
0  1  a  2
1  2  1  3

>>> d2.applymap(lambda x: type(x))
                      a             b                     c
0  <type 'numpy.int64'>  <type 'str'>  <type 'numpy.int64'>
1  <type 'numpy.int64'>  <type 'int'>  <type 'numpy.int64'>

>>> d2.applymap(lambda x: type(x) == str)
       a      b      c
0  False   True  False
1  False  False  False

Note that you should use isinstance(x, target_type) to check if x is of type target_type:

>>> d2.applymap(lambda x: isinstance(x, str))
       a      b      c
0  False   True  False
1  False  False  False

Test the dtype of each column:

>>> [d2[col].dtype for col in d2]
[dtype('int64'), dtype('O'), dtype('int64')]

The solution works on this example:

>>> d2[[col for col in d2 if d2[col].dtype != 'O']]
   a  c
0  1  2
1  2  3

List all columns whose dtype is object ('O'):

>>> [col for col in d2 if d2[col].dtype == 'O']
['b']

Upvotes: 1
