Reputation: 17617
I have a pandas DataFrame, and I suspect that it contains some strings:
>>> d2
1 2 3 4 5 6 7 8 9 10 ... 1771 \
0 0 0 0 0 0 0 0 0 0 0 ... 0
1 0 0 0 0 0 0 0 0 0 0 ... 0
2 0 0 0 0 0 0 0 0 0 0 ... 0
3 0 0 0 0 0 0 0 0 0 0 ... 0
4 0 0 0 0 0 0 0 0 0 0 ... 0
5 0 0 0 0 0 0 0 0 0 0 ... 0
6 0 0 0 0 0 0 0 0 0 0 ... 0
7 0 0 0 0 0 0 0 0 0 0 ... 0
8 0 0 0 0 0 0 0 0 0 0 ... 0
9 0 0 0 0 0 0 0 0 0 0 ... 0
1772 1773 1774 1775 1776 1777 1778 1779 1780
0 0 0 0 0 0 0 1 398 2
1 0 0 0 0 0 0 1 398 2
2 0 0 0 0 0 0 1 398 2
3 0 0 0 0 0 0 1 398 2
4 0 0 0 0 0 0 1 398 2
5 0 0 0 0 0 0 1 398 2
6 0 0 0 0 0 0 1 398 2
7 0 0 0 0 0 0 1 398 2
8 0 0 0 0 0 0 1 398 2
9 0 0 0 0 0 0 1 398 2
[10 rows x 1780 columns]
>>> any(d2.applymap(lambda x: type(x) == str))
True
>>>
I would like to find which elements are strings and, if there are any, remove the columns containing them.
How can I do that?
I get a strange result: all the columns seem to have dtype int or float, yet at the same time some elements appear to be strings. How is this possible?
>>> d2.dtypes.drop_duplicates()
1 int64
1755 float64
dtype: object
>>> any(d2.applymap(lambda x: type(x) == str))
True
Upvotes: 1
Views: 2580
Reputation: 10302
I would say that you are getting a false positive because of the way you call any().
Here is what I would do.
To select all columns that might contain text, use:
df.select_dtypes(include=['object']).columns
Or alternatively:
df.select_dtypes(exclude=['number']).columns
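To actually remove the offending columns (as the question asks), the same select_dtypes call can return the filtered frame directly instead of just the column labels. A minimal sketch with a made-up DataFrame, where column 'b' holds strings and therefore has object dtype:

```python
import pandas as pd

# Hypothetical example: column 'b' holds strings, so its dtype is object.
df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y'], 'c': [2.0, 3.0]})

# Keep only the non-object columns, i.e. drop any string-typed column:
cleaned = df.select_dtypes(exclude=['object'])
print(list(cleaned.columns))  # ['a', 'c']
```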
To check whether any cell in the DataFrame is text, use:
df.applymap(lambda x: isinstance(x, str)).any().any()
Or drop the final .any() to see which columns contain text and which don't:
df.applymap(lambda x: isinstance(x, str)).any()
Calling the built-in any(your_dataframe) (passing the DataFrame itself as the argument) gives a false positive: iterating over a DataFrame yields its column labels, not its cell values, and any non-empty label is truthy.
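The difference is easy to demonstrate on a purely numeric frame (a small sketch; the column names are made up):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})  # purely numeric, no strings

# Iterating a DataFrame yields its column labels, so the built-in any()
# only tests the truthiness of the labels, never the cell values:
print(list(df))                                        # ['a', 'b']
print(any(df.applymap(lambda x: isinstance(x, str))))  # True -- false positive!

# The DataFrame method checks the actual cells:
print(df.applymap(lambda x: isinstance(x, str)).any().any())  # False
```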
Upvotes: 2
Reputation: 109636
Check the type of each column using a list comprehension and then exclude objects:
df[[col for col in df if df[col].dtype != 'O']] # 'O' is letter O (not zero)
I'm not sure I understand your comment below, so I will explain further with a simple example:
d2 = pd.DataFrame({'a': [1, 2], 'b': ['a', 1], 'c': [2, 3]})
>>> d2
a b c
0 1 a 2
1 2 1 3
>>> d2.applymap(lambda x: type(x))
a b c
0 <type 'numpy.int64'> <type 'str'> <type 'numpy.int64'>
1 <type 'numpy.int64'> <type 'int'> <type 'numpy.int64'>
>>> d2.applymap(lambda x: type(x) == str)
a b c
0 False True False
1 False False False
Note that you should use isinstance(x, target_type) to check whether x is of type target_type:
>>> d2.applymap(lambda x: isinstance(x, str))
a b c
0 False True False
1 False False False
Test the type of each column:
>>> [d2[col].dtype for col in d2]
[dtype('int64'), dtype('O'), dtype('int64')]
The solution clearly works:
>>> d2[[col for col in d2 if d2[col].dtype != 'O']]
a c
0 1 2
1 2 3
List all columns that are of type 'object':
>>> [col for col in d2 if d2[col].dtype == 'O']
['b']
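The same filter can also be written as a boolean mask over the dtypes, which avoids the explicit loop (a sketch equivalent to the list comprehension above, using the same toy d2):

```python
import pandas as pd

d2 = pd.DataFrame({'a': [1, 2], 'b': ['a', 1], 'c': [2, 3]})

# Boolean-mask equivalent of the list comprehension: d2.dtypes != 'O'
# is a Series of booleans indexed by column name.
numeric_only = d2.loc[:, d2.dtypes != 'O']
print(list(numeric_only.columns))  # ['a', 'c']
```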
Upvotes: 1