Reputation: 567
I would like to group the columns of my DataFrame by their dtype. For example, the columns and their dtypes are:
Tweets    ID       Registration Date    num_unique_words    photo_profile    range
object    int64    object               float64             int64            category
What I did is:
type_dct = {str(k): list(v) for k, v in df.groupby(df.dtypes, axis=1)}
but I have got a TypeError:
TypeError: Cannot interpret 'CategoricalDtype(categories=['<5', '>=5'], ordered=True)' as a data type
The range column can take two values: '<5' and '>=5'.
How can I handle this error? Here is a sample DataFrame:
import pandas as pd

df = pd.DataFrame({'Tweets': ['Tweet 1 from user 1', 'Tweet 2 from user 1',
                              'Tweet 1 from user 3', 'Tweet 10 from user 1'],
                   'ID': [124, 124, 12, 124],
                   'Registration Date': ['2020-12-02', '2020-11-21',
                                         '2020-12-02', '2020-12-02'],
                   'num_unique_words': [41, 42, 12, 69],
                   'photo_profile': [1, 0, 1, 1],
                   'range': ['<5', '<5', '>=5', '<5']},
                  index=['falcon', 'dog', 'spider', 'fish'])
# Make 'range' an ordered categorical, matching the dtype in the error message
df['range'] = pd.Categorical(df['range'], categories=['<5', '>=5'], ordered=True)
Upvotes: 0
Views: 2968
Reputation: 153460
That was surprisingly more complicated than I thought it would be, but here is a workaround using a list comprehension:
type_dct = {str(k): list(v) for k, v in df.groupby([i.name for i in df.dtypes], axis=1)}
Output:
{'category': ['range'],
'int64': ['ID', 'num_unique_words', 'photo_profile'],
'object': ['Tweets', 'Registration Date']}
A pd.CategoricalDtype object by itself doesn't work well as a groupby key, so we must group on the name attribute of each dtype instead.
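The same idea, keying on each dtype's name attribute, can also be sketched without groupby at all, which may be useful on pandas versions where grouping with axis=1 emits a deprecation warning. This is only a sketch: the DataFrame is the sample from the question, and the conversion of range to an ordered categorical is an assumption based on the error message.

```python
from collections import defaultdict

import pandas as pd

df = pd.DataFrame({'Tweets': ['Tweet 1 from user 1', 'Tweet 2 from user 1',
                              'Tweet 1 from user 3', 'Tweet 10 from user 1'],
                   'ID': [124, 124, 12, 124],
                   'Registration Date': ['2020-12-02', '2020-11-21',
                                         '2020-12-02', '2020-12-02'],
                   'num_unique_words': [41, 42, 12, 69],
                   'photo_profile': [1, 0, 1, 1],
                   'range': ['<5', '<5', '>=5', '<5']},
                  index=['falcon', 'dog', 'spider', 'fish'])
# Assumption: 'range' is an ordered categorical, as the error message suggests
df['range'] = pd.Categorical(df['range'], categories=['<5', '>=5'], ordered=True)

# Map each dtype name (a plain string) to the list of columns with that dtype
type_dct = defaultdict(list)
for col, dtype in df.dtypes.items():
    type_dct[dtype.name].append(col)
type_dct = dict(type_dct)

print(type_dct)
```

Because the keys are plain strings rather than dtype objects, the categorical dtype never has to be interpreted as a NumPy data type.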
Alternatively, use pd.DataFrame.select_dtypes, which selects a subset of columns based on their dtypes.
Example from the docs:
df = pd.DataFrame({'a': [1, 2] * 3,
                   'b': [True, False] * 3,
                   'c': [1.0, 2.0] * 3})
df
   a      b    c
0  1   True  1.0
1  2  False  2.0
2  1   True  1.0
3  2  False  2.0
4  1   True  1.0
5  2  False  2.0
df.select_dtypes(include='bool')
       b
0   True
1  False
2   True
3  False
4   True
5  False
df.select_dtypes(include=['float64'])
     c
0  1.0
1  2.0
2  1.0
3  2.0
4  1.0
5  2.0
df.select_dtypes(exclude=['int64'])
       b    c
0   True  1.0
1  False  2.0
2   True  1.0
3  False  2.0
4   True  1.0
5  False  2.0
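Applied to the DataFrame from the question, select_dtypes could be used along these lines. This is a sketch rather than a tested recipe for the asker's real data: it assumes the range column has already been converted to an ordered categorical, and the variable names categorical_cols and numeric_cols are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Tweets': ['Tweet 1 from user 1', 'Tweet 2 from user 1',
                              'Tweet 1 from user 3', 'Tweet 10 from user 1'],
                   'ID': [124, 124, 12, 124],
                   'Registration Date': ['2020-12-02', '2020-11-21',
                                         '2020-12-02', '2020-12-02'],
                   'num_unique_words': [41, 42, 12, 69],
                   'photo_profile': [1, 0, 1, 1],
                   'range': ['<5', '<5', '>=5', '<5']},
                  index=['falcon', 'dog', 'spider', 'fish'])
# Assumption: 'range' is an ordered categorical, as the error message suggests
df['range'] = pd.Categorical(df['range'], categories=['<5', '>=5'], ordered=True)

# 'category' selects categorical columns; 'number' selects all numeric dtypes
categorical_cols = df.select_dtypes(include='category').columns.tolist()
numeric_cols = df.select_dtypes(include='number').columns.tolist()

print(categorical_cols)
print(numeric_cols)
```

Unlike the groupby approach, this requires naming each dtype of interest up front, but it avoids passing dtype objects around as grouping keys.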
Upvotes: 1