Reputation: 75
I have a column in my DataFrame that is filled with strings (postcodes).
I know there are 8147 unique values in my dataset of ~100,000 samples.
I binary-encoded the column, but that only created 8 columns, which can't be right: 8 bits can represent only 2^8 = 256 distinct values. With 8147 unique values it should be 13 columns (2^13 = 8192).
I then ran this on the column to track down the error:
X_train["Postcode"].unique()
And the result was this strange thing:
['47623', '26506', '41179', '41063', '42283', ..., '01471', '86922', '47624', '86923', '86941']
Length: 143
Categories (8147, object): ['01067', '01069', '01097', '01099', ..., '99991', '99994', '99996', '99998']
Calling type() on the result reveals this is a pandas.core.arrays.categorical.Categorical.
But what is going on here? I am confused about what Length means. The output shows the correct number of unique objects (8147), but len() returns 143 again. The 143 seems to be important, since it appears to be the value the binary encoder uses. But how can something with 8147 objects have a length of 143?
This is just a column from a DataFrame, i.e. a pandas Series. Maybe you can help me.
Upvotes: 1
Views: 307
Reputation: 3720
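If I understand your question correctly: the category definition (the dtype) has 8147 entries, but only 143 of those categories actually occur in the column. unique() returns the values that are present, so its length is 143, while the Categories line lists every category the dtype declares, whether or not it is used.
I'll create my own categorical dtype to illustrate:
import pandas as pd
label_type = pd.api.types.CategoricalDtype(categories=["yes", "no", "maybe"], ordered=False)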
label_type
CategoricalDtype(categories=['yes', 'no', 'maybe'], ordered=False)
s = pd.Series(['yes','yes','yes','no','no'], dtype=label_type)
print(s)
0    yes
1    yes
2    yes
3     no
4     no
dtype: category
Categories (3, object): ['yes', 'no', 'maybe']
s.unique()
['yes', 'no']
Categories (3, object): ['yes', 'no', 'maybe']
This shows two unique entries in the series but still three categories: 'maybe' never appears in the series, yet it remains a valid category of the dtype.
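To check the two counts separately, here is a minimal sketch that reuses the yes/no/maybe series from above. The variable names (n_declared, n_observed, trimmed) are just for illustration. It compares the number of declared categories with the number of values actually present, and then drops the unused ones with cat.remove_unused_categories():

```python
import pandas as pd

label_type = pd.api.types.CategoricalDtype(
    categories=["yes", "no", "maybe"], ordered=False
)
s = pd.Series(["yes", "yes", "yes", "no", "no"], dtype=label_type)

# Categories declared by the dtype, used or not
n_declared = len(s.cat.categories)

# Distinct values that actually occur in the series
n_observed = s.nunique()

# Same values, but the dtype now lists only the categories in use
trimmed = s.cat.remove_unused_categories()

print(n_declared, n_observed, len(trimmed.cat.categories))
```

If the two counts differ for your Postcode column, the column was probably cut from a larger categorical column (for example by a train/test split), which keeps the full category list.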
Another way to demonstrate, this time by slicing off the only 'maybe' row:
s2 = pd.Series(['yes','yes','yes','no','no','maybe'], dtype="category")
s2
0      yes
1      yes
2      yes
3       no
4       no
5    maybe
dtype: category
Categories (3, object): ['maybe', 'no', 'yes']
s2[:-1]
0    yes
1    yes
2    yes
3     no
4     no
dtype: category
Categories (3, object): ['maybe', 'no', 'yes']
s2[:-1].unique()
['yes', 'no']
Categories (3, object): ['maybe', 'no', 'yes']
Upvotes: 2