Nils Lcrx

Reputation: 75

Pandas dataframes unique() function produces strange object

So I have a column in my df which is filled with strings (postcodes). I know there are 8147 unique values in my dataset of ~100,000 samples.

I binary encoded the column, which only created 8 columns. That can't be right, because 8 columns can represent at most 2^8 = 256 unique values. It should be 13 columns instead.
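
As a quick check of that arithmetic, here is a minimal sketch, assuming a plain binary encoding in which n distinct values need ceil(log2(n)) columns. The helper name bits_needed is made up for illustration, and real encoder libraries may reserve extra codes for unknown or missing values.

```python
def bits_needed(n_values: int) -> int:
    """Smallest number of binary columns that can tell n_values apart."""
    if n_values < 1:
        raise ValueError("n_values must be at least 1")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 2, using integers only.
    return max(1, (n_values - 1).bit_length())


columns_for_all_postcodes = bits_needed(8147)  # 13, since 2**12 < 8147 <= 2**13
capacity_of_eight_columns = 2 ** 8             # 256 distinct codes at most

print(columns_for_all_postcodes, capacity_of_eight_columns)
```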

I then used this code on my column to find the error:

X_train["Postcode"].unique()

And the result was this strange thing:

['47623', '26506', '41179', '41063', '42283', ..., '01471', '86922', '47624', '86923', '86941']
Length: 143
Categories (8147, object): ['01067', '01069', '01097', '01099', ..., '99991', '99994', '99996', '99998']

type() reveals this is a pandas.core.arrays.categorical.Categorical

But what is going on here? I am confused about what Length means. The output shows the correct number of unique objects (8147), but len() returns 143. The 143 seems to be important, since it appears to be the value the binary encoder uses. But how can something with 8147 objects have a length of 143?

This is also just a column from a df, namely a pandas series. Maybe you can help me.

Upvotes: 1

Views: 307

Answers (1)

sitting_duck

Reputation: 3720

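If I understand your question correctly, it is because the category definition itself has 8147 entries, while the actual series/df column contains only 143 distinct values. Length is the number of unique values actually present; Categories lists every value the dtype allows, whether or not it occurs.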
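
The two numbers can be read off separately. Below is a minimal sketch, assuming a reasonably recent pandas version; the tiny postcode series is a made-up stand-in for X_train["Postcode"], not the asker's data.

```python
import pandas as pd

# Hypothetical stand-in: five declared categories, only three of which occur.
postcode_type = pd.CategoricalDtype(
    categories=["01067", "01069", "01097", "01099", "99998"]
)
postcode = pd.Series(["01069", "01067", "01069", "99998"], dtype=postcode_type)

n_declared = len(postcode.cat.categories)  # size of the category definition
n_observed = postcode.nunique()            # distinct values actually present
uniques = postcode.unique()                # a Categorical of the observed values

print(n_declared, n_observed, len(uniques))
```

In the question's terms, n_declared would correspond to 8147 and n_observed to 143.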

I create my own category to illustrate:

import pandas as pd

label_type = pd.api.types.CategoricalDtype(categories=["yes", "no", "maybe"], ordered=False)
label_type

CategoricalDtype(categories=['yes', 'no', 'maybe'], ordered=False)

s = pd.Series(['yes','yes','yes','no','no'], dtype=label_type)

print(s)

0    yes
1    yes
2    yes
3     no
4     no
dtype: category
Categories (3, object): ['yes', 'no', 'maybe']

s.unique()

['yes', 'no']
Categories (3, object): ['yes', 'no', 'maybe']

This shows two unique entries in the series but still lists three categories: 'maybe' remains a valid category even though it does not occur in the series.
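
If the unused categories are not wanted, one option is to drop them. The following is a minimal sketch using Series.cat.remove_unused_categories(), with a plain string conversion as an alternative. Whether either is appropriate depends on whether downstream code should still know about values absent from this subset, which the question does not say.

```python
import pandas as pd

label_type = pd.CategoricalDtype(categories=["yes", "no", "maybe"], ordered=False)
s = pd.Series(["yes", "yes", "yes", "no", "no"], dtype=label_type)

# Keep the categorical dtype but shrink its definition to the observed values.
trimmed = s.cat.remove_unused_categories()
print(list(trimmed.cat.categories))  # ['yes', 'no']

# Alternative: leave the categorical dtype entirely and work with plain strings.
as_strings = s.astype(str)
print(as_strings.nunique())  # 2
```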

Another way to demonstrate:

s2 = pd.Series(['yes','yes','yes','no','no','maybe'], dtype="category")
s2

0      yes
1      yes
2      yes
3       no
4       no
5    maybe
dtype: category
Categories (3, object): ['maybe', 'no', 'yes']

s2[:-1]

0    yes
1    yes
2    yes
3     no
4     no
dtype: category
Categories (3, object): ['maybe', 'no', 'yes']

s2[:-1].unique()

['yes', 'no']
Categories (3, object): ['maybe', 'no', 'yes']
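
One more way to see the unused category is value_counts(), which on a categorical series also reports declared categories that never occur, with a count of 0. A minimal sketch reusing the s2 example (the variable names subset, counts, and unused are only for illustration):

```python
import pandas as pd

s2 = pd.Series(["yes", "yes", "yes", "no", "no", "maybe"], dtype="category")
subset = s2[:-1]  # drops the only 'maybe' row; the category definition is unchanged

# For a categorical series, every declared category appears in the result.
counts = subset.value_counts()
unused = counts[counts == 0].index.tolist()

print(counts)
print(unused)
```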

Upvotes: 2
