Reputation: 21
I'm working through a beginner's ML code, and in order to count the number of unique samples in a column, the author uses this code:
def unique_vals(rows, col):
    """Find the unique values for a column in a dataset."""
    return set([row[col] for row in rows])
However, I am working with a DataFrame, and for me this code returns single letters: 'm', 'l', etc. I tried altering it to:
set(row[row[col] for row in rows)
But then it returns:
KeyError: "None of [Index(['Apple', 'Banana', 'Grape' dtype='object', length=2318)] are in the [columns]"
Thanks for your time!
Upvotes: 1
Views: 10437
Reputation: 4580
If you are working with categorical columns, the following code is very useful.
It prints not only the unique values but also the count of each unique value.
categorical_columns = ['col1', 'col2', 'col3', ..., 'coln']
# Print the frequency of each category (df is your DataFrame)
for col in categorical_columns:
    print('\nFrequency of Categories for variable %s' % col)
    print(df[col].value_counts())
Example:
df
     pets   location     owner
0     cat  San_Diego     Champ
1     dog   New_York       Ron
2     cat   New_York     Brick
3  monkey  San_Diego     Champ
4     dog  San_Diego  Veronica
5     dog   New_York       Ron
categorical_columns = ['pets', 'owner', 'location']
# Print the frequency of each category
for col in categorical_columns:
    print('\nFrequency of Categories for variable %s' % col)
    print(df[col].value_counts())
Output:
# Frequency of Categories for variable pets
# dog       3
# cat       2
# monkey    1
# Name: pets, dtype: int64
# Frequency of Categories for variable owner
# Champ       2
# Ron         2
# Brick       1
# Veronica    1
# Name: owner, dtype: int64
# Frequency of Categories for variable location
# New_York     3
# San_Diego    3
# Name: location, dtype: int64
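If you only need the unique values or how many there are, both can be read off the value_counts result. A minimal sketch, rebuilding the example frame above so it runs on its own (the variable names pet_counts, unique_pets and n_unique_pets are just for illustration):

```python
import pandas as pd

# The example frame from above, rebuilt so the snippet is self-contained.
df = pd.DataFrame({
    "pets": ["cat", "dog", "cat", "monkey", "dog", "dog"],
    "location": ["San_Diego", "New_York", "New_York",
                 "San_Diego", "San_Diego", "New_York"],
    "owner": ["Champ", "Ron", "Brick", "Champ", "Veronica", "Ron"],
})

# value_counts sorts by frequency, most common value first.
pet_counts = df["pets"].value_counts()

# The index of the result holds the unique values themselves.
unique_pets = list(pet_counts.index)

# Its length is the number of unique values.
n_unique_pets = len(pet_counts)

print(unique_pets, n_unique_pets)
```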
Upvotes: 3
Reputation: 19885
In general, you don't need to do such things yourself because pandas already does them for you.
In this case, what you want is the unique method, which you can call on a Series directly (pd.Series is the abstraction that represents, among other things, columns), and which returns a numpy array containing the unique values in that Series.
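To make this concrete, here is a minimal sketch on a small made-up DataFrame (the column names and values are hypothetical, not taken from the question). It also shows what presumably happened with the tutorial's helper: iterating over a DataFrame yields its column labels rather than its rows, so indexing each label with an integer returns single characters.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({
    "fruit": ["Apple", "Banana", "Apple", "Grape"],
    "color": ["red", "yellow", "green", "purple"],
})


def unique_vals(rows, col):
    """The helper from the question, unchanged."""
    return set([row[col] for row in rows])


# Iterating over a DataFrame yields column labels ("fruit", "color"),
# so row[0] is the first character of each label, not a cell value.
letters = unique_vals(df, 0)

# Series.unique returns the distinct values in order of first appearance.
fruits = df["fruit"].unique()

# Series.nunique returns only the number of distinct values.
n_fruits = df["fruit"].nunique()

print(letters, list(fruits), n_fruits)
```

If the rest of the tutorial expects a set, wrapping the result as set(df[col].unique()) keeps the same interface.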
If you want the unique values for multiple columns, you can do something like this:
which_columns = ... # specify the columns whose unique values you want here
uniques = {col: df[col].unique() for col in which_columns}
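A usage sketch for the comprehension above, assuming a small hypothetical frame (the columns city and product are made up for the example). If you only need the counts, DataFrame.nunique gives them for several columns at once.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({
    "city": ["Oslo", "Lima", "Oslo", "Lima", "Pune"],
    "product": ["pen", "pen", "ink", "pen", "ink"],
    "price": [1.5, 1.5, 4.0, 1.6, 4.2],
})

which_columns = ["city", "product"]

# One array of unique values per selected column.
uniques = {col: df[col].unique() for col in which_columns}

# Number of unique values per selected column, as a Series indexed by column name.
counts = df[which_columns].nunique()

for col in which_columns:
    print(col, list(uniques[col]), counts[col])
```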
Upvotes: 5