Serene

Reputation: 21

For Loop to Return Unique Values in DataFrame

I'm working through a beginner's ML code, and in order to count the number of unique samples in a column, the author uses this code:

def unique_vals(rows, col):
    """Find the unique values for a column in a dataset."""
    return set([row[col] for row in rows])

I am working with a DataFrame, however, and for me this code returns single letters: 'm', 'l', etc. I tried altering it to:

set(row[row[col] for row in rows)

But then it returns:

KeyError: "None of [Index(['Apple', 'Banana', 'Grape'   dtype='object', length=2318)] are in the [columns]"

Thanks for your time!

Upvotes: 1

Views: 10437

Answers (2)

Suhas_Pote

Reputation: 4580

If you are working with categorical columns, the following code is very useful.

It prints not only the unique values but also the count of each unique value.

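categorical_columns = ['col1', 'col2', 'col3', ..., 'coln']  # placeholder: list your own column names

# Print the frequency of each category, one column at a time (df is your DataFrame)
for col in categorical_columns:
    print('\nFrequency of Categories for variable %s' % col)
    print(df[col].value_counts())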

Example:

df

     pets     location     owner
0     cat    San_Diego     Champ
1     dog     New_York       Ron
2     cat     New_York     Brick
3  monkey    San_Diego     Champ
4     dog    San_Diego  Veronica
5     dog     New_York       Ron


categorical_columns = ['pets', 'owner', 'location']
# Print the frequency of each category, one column at a time
for col in categorical_columns:
    print('\nFrequency of Categories for variable %s' % col)
    print(df[col].value_counts())

Output:

# Frequency of Categories for variable pets
# dog       3
# cat       2
# monkey    1
# Name: pets, dtype: int64

# Frequency of Categories for variable owner
# Champ       2
# Ron         2
# Brick       1
# Veronica    1
# Name: owner, dtype: int64

# Frequency of Categories for variable location
# New_York     3
# San_Diego    3
# Name: location, dtype: int64

Upvotes: 3

gmds

Reputation: 19885

In general, you don't need to do such things yourself because pandas already does them for you.

In this case, what you want is the unique method. You can call it directly on a Series (pd.Series is the abstraction that represents, among other things, a single column), and it returns a NumPy array containing the unique values in that Series.
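
As a minimal sketch of the difference, assuming a small made-up DataFrame (the column names and values below are illustrative, not taken from the question): iterating over a DataFrame yields its column labels rather than its rows, which is most likely why the original helper returns single characters, whereas unique operates on the column itself.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({
    "fruit": ["Apple", "Banana", "Apple", "Grape"],
    "size": ["m", "l", "m", "s"],
})

# Iterating over a DataFrame yields its column labels (strings), not rows,
# so row[0] indexes into each label and gives back a single character.
letters = set(row[0] for row in df)

# Series.unique returns a NumPy array of the distinct values,
# in order of first appearance.
unique_fruit = df["fruit"].unique()

print(letters)
print(unique_fruit)
```

If you only need the number of distinct values, len(unique_fruit) gives it.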

If you want the unique values for multiple columns, you can do something like this:

which_columns = ... # specify the columns whose unique values you want here

uniques = {col: df[col].unique() for col in which_columns}
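
For example, with a hypothetical DataFrame and column list (both invented here for illustration), the dictionary comprehension above could be used roughly like this:

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({
    "pets": ["cat", "dog", "cat", "monkey"],
    "location": ["San_Diego", "New_York", "New_York", "San_Diego"],
})

# Assumed column names; replace with the columns you care about.
which_columns = ["pets", "location"]

# Map each column name to a NumPy array of its distinct values.
uniques = {col: df[col].unique() for col in which_columns}

for col, values in uniques.items():
    # Print the column name, its distinct values, and how many there are.
    print(col, list(values), len(values))
```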

Upvotes: 5
