Reputation: 109

Pandas DataFrame - list columns with lowest distinct values

I have the following code to find the columns in a data frame with the lowest number of distinct values and list them.

import pandas as pd

df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4],"C":[1,1,2],"D":[3,3,4]})
print(df)
unique_counts = df.nunique()

lowest_distinct = 100
#
#Find the lowest distinct count across all columns
#
for column_name, distinct_count in unique_counts.iteritems():

    if distinct_count < lowest_distinct:
        lowest_distinct = distinct_count

lowest_distinct_columns = []
#
#Collect the columns having that count
#
for column_name, distinct_count in unique_counts.iteritems():

    if distinct_count == lowest_distinct:
        lowest_distinct_columns.append(column_name)

#
#Get the columns and values returned as a data frame
#
melted_df = df.melt(value_vars=lowest_distinct_columns,var_name='column', value_name='value')

print(melted_df)

It feels a bit clunky so I'm wondering if there is a better way to do it? Ultimately I'm trying to get a list of the columns and values that have the lowest number of distinct values.

Any thoughts or tips appreciated.

Cheers

David

Upvotes: 1

Answers (5)

Ramraj V

Reputation: 1

Use

df1 = pd.DataFrame({"A": [1,2,3], "B": [2,3,4],"C":[1,1,2],"D":[3,3,4]})
print(df1)
unique_counts = df1.nunique()

   A  B  C  D
0  1  2  1  3
1  2  3  1  3
2  3  4  2  4

unique_counts[unique_counts==unique_counts.min()]

C    2
D    2
dtype: int64

Upvotes: 0

Bravid

Reputation: 109

Thank you for the responses. The 3 solutions to the first part of the problem work equally well and the 2 responses to the second part of the problem also work very well.

I'll need to use them in practice to see if there is any material difference in performance or behaviour but to summarise the complete solutions:

@Parfait's solution:

unique_ser = df.apply(lambda col: col.nunique(), axis=0)
print(unique_ser)
# A    3
# B    3
# C    2
# D    2

lowest_unique_ser = unique_ser[unique_ser == unique_ser.min()]
print(lowest_unique_ser)
# C    2
# D    2

final_ser = df[lowest_unique_ser.index].apply(lambda col: col.unique().tolist(), axis=0)
print(final_ser)
# C    (1, 2)
# D    (3, 4)

and @Priker's

unique_counts = df.nunique()
lowest_distinct = unique_counts.min()
lowest_distinct_columns = unique_counts[unique_counts == 
lowest_distinct].index.tolist()

result = pd.DataFrame({col: df[col].unique() for col in lowest_distinct_columns})

Upvotes: 0

Parfait

Reputation: 107687

For older versions of pandas (< v.20), consider apply to return a series:

unique_ser = df.apply(lambda col: col.nunique(), axis=0)
print(unique_ser)
# A    3
# B    3
# C    2
# D    2

lowest_unique_ser = unique_ser[unique_ser == unique_ser.min()]
print(lowest_unique_ser)
# C    2
# D    2

final_ser = df[lowest_unique_ser.index].apply(lambda col: col.unique().tolist(), axis=0)
print(final_ser)
# C    (1, 2)
# D    (3, 4)

Upvotes: 1

Prikers

Reputation: 958

Does it do what you want:

unique_counts = df.nunique()
lowest_distinct = unique_counts.min()
lowest_distinct_columns = unique_counts[unique_counts == lowest_distinct].index.tolist()

result = pd.DataFrame({col: df[col].unique() for col in lowest_distinct_columns})

Upvotes: 1

Zero

Reputation: 76947

Use

In [114]: df[unique_count[unique_count == unique_count.min()].index].melt(
                var_name='column', value_name='value')
Out[114]:
  column  value
0      C      1
1      C      1
2      C      2
3      D      3
4      D      3
5      D      4

Upvotes: 1

Pandas DataFrame - list columns with lowest distinct values

Answers (5)

Related Questions