Reputation: 109
I have the following code to find the columns in a data frame with the lowest number of distinct values and list them.
import pandas as pd
df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4],"C":[1,1,2],"D":[3,3,4]})
print(df)
unique_counts = df.nunique()
lowest_distinct = 100
#
#Find the lowest distinct count across all columns
#
for column_name, distinct_count in unique_counts.iteritems():
if distinct_count < lowest_distinct:
lowest_distinct = distinct_count
lowest_distinct_columns = []
#
#Collect the columns having that count
#
for column_name, distinct_count in unique_counts.iteritems():
if distinct_count == lowest_distinct:
lowest_distinct_columns.append(column_name)
#
#Get the columns and values returned as a data frame
#
melted_df = df.melt(value_vars=lowest_distinct_columns,var_name='column', value_name='value')
print(melted_df)
It feels a bit clunky so I'm wondering if there is a better way to do it? Ultimately I'm trying to get a list of the columns and values that have the lowest number of distinct values.
Any thoughts or tips appreciated.
Cheers
David
Upvotes: 1
Views: 1010
Reputation: 1
Use
df1 = pd.DataFrame({"A": [1,2,3], "B": [2,3,4],"C":[1,1,2],"D":[3,3,4]})
print(df1)
unique_counts = df1.nunique()
A B C D
0 1 2 1 3
1 2 3 1 3
2 3 4 2 4
unique_counts[unique_counts==unique_counts.min()]
C 2
D 2
dtype: int64
Upvotes: 0
Reputation: 109
Thank you for the responses. The 3 solutions to the first part of the problem work equally well and the 2 responses to the second part of the problem also work very well.
I'll need to use them in practice to see if there is any material difference in performance or behaviour but to summarise the complete solutions:
@Parfait's solution:
unique_ser = df.apply(lambda col: col.nunique(), axis=0)
print(unique_ser)
# A 3
# B 3
# C 2
# D 2
lowest_unique_ser = unique_ser[unique_ser == unique_ser.min()]
print(lowest_unique_ser)
# C 2
# D 2
final_ser = df[lowest_unique_ser.index].apply(lambda col: col.unique().tolist(), axis=0)
print(final_ser)
# C (1, 2)
# D (3, 4)
and @Priker's
unique_counts = df.nunique()
lowest_distinct = unique_counts.min()
lowest_distinct_columns = unique_counts[unique_counts ==
lowest_distinct].index.tolist()
result = pd.DataFrame({col: df[col].unique() for col in lowest_distinct_columns})
Upvotes: 0
Reputation: 107687
For older versions of pandas (< v.20), consider apply
to return a series:
unique_ser = df.apply(lambda col: col.nunique(), axis=0)
print(unique_ser)
# A 3
# B 3
# C 2
# D 2
lowest_unique_ser = unique_ser[unique_ser == unique_ser.min()]
print(lowest_unique_ser)
# C 2
# D 2
final_ser = df[lowest_unique_ser.index].apply(lambda col: col.unique().tolist(), axis=0)
print(final_ser)
# C (1, 2)
# D (3, 4)
Upvotes: 1
Reputation: 958
Does it do what you want:
unique_counts = df.nunique()
lowest_distinct = unique_counts.min()
lowest_distinct_columns = unique_counts[unique_counts == lowest_distinct].index.tolist()
result = pd.DataFrame({col: df[col].unique() for col in lowest_distinct_columns})
Upvotes: 1
Reputation: 76947
Use
In [114]: df[unique_count[unique_count == unique_count.min()].index].melt(
var_name='column', value_name='value')
Out[114]:
column value
0 C 1
1 C 1
2 C 2
3 D 3
4 D 3
5 D 4
Upvotes: 1