Rations
Rations

Reputation: 187

Identify the smallest subset of columns that uniquely identifies the rows of a pandas DataFrame

Given a pd.DataFrame containing several columns of categorical variables, what is the most efficient way to identify a subset of those columns that uniquely identifies the rows of the pd.DataFrame (assuming such a subset exists)?

In many cases there may already be a unique index. E.g., the column 'ID' below:

enter image description here

Otherwise several columns must be combined to form a unique identifier. E.g. the columns ['Name', 'Level'] below:

enter image description here

Upvotes: 2

Views: 283

Answers (1)

Laurent
Laurent

Reputation: 13518

You could try this:

from itertools import combinations

import pandas as pd

# Toy dataframe
df = pd.DataFrame(
    {
        "name": ["Georgia", "Georgia", "Florida"],
        "level 1": ["value", "other value", "other value"],
        "level 2": ["Sub-country", "Sub-country", "Country"],
        "level 3": ["Sub-country", "Country", "Sub-country"],
        "value 1": ["a", "b", "c"],
        "value 2": ["d", "e", "f"],
    }
)
print(df)
      name      level 1      level 2      level 3 value 1 value 2
0  Georgia        value  Sub-country  Sub-country       a       d
1  Georgia  other value  Sub-country      Country       b       e
2  Florida  other value      Country  Sub-country       c       f

# Setup
unique_identifiers = []
target_cols = ["name", "level 1", "level 2", "level 3"]

# Identify all combinations
combined_cols = [list(combinations(target_cols, i)) for i in range(1, len(target_cols))]
combined_cols = [list(cols) for item in combined_cols for cols in item]

# Identify unique identifiers
for cols in combined_cols:
    if df.loc[:, cols].duplicated().sum():
        continue
    else:
        unique_identifiers.append(cols)

# Get all the smallest unique identifiers (in case more than one)
smallest = len(min(unique_identifiers))
if smallest:
    unique_identifiers = [item for item in unique_identifiers if len(item) == smallest]

print(unique_identifiers)
# Outputs
[
    ["name", "level 1"],
    ["name", "level 3"],
    ["level 1", "level 2"],
    ["level 1", "level 3"],
    ["level 2", "level 3"],
]

Upvotes: 1

Related Questions