Reputation: 187
Given a pd.DataFrame containing several columns of categorical variables, what is the most efficient way to identify a subset of those columns that uniquely identifies the rows of the pd.DataFrame (assuming such a subset exists)?
In many cases there may already be a unique index. E.g., the column 'ID' below:
Otherwise several columns must be combined to form a unique identifier. E.g. the columns ['Name', 'Level'] below:
Upvotes: 2
Views: 283
Reputation: 13518
You could try this:
from itertools import combinations
import pandas as pd
# Toy dataframe
df = pd.DataFrame(
{
"name": ["Georgia", "Georgia", "Florida"],
"level 1": ["value", "other value", "other value"],
"level 2": ["Sub-country", "Sub-country", "Country"],
"level 3": ["Sub-country", "Country", "Sub-country"],
"value 1": ["a", "b", "c"],
"value 2": ["d", "e", "f"],
}
)
print(df)
name level 1 level 2 level 3 value 1 value 2
0 Georgia value Sub-country Sub-country a d
1 Georgia other value Sub-country Country b e
2 Florida other value Country Sub-country c f
# Setup
unique_identifiers = []
target_cols = ["name", "level 1", "level 2", "level 3"]
# Identify all combinations
combined_cols = [list(combinations(target_cols, i)) for i in range(1, len(target_cols))]
combined_cols = [list(cols) for item in combined_cols for cols in item]
# Identify unique identifiers
for cols in combined_cols:
if df.loc[:, cols].duplicated().sum():
continue
else:
unique_identifiers.append(cols)
# Get all the smallest unique identifiers (in case more than one)
smallest = len(min(unique_identifiers))
if smallest:
unique_identifiers = [item for item in unique_identifiers if len(item) == smallest]
print(unique_identifiers)
# Outputs
[
["name", "level 1"],
["name", "level 3"],
["level 1", "level 2"],
["level 1", "level 3"],
["level 2", "level 3"],
]
Upvotes: 1