Claudine U

Reputation: 1

Dynamic approach to identify and standardize similar names automatically in pandas (data cleaning)

I have a DataFrame with a column of publisher names that contains various minor variations of the same publisher. For example, entries such as "Harlequin Romance", "Harlequin Blaze", "Harlequin Superromance", and "Harlequin" all refer to the same publisher, "Harlequin". Similarly, "Hackett Publishing Company Inc.", "Hackett Publ. Co Inc", and "Hackett Publishing Company Inc. (USA)" should be standardized to a single name like "Hackett Publishing Company Inc.".

Given the large size of the DataFrame (over 10,000 rows) and the number of unique publishers (over 100), manually creating a mapping dictionary is impractical. I need a dynamic approach to identify and standardize these similar publisher names automatically.

I was about to try something, but I have no idea how to go about it.

Upvotes: 0

Views: 51

Answers (2)

Justin Zhang

Reputation: 58

I think data cleaning like this can generally be straightforward and rough: a single regex will achieve 95% of your goal.

Suppose df is the DataFrame and the column containing the dirty company names is called "Company". Then try this:

# Remove parenthesised parts and common abbreviations, then append a unified suffix
df['Cleaned_Company_Name'] = (
    df['Company'].str.replace(r'\(.*?\)|\bPubl\w*\.?|\bCompany\b|\bCo\b\.?|\bLtd\b\.?|\bInc\b\.?',
                              '', regex=True)
                 .str.strip()
    + ' Publishing Company Inc'
)

Explanation: remove the substrings you don't want to see using the pd.Series.str.replace method with a regex (the word boundaries keep "Co" from eating part of "Company"), strip the leftover whitespace, then combine what remains with a unified suffix.
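
As a quick, self-contained check on two of the Hackett variants from the question (the "Company" column name is just the assumption carried over from the snippet above):

import pandas as pd

# Two of the dirty names from the question
df = pd.DataFrame({'Company': ['Hackett Publ. Co Inc',
                               'Hackett Publishing Company Inc. (USA)']})

df['Cleaned_Company_Name'] = (
    df['Company'].str.replace(r'\(.*?\)|\bPubl\w*\.?|\bCompany\b|\bCo\b\.?|\bLtd\b\.?|\bInc\b\.?',
                              '', regex=True)
                 .str.strip()
    + ' Publishing Company Inc'
)

print(df['Cleaned_Company_Name'].tolist())
# ['Hackett Publishing Company Inc', 'Hackett Publishing Company Inc']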

Upvotes: 0

joana

Reputation: 83
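
You can treat this as fuzzy matching plus clustering: compute pairwise string similarities with RapidFuzz, cluster the names with HDBSCAN on the corresponding distance matrix, and map every member of a cluster to its most central name (the exemplar):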

import pandas as pd
from rapidfuzz import fuzz
import hdbscan

# Sample DataFrame
data = {
    'publisher': [
        'Harlequin Romance', 'Harlequin Blaze', 'Harlequin Superromance', 'Harlequin',
        'Hackett Publishing Company Inc.', 'Hackett Publ. Co Inc', 'Hackett Publishing Company Inc. (USA)'
    ]
}

df = pd.DataFrame(data)

# Step 1: Compute the similarity matrix using RapidFuzz
def compute_similarity_matrix(names):
    n = len(names)
    similarity_matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            similarity = fuzz.ratio(names[i], names[j])
            similarity_matrix[i][j] = similarity
            similarity_matrix[j][i] = similarity
    return similarity_matrix

publishers = df['publisher'].tolist()
similarity_matrix = compute_similarity_matrix(publishers)

# Step 2: Convert similarity matrix to a distance matrix (1 - similarity)
distance_matrix = 1 - (pd.DataFrame(similarity_matrix) / 100)

# Step 3: Cluster similar names using HDBSCAN
clusterer = hdbscan.HDBSCAN(metric='precomputed', min_cluster_size=2)
cluster_labels = clusterer.fit_predict(distance_matrix)

# Step 4: Create a mapping of publishers to their cluster exemplar
clusters = {}
for label in set(cluster_labels):
    if label != -1:  # Ignore noise points
        cluster_members = [publishers[i] for i in range(len(cluster_labels)) if cluster_labels[i] == label]
        exemplar = max(cluster_members, key=lambda name: sum(fuzz.ratio(name, other) for other in cluster_members))
        for member in cluster_members:
            clusters[member] = exemplar

# Step 5: Apply the mapping to standardize publisher names
# (names labelled as noise are not in the mapping, so fall back to the original name)
df['standardized_publisher'] = df['publisher'].map(clusters).fillna(df['publisher'])

print(df)
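
One note on scale: compute_similarity_matrix is O(n^2) in the number of names, so on the real DataFrame it is worth clustering only the unique publisher names (a little over 100 per the question) rather than all 10,000 rows; the Step 5 mapping then applies to the full column unchanged. A minimal sketch of that change:

# Build the similarity matrix over the unique, non-null names only
publishers = df['publisher'].dropna().unique().tolist()
similarity_matrix = compute_similarity_matrix(publishers)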

Upvotes: 0
