Detecting similar columns across multiple files based on statistical profile

Question

I'm attempting to clean up a set of old files that contain sensor data measurements. Many of the files don't have headers, and the format (column ordering, etc.) is inconsistent. I'm thinking the best that I can do in these cases is to match statistical profiles of the columns to data from files that do have good headers. This seems like it should be simple using something like Pandas Profiling, but I haven't found any examples. I'm looking for something that would calculate a score for the similarity between each column in the header-less file and each "known" column for which I already have headers.

Example Data with Headers:

Large Value Column	Small Value Column
100	5
102	12
110	8
98	10

Example Data with only column numbers:

0	1
6	99
9	105
11	101
14	100

For the above example, I would like to automatically determine that column 1 should be added to "Large Value Column" and Column 0 to "Small Value Column".

Ingwersen_erik · Accepted Answer

As stated by @Arthur Bricq, the best solution depends a lot on the shape of your data. Unfortunately, here's no "one solution fits all approach". Having said that, the following code was able to correctly rename the columns from your example, using the fuzzywuzzy package:

Note: You need to pip install fuzzywuzzy before running the code below. To do so, run the following command:

pip install fuzzywuzzy


from typing import Tuple, List
from fuzzywuzzy import fuzz
import pandas as pd


def find_similar(
    *dfs: pd.DataFrame, known_df: pd.DataFrame
) -> Tuple[List[pd.DataFrame], pd.DataFrame]:
    """
    Name columns in a list of dataframes based on the columns from known dataframe.

    Each of the dataframes in the list will be renamed based on the columns from the
    known dataframe. To avoid cases where two different unnamed columns match the
    same column from the known dataframe, the function matches each of the known columns
    to the unnamed column that contains the highest similarity score and then removes
    the matched columns from the list of possible columns. For example:

    >>> x = pd.DataFrame(
    ...     {'Large Value Column' :[100, 102, 110, 98,]
    ...      'Small Value Column': [5, 12, 8, 10,]}
    ... )
    >>> y = pd.DataFrame(
    ...     {0: [6, 9, 11, 14],
    ...      1: [20, 25, 55, 65]}
    ... )
    >>> pd.DataFrame(
    ...     [
    ...         [
    ...             fuzz.ratio(
    ...                 list(map(str, y[col].values)), list(map(str, x[_col].values))
    ...             ) for _col in x.columns
    ...         ] for col in y.columns
    ...     ],
    ...     index=y.columns,
    ...     columns=x.columns,
    ... )
       Large Value Column  Small Value Column
    0                  73                  77
    1                  71                  74

    In the above example, "Small Value Column" has the highest similarity score for
    both columns "0" and "1". However, since "0" has the highest similarity score
    then, "1" will be renamed to "Large Value Column".

    Parameters
    ----------
    dfs : pd.DataFrame
        List of dataframes to rename.
    known_df : pd.DataFrame
        Pandas dataframe with the correct column names.

    Returns
    -------
    Tuple[List[pd.DataFrame], pd.DataFrame]
        List of dataframes with renamed columns, and the known dataframe.
    """
    # Store the results in a list
    _dfs = []
    # Loop over the dataframes
    for df in dfs:
        # Create a dataframe that contains the similarities between the columns
        # in the known dataframe and the columns in the current dataframe.
        # The columns represent the known dataframe columns, and the index
        # represents the current dataframe columns.
        similarity_df = pd.DataFrame(
            [
                [
                    fuzz.ratio(
                        list(map(str, df[col].values)),
                        list(map(str, known_df[_col].values)),
                    )
                    for _col in known_df.columns
                ]
                for col in df.columns
            ],
            index=df.columns,
            columns=known_df.columns,
        )
        # Dictionary to map old and new column names
        rename_dict = {}

        # Keep track of the columns that have already been matched
        # to avoid naming two columns using the same name.
        used_names = []
        for col in similarity_df.max().sort_values(ascending=False).keys():
            old_name = similarity_df[~similarity_df.index.isin(used_names)][
                col
            ].idxmax()
            rename_dict[old_name] = col
            used_names.append(old_name)
        # Rename the columns in the current dataframe and append it to the list
        _dfs.append(df.rename(columns=rename_dict, errors="ignore"))
    return _dfs, known_df

Example

import pandas as pd


x = pd.DataFrame(
    {
        "Large Value Column": [100, 102, 110, 98],
        "Small Value Column": [5, 12, 8, 10],
    }
)
y = pd.DataFrame({0: [6, 9, 11, 14], 1: [20, 40, 55, 65]})
z = pd.DataFrame({0: [95, 80, 72, 100], 1: [0, 20, 14, 10]})

pd.concat(find_similar(y, z, known_df=x)[0])

# Returns:
#
#    Small Value Column  Large Value Column
# 0                   6                  20
# 1                   9                  40
# 2                  11                  55
# 3                  14                  65
# 0                   0                  95
# 1                  20                  80
# 2                  14                  72
# 3                  10                 100

Detecting similar columns across multiple files based on statistical profile

Answers (2)

Example

Related Questions