Ryan Gross
Ryan Gross

Reputation: 6515

Detecting similar columns across multiple files based on statistical profile

I'm attempting to clean up a set of old files that contain sensor data measurements. Many of the files don't have headers, and the format (column ordering, etc.) is inconsistent. I'm thinking the best that I can do in these cases is to match statistical profiles of the columns to data from files that do have good headers. This seems like it should be simple using something like Pandas Profiling, but I haven't found any examples. I'm looking for something that would calculate a score for the similarity between each column in the header-less file and each "known" column for which I already have headers.

Example Data with Headers:

Large Value Column Small Value Column
100 5
102 12
110 8
98 10

Example Data with only column numbers:

0 1
6 99
9 105
11 101
14 100

For the above example, I would like to automatically determine that column 1 should be added to "Large Value Column" and Column 0 to "Small Value Column".

Upvotes: 0

Views: 349

Answers (2)

Ingwersen_erik
Ingwersen_erik

Reputation: 2263

As stated by @Arthur Bricq, the best solution depends a lot on the shape of your data. Unfortunately, here's no "one solution fits all approach". Having said that, the following code was able to correctly rename the columns from your example, using the fuzzywuzzy package:

Note: You need to pip install fuzzywuzzy before running the code below. To do so, run the following command:

pip install fuzzywuzzy


from typing import Tuple, List
from fuzzywuzzy import fuzz
import pandas as pd


def find_similar(
    *dfs: pd.DataFrame, known_df: pd.DataFrame
) -> Tuple[List[pd.DataFrame], pd.DataFrame]:
    """
    Name columns in a list of dataframes based on the columns from known dataframe.

    Each of the dataframes in the list will be renamed based on the columns from the
    known dataframe. To avoid cases where two different unnamed columns match the
    same column from the known dataframe, the function matches each of the known columns
    to the unnamed column that contains the highest similarity score and then removes
    the matched columns from the list of possible columns. For example:

    >>> x = pd.DataFrame(
    ...     {'Large Value Column' :[100, 102, 110, 98,]
    ...      'Small Value Column': [5, 12, 8, 10,]}
    ... )
    >>> y = pd.DataFrame(
    ...     {0: [6, 9, 11, 14],
    ...      1: [20, 25, 55, 65]}
    ... )
    >>> pd.DataFrame(
    ...     [
    ...         [
    ...             fuzz.ratio(
    ...                 list(map(str, y[col].values)), list(map(str, x[_col].values))
    ...             ) for _col in x.columns
    ...         ] for col in y.columns
    ...     ],
    ...     index=y.columns,
    ...     columns=x.columns,
    ... )
       Large Value Column  Small Value Column
    0                  73                  77
    1                  71                  74

    In the above example, "Small Value Column" has the highest similarity score for
    both columns "0" and "1". However, since "0" has the highest similarity score
    then, "1" will be renamed to "Large Value Column".

    Parameters
    ----------
    dfs : pd.DataFrame
        List of dataframes to rename.
    known_df : pd.DataFrame
        Pandas dataframe with the correct column names.

    Returns
    -------
    Tuple[List[pd.DataFrame], pd.DataFrame]
        List of dataframes with renamed columns, and the known dataframe.
    """
    # Store the results in a list
    _dfs = []
    # Loop over the dataframes
    for df in dfs:
        # Create a dataframe that contains the similarities between the columns
        # in the known dataframe and the columns in the current dataframe.
        # The columns represent the known dataframe columns, and the index
        # represents the current dataframe columns.
        similarity_df = pd.DataFrame(
            [
                [
                    fuzz.ratio(
                        list(map(str, df[col].values)),
                        list(map(str, known_df[_col].values)),
                    )
                    for _col in known_df.columns
                ]
                for col in df.columns
            ],
            index=df.columns,
            columns=known_df.columns,
        )
        # Dictionary to map old and new column names
        rename_dict = {}

        # Keep track of the columns that have already been matched
        # to avoid naming two columns using the same name.
        used_names = []
        for col in similarity_df.max().sort_values(ascending=False).keys():
            old_name = similarity_df[~similarity_df.index.isin(used_names)][
                col
            ].idxmax()
            rename_dict[old_name] = col
            used_names.append(old_name)
        # Rename the columns in the current dataframe and append it to the list
        _dfs.append(df.rename(columns=rename_dict, errors="ignore"))
    return _dfs, known_df

Example

import pandas as pd


x = pd.DataFrame(
    {
        "Large Value Column": [100, 102, 110, 98],
        "Small Value Column": [5, 12, 8, 10],
    }
)
y = pd.DataFrame({0: [6, 9, 11, 14], 1: [20, 40, 55, 65]})
z = pd.DataFrame({0: [95, 80, 72, 100], 1: [0, 20, 14, 10]})

pd.concat(find_similar(y, z, known_df=x)[0])

# Returns:
#
#    Small Value Column  Large Value Column
# 0                   6                  20
# 1                   9                  40
# 2                  11                  55
# 3                  14                  65
# 0                   0                  95
# 1                  20                  80
# 2                  14                  72
# 3                  10                 100

enter image description here

Upvotes: 1

Arthur Bricq
Arthur Bricq

Reputation: 386

That's an interesting question.

The solution that will work heavily depends on the data you're working with. But here's my thoughts, assuming that your data is similar to the little exampled above, just longer.

The first step would be to create a descriptor that takes as input a row and returns a low-dimension vector that describes this row. For instance: the mean, the standard deviation, the median, ... This should be designed according to what actually describes your data. Then, for each column of your dataset, you're going to have a descriptor vector and (eventually) a label (ie. the column's header), that may or not be known.

The next step can either be classification or clustering, depending on how much data you have and the quality of your descriptors. Look up Scikit documentation about how to do so. Their API is very easy to use and comes with many examples. If you know the number of real headers, you could try to use K-means and I think it would work really well. Maybe KNN can also work. You probably want to look for a sorting algorithm quite 'simple' at first (it's easier to tune), and if nothing simple works then you could try more fancy approaches.

I hope this helps you.

Upvotes: 1

Related Questions