compare 2 dataframes, loop through columns, calculate similarity on matching index

Question

I have two nearly identical dataframes (A & B). All the columns in A & B have the same unique identifier in the exact same order.

The index for each DF (A & B) are nearly similar (Produce, deli, dairy etc..). Column A contains a full index while B contains a subset of A but might also at times match A completely.

I already set up a loop to compare each column at the same time. I am having trouble with grabbing the values from each index in every column.

Sample Dataframes:

import numpy as np
import pandas as pd

nan = np.nan

# dfA
dfA = pd.DataFrame(
    np.array(
        [
            [1, 2, nan, nan], 
            [1, nan, 1, 2], 
            [nan,5,6, 1], 
            [4, 3, 3, nan],
            [nan, 1, nan, 2],
            [2, 1, nan, nan],
            [1, 3, 1, 2],
            [nan, nan, nan, 3],
            [3, nan, 1, 2],
            [2, nan, 2, nan],
        ]
    ),
    columns=['165413658', '546817846', '685413554', '54684114'],
    index=['dairy', 'produce', 'hot_deli', 'bakery', 'cold_deli', 'freshmeat', 'flowers', 'fruit', 'beer', 'wine'])

# dfB
dfB = pd.DataFrame(
    np.array(
        [
            [1, 2, nan, 1], 
            [nan, 4, 1, nan], 
            [4, nan, nan, 1], 
            [4, 3, 4, 2],
            [nan, 1, 2, 1],
            [1, nan, nan, 2],
            [nan, 3, 2, nan],
            [1, nan, 1, nan], 
        ]
    ),
    columns=['165413658', '546817846', '685413554', '54684114'],
    index=['dairy', 'bakery', 'wine', 'produce', 'hot_deli', 'cold_deli', 'flowers', 'beer']
)

Heres my code so far:

utility_rate = []
for index, (userid1, userid2) in enumerate(zip(dfA, dfB)):
        act_count_dept = dfA[userid1]
        utility_rate.append(act_count_dept)

this prints out a list of all my data but I'm not sure how to effectively grab the data to compare the columns in dfA and dfB.

Basically, I want to take the nth column of dfA and count the data in the row indices. Then, grab the same nth column of dfB and count count the data in those row indices. Finally, divide the values of dfB / dfA, where they match, and compute the total matches for each index (bakery, produce, deli etc..) This is part of my data validation where dfA is the real data and dfB is my predicted data which Im trying to figure out whether it is predicting similarly to dfA.

desired output:

# dfC
dfC = pd.DataFrame(
    np.array(
        [
            [1, 1, nan, nan], 
            [0.25, nan, 0.25, 1], 
            [nan, 0.2, 0.33, 1], 
            [3, nan, 0.25, nan],
            [nan, nan, nan, 1],
            [nan, nan, nan, nan],
            [nan, 3, 0.5, nan],
            [nan, nan, nan, 3],
            [3, nan, nan, nan],
            [0.5, nan, nan, nan],
        ]
    ),
    columns=['165413658', '546817846', '685413554', '54684114'],
    index=['dairy', 'produce', 'fruit', 'bakery', 'cold_deli', 'freshmeat', 'flowers', 'hot_deli', 'beer', 'wine'])

Youll probably tell by the desired output that some indices from dfA will not be found in dfB and vice versa. Ideally Id like to store that excess data in another DF but that is out of scope for this question.

I think the code will go something like:

utility_rate = []
for index, (userid1, userid2) in enumerate(zip(dfA, dfB)):
        act_count_dept = dfA[userid1] # grab indices and its values
        # grab indices and its values for dfB
        # compare: if dfA index val == dfB index val:
        #                  final_calc= dfB.index / dfA.index
        #           elif no match:
        #              dump into other dfB

        utility_rate.append(final_calc)

SunilG · Accepted Answer

There is a compare() in pandas 1.1.0. You can use it to find the total matching percentage in both dataframes. I haven't tested but the below code should give you a dataframe with equal values. Once you get that dataframe you can calculate the percentage using the original dataframe shape and new dataframe shape.

dfA.compare(dfB,align_axis=1,keep_shape=False,keep_equal=True)

Hope this at least gives you a pointer. Refer this for more ideas:

compare 2 dataframes, loop through columns, calculate similarity on matching index

Answers (1)

Related Questions