Wiliam
Wiliam

Reputation: 1088

Keep track of elements of a dataframe

This question is the extension of this one: How to keep track of element from another dataframe in pandas. Suppose I have the following dataframe:

df = pd.DataFrame({"call 1": ['debit card','bond',np.nan],
                  "call 2": ['debit card','mortgage','spending limit'],
                  "call 3":['payment limit','debit card',np.nan]})

which is:

       call 1          call 2         call 3
0  debit card      debit card  payment limit
1        bond        mortgage     debit card
2         NaN  spending limit            NaN

I further do some clustering and I get the next dataframe:

dfc = pd.DataFrame( {'cluster 1': ['payment limit', 'spending limit',np.nan],
 'cluster 2': ['debit card', 'debit card', 'debit card'],
 'cluster 3': [ 'bond', 'mortgage',np.nan]})

which is

        cluster 1   cluster 2 cluster 3
0   payment limit  debit card      bond
1  spending limit  debit card  mortgage
2             NaN  debit card       NaN

Now I want to keep track of the elements of the clustering table. Namely, I want to know which elements comes from which call, the desired result should look something like:

    cluster 1                cluster 2                 cluster 3
0   {call 3, payment limit}  {call 1, debit card}      {call 1, bond}
1  {call 2, pending limit}  {call 2, debit card}    {call 2, mortgage}
2             NaN           {call 3, debit card}         NaN

Using the methods in the previous question would fail, firstly because in many calls (i.e. 1,2 and 3) the elements are not distinct. Secondly,

dfc.applymap(lambda x: df[df.eq(x)].dropna(how='all').dropna(axis=1).to_dict('records')[0])

would work if we had 2 debit cards, for anything above 2 I get IndexError: list index out of range simply because the method above tries to make a dictionary.

Update: Using

lookup_dict = {}
look_df = df.T

for col in look_df.columns:
    lookup_dict.update(dict(zip(look_df[col], look_df.index)))
    
pd.concat([dfc.replace(lookup_dict), dfc]).astype(str).groupby(level=0).agg(lambda x: dict([tuple(x)]))

would also fail, as all debit cards are counted to be from call 3 which is wrong. In fact, the above gives:

cluster 1   cluster 2   cluster 3
0   {'call 3': 'payment limit'} {'call 3': 'debit card'}    {'call 1': 'bond'}
1   {'call 2': 'spending limit'}    {'call 3': 'debit card'}    {'call 2': 'mortgage'}
2   {'call 3': 'nan'}   {'call 3': 'debit card'}    {'call 3': 'nan'}

Upvotes: 1

Views: 83

Answers (1)

Laurent
Laurent

Reputation: 13478

With the dataframes you provided, here is one way to do it by using an intermediary defaultdict from Python standard library collections module.

data = defaultdict(list)

# Make a dictionary pairing unique values from df with call number
# For instance, 'debit card': ['call 1', 'call 2', 'call 3']
for record in df.to_dict(orient="records"):
    for k, v in record.items():
        data[v].append(k)

# Update dfc accordingly
for col in dfc.columns:
    dfc[col] = dfc[col].apply(
        lambda x: dict([(data[x].pop(0), x)])
        if isinstance(data[x], list)
        else dict([(data[x], x)])
    )
print(dfc)
# Output
                      cluster 1                 cluster 2               cluster 3
0   {'call 3': 'payment limit'}  {'call 1': 'debit card'}      {'call 1': 'bond'}
1  {'call 2': 'spending limit'}  {'call 2': 'debit card'}  {'call 2': 'mortgage'}
2               {'call 1': nan}  {'call 3': 'debit card'}         {'call 3': nan}

Upvotes: 1

Related Questions