Reputation: 1088
This question is the extension of this one: How to keep track of element from another dataframe in pandas. Suppose I have the following dataframe:
df = pd.DataFrame({"call 1": ['debit card','bond',np.nan],
"call 2": ['debit card','mortgage','spending limit'],
"call 3":['payment limit','debit card',np.nan]})
which is:
call 1 call 2 call 3
0 debit card debit card payment limit
1 bond mortgage debit card
2 NaN spending limit NaN
I further do some clustering and I get the next dataframe:
dfc = pd.DataFrame( {'cluster 1': ['payment limit', 'spending limit',np.nan],
'cluster 2': ['debit card', 'debit card', 'debit card'],
'cluster 3': [ 'bond', 'mortgage',np.nan]})
which is
cluster 1 cluster 2 cluster 3
0 payment limit debit card bond
1 spending limit debit card mortgage
2 NaN debit card NaN
Now I want to keep track of the elements of the clustering table. Namely, I want to know which elements comes from which call, the desired result should look something like:
cluster 1 cluster 2 cluster 3
0 {call 3, payment limit} {call 1, debit card} {call 1, bond}
1 {call 2, pending limit} {call 2, debit card} {call 2, mortgage}
2 NaN {call 3, debit card} NaN
Using the methods in the previous question would fail, firstly because in many calls (i.e. 1,2 and 3) the elements are not distinct. Secondly,
dfc.applymap(lambda x: df[df.eq(x)].dropna(how='all').dropna(axis=1).to_dict('records')[0])
would work if we had 2 debit cards, for anything above 2 I get
IndexError: list index out of range
simply because the method above tries to make a dictionary.
Update: Using
lookup_dict = {}
look_df = df.T
for col in look_df.columns:
lookup_dict.update(dict(zip(look_df[col], look_df.index)))
pd.concat([dfc.replace(lookup_dict), dfc]).astype(str).groupby(level=0).agg(lambda x: dict([tuple(x)]))
would also fail, as all debit cards
are counted to be from call 3
which is wrong. In fact, the above gives:
cluster 1 cluster 2 cluster 3
0 {'call 3': 'payment limit'} {'call 3': 'debit card'} {'call 1': 'bond'}
1 {'call 2': 'spending limit'} {'call 3': 'debit card'} {'call 2': 'mortgage'}
2 {'call 3': 'nan'} {'call 3': 'debit card'} {'call 3': 'nan'}
Upvotes: 1
Views: 83
Reputation: 13478
With the dataframes you provided, here is one way to do it by using an intermediary defaultdict from Python standard library collections module.
data = defaultdict(list)
# Make a dictionary pairing unique values from df with call number
# For instance, 'debit card': ['call 1', 'call 2', 'call 3']
for record in df.to_dict(orient="records"):
for k, v in record.items():
data[v].append(k)
# Update dfc accordingly
for col in dfc.columns:
dfc[col] = dfc[col].apply(
lambda x: dict([(data[x].pop(0), x)])
if isinstance(data[x], list)
else dict([(data[x], x)])
)
print(dfc)
# Output
cluster 1 cluster 2 cluster 3
0 {'call 3': 'payment limit'} {'call 1': 'debit card'} {'call 1': 'bond'}
1 {'call 2': 'spending limit'} {'call 2': 'debit card'} {'call 2': 'mortgage'}
2 {'call 1': nan} {'call 3': 'debit card'} {'call 3': nan}
Upvotes: 1