Reputation: 73
Here's the input data. As you can see in input data below, code 1 & 2 actually represent the same person as you find "Kim S Tom" as "compare" in code 2 and also as "agent" in code 1. Ideally, I'd like to have all the name variations under unique agent name such as "Tom Kim".
Input data:
Here's the ideal outcome I'm looking for so that I have a mapping of all four variations of identical person's names under unique agent name "Tom Kim". It doesn't matter which name I use for agent as long as it's unique identifier.
What I expect:
Here is my code and I'm not quite sure how to get to the ideal outcome I'm looking for as above. I really appreciate any suggestions as I've been working on this for quite sometime now... Thanks!
data = {'agent':['Tom Kim', 'Tom Kim', 'Kim S Tom', 'Kim S Tom','Catherine W','Catherine W'], 'compare':['Tom Kim','Kim S Tom','Tommy Kim','Kim Sun Tom','Wood Catherine','Catherine W']}
names = pd.DataFrame(data)
names.agent = pd.Categorical(names.agent)
names['code'] = names.agent.cat.codes
names_unique=names[['code']].drop_duplicates()
names_unique = names_unique.values.tolist()
mapping=pd.DataFrame(columns=['agent', 'compare'])
for f in list(names_unique):
agent_f=names[names['code'] == f]['compare']
agent_f=list(agent_f)
agent_name_f = names[names['code'] == f]['agent'][0]
agent_f_df=names[names['code'] == f][['agent','compare']]
mapping=mapping.append(agent_f_df)
names_unique.remove(f)
for s in list(names_unique):
agent_s = names[names['code'] == s]['agent'][0]
agent_s_df=names[names['code'] == s][['agent','compare']]
agent_name_s = names[names['code'] == s]['agent'][0]
if agent_s in agent_f:
agent_s_df['agent'] = agent_name_f
mapping=mapping.append(agent_s_df)
names_unique.remove(s)
else:
print(agent_name_f.strip().lower()+' AND '+agent_name_s.strip().lower()+' DID NOT MATCH')
Outcome I get: I get error message as below if I run codes above. I can run this code error free if there's no nested loop. That is only for first part of "f" looping and ignore "s" part of nested loop, then it runs.
ValueError: list.remove(x): x not in list
Upvotes: 0
Views: 123
Reputation: 96
I hope the following code will be helpful! It does create the output as you need it.
import pandas as pd
data = {'agent':['Tom Kim', 'Tom Kim', 'Kim S Tom', 'Kim S Tom','Catherine W','Catherine W'], 'compare':['Tom Kim','Kim S Tom','Tommy Kim','Kim Sun Tom','Wood Catherine','Catherine W']}
names = pd.DataFrame(data)
names.agent = pd.Categorical(names.agent)
names['code'] = names.agent.cat.codes
# create unique names list, adding those names that are identical in both 'agent' and 'compare' columns
names_unique = []
for i in names.index:
if names['agent'][i] == names['compare'][i] and names['agent'][i] not in names_unique:
names_unique.append(names['agent'][i])
# create dictory to store output data, keys: unique names, values: list of all possible names
output_data = {}
for n in names_unique:
output_data[n] = [n]
# for each row of the input datafram
for i in names.index:
# for each key/values of the output data dictonary
for key, values in output_data.items():
# check if one between 'agent' or 'compare' is already among the possible names list of the unique name represented by the key selected by the for loop
# and check if the other column isn't already among the possible names
if names['agent'][i] in values and names['compare'][i] not in values:
# add the name to the possible names list
output_data[key].append(names['compare'][i])
if names['compare'][i] in values and names['agent'][i] not in values:
# add the name to the possible names list
output_data[key].append(names['agent'][i])
# organise your data to easily create your output dataframe
df_structure = {'agent': [], 'compare':[]}
for k in output_data.keys():
for v in output_data[k]:
df_structure['agent'].append(k)
df_structure['compare'].append(v)
# create output dataframe
output_df = pd.DataFrame(df_structure)
output_df
The core of its functioning is in the for
loop used to iterate over all the rows of the input database.
At each row it checks if either the name in the column 'agent' or the one in 'compare' is already inside one of the lists of all possible names (the values of the dictionary output_data
).
If one of them is already inside, it means that the name in the other column is another possible name. Thus, it checks if the name in the other column is already in the list of possible names, too. If not, it will be added.
In this way the lists of possible names get updated after each row of the input database called by the for
loop.
EDIT:
Alternative code that fixes the problem of wrong unique names in the case proposed by the author of the question in the comments:
import pandas as pd
data = {'agent':['Tom Kim', 'Tom Kim', 'Kim S Tom', 'Kim S Tom','Catherine W','Catherine W','Tom Kim'], 'compare':['Tom Kim','Kim S Tom','Tommy Kim','Kim S Tom','Wood Catherine','Catherine W','Kim Sum Tom']}
names = pd.DataFrame(data)
names.agent = pd.Categorical(names.agent)
names['code'] = names.agent.cat.codes
#create a list that will contain the groups of possible names
list_name_groups = []
#iterate over each row of the datafram "names"
for i in names.index:
# create variables to make the code more readable
agent = names['agent'][i]
compare = names['compare'][i]
# variable to check if the names in this row are in any group, starts with 0 (== not in any group)
is_in_any_group = 0
# iterate over each group of names already added to the list
for group in list_name_groups:
# if agent in the group, compare is added to the list of alternative names
if (agent in group) and (compare not in group):
group.append(compare)
is_in_any_group = 1 # an existing group is found
break
# if compare in the group, agent is added to the list of alternative names
if (compare in group) and (agent not in group):
group.append(agent)
is_in_any_group = 1 # an existing group is found
break
# in case both agent and compare are already in the group
if (compare in group) and (agent in group):
is_in_any_group = 1 # an existing group is found
break
# if neither agent nor compare are in the group
if is_in_any_group == 0:
if agent == compare: # avoid duplicates
list_name_groups.append([agent])
else: # the two names are different and both not in the group
list_name_groups.append([agent, compare])
# define the structure of the output dataframe
df_structure = {'agent': [], 'compare':[]}
for group in list_name_groups:
for name in group:
df_structure['agent'].append(group[0]) # the repeated name in the "agent" column is the first of the group
df_structure['compare'].append(name)
# create output dataframe
output_df = pd.DataFrame(df_structure)
output_df
This one works differently, it uses a list of lists instead of a dictionary.
To pick the name that will be repeated in the column "agent" of the output dataframe, the code selects the first of the list of possible alternative names. I'm not sure if you need to follow a particular criterion ... in that case you may want to modify the final part of the code, where the structure of the output dataframe is defined!
Upvotes: 1