Reputation: 15
I have data in a table that is assigned to various individuals. However, a lot of the data is identical based on MD5 and I want each entry to be attributed to each person with the same MD5. For example, the input data might looks like the following:
However, I want the output to be as per the table below. By way of example, Owner for A,C,E is the same as their MD5 is identical (appreciate MD5 would be longer, this is just for illustration). Similarly for B,D,F, the MD5 is the same, so the Owner is also merged for all 3 entries.
I have written a script that works (but is hugely inefficient):
md5OwnerDF = pd.DataFrame({"MD5 Digest":[],"All_Owners":[]})
for md5 in uniqueTLMD5List:
# sub dataframe with md5
md5DF = df.loc[df['MD5 Digest'] == md5].copy()
# bring in owners of this sub dataframe (md5DF) and convert into unique list (ownerList)
ownerList = ';'.join([str(elem) for elem in md5DF.loc[:,'Owner'].unique().tolist()])
md5OwnerDF_temp = pd.DataFrame({"MD5 Digest":[md5],"All_Owners":[ownerList]})
md5OwnerDF = md5OwnerDF.append(md5CustodianDF_temp)
Is there a better way of achieving the same result without iteration?
Thank you in advance.
Upvotes: 0
Views: 51
Reputation: 323266
You can try with transform
+ groupby
df['new'] = df.roupby('ID')['MD5'].transform(','.join)
Upvotes: 2