Reputation:
I'm new to Python and working on a hashing algorithm.
I have a DataFrame:
df2
Out[55]:
CID SID
0 2094825 141
1 2327668 583
2 2259956 155
3 1985370 100
4 2417177 47
... ...
1030748 2262027 100
1030749 2232061 100
1030750 2027795 169
1030751 2474609 100
1030752 2335654 169
[1030753 rows x 2 columns]
How do I use Python's hashlib library to generate a hash such that each combination of CID and SID maps to a unique value? For example, CID 2262027 with SID 100 might give fj6x55, while CID 2232061 with SID 100 gives a different value such as f6223xi. The values only need to be unique per combination; if a combination repeats, it should produce the same hash. I'm also open to other suggestions, such as one-hot encoding, in case hashlib isn't suitable. So far I am getting an error:
import hashlib
x = hashlib.md5(df2['SID'])
Traceback (most recent call last):
File "<ipython-input-60-44772f235990>", line 1, in <module>
x = hashlib.md5(df2['SubDiagnosisId'])
TypeError: object supporting the buffer API required
Upvotes: 0
Views: 481
Reputation: 704
Here's my attempt at this one:
hashes = df2.apply(lambda x:hashlib.md5((str(x[0])+str(x[1])).encode('utf8')).hexdigest(), axis=1)
Some explanation:
df2.apply takes a function, in this case an anonymous lambda function, as well as the axis over which we want to apply the function. In this case, axis=1 applies the function to each row.
Breakdown of the hashing function:
The anonymous function takes one argument x, which is a row holding the values of the two columns. We break x down into x[0] (the first column, CID) and x[1] (the second column, SID).
Here, we have two choices. We can either convert the integers into strings and concatenate the strings as I've done here, or multiply the CID value by some constant that is strictly greater than max(SID) and add the SID. However, string concatenation without a delimiter is ambiguous: CID 1 with SID 23 and CID 12 with SID 3 both become "123", so two different combinations can end up with the same hash. The better approach may be the following:
df2.apply(lambda x:hashlib.md5(str(x[0]*1024+x[1]).encode('utf8')).hexdigest(), axis=1)
The largest SID in your sample is 583, so I chose the next power of 2, 1024, as the multiplier. This effectively left-shifts each CID value by 10 bits so that its 10 least significant bits are zero. Then we fill those bits with the SID value using addition. This only stays unique while every SID is non-negative and below 1024, so check max(SID) on the full data first.
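To make the difference between the two options concrete, here is a small sketch in plain Python, no pandas needed. The helper names concat_key and packed_key are made up for illustration, and the multiplier of 1024 assumes every SID is non-negative and below 1024.

```python
def concat_key(cid: int, sid: int) -> str:
    # Plain string concatenation with no delimiter between the two values.
    return str(cid) + str(sid)


def packed_key(cid: int, sid: int, multiplier: int = 1024) -> int:
    # Assumption: 0 <= sid < multiplier, otherwise SID would spill into CID's bits.
    if not 0 <= sid < multiplier:
        raise ValueError("sid must be in the range [0, multiplier)")
    # Same as (cid << 10) + sid when multiplier is 1024.
    return cid * multiplier + sid


# Two different (CID, SID) pairs whose digits line up to the same string.
pair_a = (20948, 25)
pair_b = (209482, 5)

concat_collides = concat_key(*pair_a) == concat_key(*pair_b)
packed_collides = packed_key(*pair_a) == packed_key(*pair_b)

# The packed integer can be split back into its parts, which is why it is unique.
cid, sid = divmod(packed_key(2262027, 100), 1024)
```

Here concat_collides comes out True and packed_collides comes out False, and divmod recovers the original pair from the packed integer.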
Either way, the final representation needs to be an encoded byte string, hence the str(integer_stuff).encode('utf8') part. Finally, we pass that result to hashlib.md5() and call .hexdigest() to retrieve the hexadecimal string representation of the hash.
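The TypeError in the question comes from handing a whole pandas Series to hashlib.md5(), which only accepts bytes-like objects. Below is a minimal sketch of the per-row step on its own. row_hash is a hypothetical helper name, and the "_" delimiter is my own addition to keep the string key unambiguous; it is not something hashlib requires.

```python
import hashlib


def row_hash(cid: int, sid: int) -> str:
    # Build an unambiguous text key, then encode it to bytes for md5.
    key = f"{cid}_{sid}"
    return hashlib.md5(key.encode("utf8")).hexdigest()


h1 = row_hash(2262027, 100)
h2 = row_hash(2232061, 100)
h3 = row_hash(2262027, 100)  # same pair as h1, so the same digest
```

The same pair always gives the same 32-character hex string, and different pairs give different strings in practice. Keep in mind that MD5 is a hash, not encryption, and uniqueness is extremely likely rather than strictly guaranteed.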
Improvements to my approach as far as Pandas itself is concerned are welcome :) but I think my hashing approach itself is quite sound.
EDIT:
In order to join the result to the original DataFrame, try the following:
# Calculate the hashes. This gives you a Series.
hashes = df2.apply(lambda x:hashlib.md5((str(x[0])+str(x[1])).encode('utf8')).hexdigest(), axis=1)
# Create a DataFrame from the above Series
df_hash = pd.DataFrame(hashes, columns=['hash'])
# Join the hashes with the original DataFrame
df2 = df2.join(df_hash)
Tested with a short set of data, so it should work for you too :)
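If you don't need the intermediate DataFrame, a shorter route is to assign the Series straight to a new column. The sketch below uses a tiny stand-in for df2 built from values in the question, a hypothetical row_hash helper, and label access (row['CID']) instead of x[0], since I believe newer pandas versions warn about integer position access on a labeled row. The "_" delimiter is again my own assumption. The last line is an optional alternative in case a plain integer ID per combination is enough and no hash is needed.

```python
import hashlib

import pandas as pd

# Small stand-in for df2; rows 0 and 2 repeat the same (CID, SID) pair.
df2 = pd.DataFrame({"CID": [2262027, 2232061, 2262027], "SID": [100, 100, 100]})


def row_hash(row) -> str:
    # Delimited key so that different pairs cannot produce the same string.
    key = f"{row['CID']}_{row['SID']}"
    return hashlib.md5(key.encode("utf8")).hexdigest()


# Assign the per-row hashes directly as a new column.
df2["hash"] = df2.apply(row_hash, axis=1)

# Alternative without hashing: one integer label per unique (CID, SID) combination.
df2["pair_id"] = df2.groupby(["CID", "SID"]).ngroup()
```

On a million rows apply with axis=1 will be slow, but it only has to run once.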
Upvotes: 1