Reputation: 3
We have an old, custom C# hashing algorithm that we use to mask e-mail addresses for PII purposes. I'm trying to build a Python version of this algorithm, but I'm struggling with the differences in how C# and Python handle bytes and byte arrays, so my version produces the wrong hash value. For reference, this is Python 2.7, but a Python 3+ solution would work just as well.
C# code:
using System;
using System.Text;
using System.Security;
using System.Security.Cryptography;
public class Program
{
    public static void Main()
    {
        string emailAddressStr = "[email protected]";
        emailAddressStr = emailAddressStr.Trim().ToLower();
        SHA256 objCrypt = new SHA256Managed();
        byte[] b = (new ASCIIEncoding()).GetBytes(emailAddressStr);
        byte[] bRet = objCrypt.ComputeHash(b);
        string retStr = "";
        byte c;
        for (int i = 0; i < bRet.Length; i++)
        {
            c = (byte)bRet[i];
            // Each hash byte becomes two letters: tens part, then ones digit
            retStr += ((char)(c / 10 + 97)).ToString().ToLower();
            retStr += ((char)(c % 10 + 97)).ToString().ToLower();
        }
        Console.WriteLine(retStr);
    }
}
The (correct) value that gets returned is uhgbnaijlgchcfqcrgpicdvczapepbtifiwagitbecjfqalhufudieofyfdhzera
Python translation:
import hashlib
emltst = "[email protected]"
emltst = emltst.strip().lower()
b = bytearray(bytes(emltst).encode("ascii"))
bRet = bytearray(bytes(hashlib.sha256(b)))
emailhash=""
for i in bRet:
    c = bytes(i)
    emailhash = emailhash + str(chr((i / 10) + 97)).lower()
    emailhash = emailhash + str(chr((i % 10) + 97)).lower()
print(emailhash)
The (incorrect) value I get here is galfkejhfafdfedchcgfidhcdclbjikgkbjjlgdcgedceimaejeifakajhfekceifggc
The "business end" of the code is in the loop, where c is not translating well between languages. C# produces a numeric value for the calculation, but in Python, c is a string (so I'm using i instead). I've stepped through both sets of code and I believe I'm producing the same hash value right before the loop. I hope someone here might be able to help me out. TIA!
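One way to check what the Python loop actually received is to reverse the letter-pair encoding on the incorrect output. The sketch below does that; decode_pairs is a hypothetical helper name, not part of either original program. The decoded text looks like the printed representation of a hashlib object rather than a 32-byte digest, which suggests that in Python 2 (where bytes is an alias for str) the expression bytes(hashlib.sha256(b)) yields the object's string form, not the hash bytes.

```python
def decode_pairs(masked):
    """Invert the letter-pair encoding: each pair of letters maps back to one byte."""
    values = []
    for k in range(0, len(masked), 2):
        tens = ord(masked[k]) - 97      # first letter carries value // 10
        ones = ord(masked[k + 1]) - 97  # second letter carries value % 10
        values.append(tens * 10 + ones)
    return bytes(bytearray(values))

# The incorrect value reported in the question
wrong = "galfkejhfafdfedchcgfidhcdclbjikgkbjjlgdcgedceimaejeifakajhfekceifggc"

print(decode_pairs(wrong).decode("ascii"))
# prints: <sha256 HASH object @ 0x102da6f08>
```

If that reading is right, the fix is to take the raw bytes from the hash object's digest() method instead of wrapping the object in bytes().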
EDIT (2020-04-09)
Oguz Ozgul has a good solution below. I also found a savvy programmer at work who suggested this working Python 3 solution (it includes code for the broader task of ingesting a list of e-mails and using PySpark to write a table):
import sys
import hashlib
from pyspark.sql.types import StringType

# Note: "spark" is assumed to be an existing SparkSession (e.g. provided by the PySpark shell)

# First argument: path to a file with one e-mail address per line
myfile = sys.argv[1]
with open(myfile) as fql:
    insql = fql.read()
emails = insql.splitlines()
# Second argument: name of the table to write
mytable = sys.argv[2]

def getSha256Hash(email):
    b = bytearray(bytes(email, 'ascii'))
    res = hashlib.sha256(b)
    bRet = bytearray.fromhex(res.hexdigest())
    emailhash = ""
    for i in bRet:
        # "/" is float division in Python 3, so truncate back to an integer
        c1 = i / 10 + 97
        c2 = i % 10 + 97
        c1 = int(c1)
        c2 = int(c2)
        emailhash = emailhash + str(chr(c1)).lower()
        emailhash = emailhash + str(chr(c2)).lower()
    return emailhash

###################################
emailhashes = []
# A string is pure ASCII when its UTF-8 encoding has the same length
isascii = lambda s: len(s) == len(s.encode())
for e in emails:
    e = e.strip().lower()
    if isascii(e):
        emailhashret = getSha256Hash(e)
        emailhashes.append(emailhashret)

findf = spark.createDataFrame(emailhashes, StringType())
spark.sql("SET spark.sql.hive.convertMetastoreParquet=false")
findf.repartition(1).write.format("parquet").mode("overwrite").saveAsTable(mytable)
Upvotes: 0
Views: 86
Reputation: 7187
Here you go (Python 3.0)
Notes:
import hashlib
emltst = b"[email protected]"
emltst = emltst.strip().lower()
hashAlgorithm = hashlib.sha256()
hashAlgorithm.update(emltst)
# Thanks to Mark Meyer for pointing out that the
# bytearray(bytes(...)) wrappers are redundant
bRet = hashAlgorithm.digest()
emailhash = ""
# Iterating over bytes in Python 3 yields integers
for i in bRet:
    # "//" is integer division, matching C# byte arithmetic
    emailhash = emailhash + str(chr((i // 10) + 97)).lower()
    emailhash = emailhash + str(chr((i % 10) + 97)).lower()
print(emailhash)
OUTPUT:
uhgbnaijlgchcfqcrgpicdvczapepbtifiwagitbecjfqalhufudieofyfdhzera
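The same logic can be wrapped in a reusable function. The following is a minimal sketch, not the original author's code: the names encode_digest and mask_email are made up for illustration, and it assumes the input is pure ASCII. One difference worth keeping in mind is that C#'s ASCIIEncoding substitutes '?' for non-ASCII characters by default, while Python's encode("ascii") raises UnicodeEncodeError, so non-ASCII addresses would need separate handling to match the C# output.

```python
import hashlib

def encode_digest(digest):
    """Map each digest byte to two letters: value // 10 and value % 10, offset from 'a'."""
    letters = []
    # bytearray yields integers on both Python 2 and Python 3
    for value in bytearray(digest):
        tens, ones = divmod(value, 10)  # integer division, like C# byte arithmetic
        letters.append(chr(tens + 97))
        letters.append(chr(ones + 97))
    return "".join(letters)

def mask_email(email):
    """Trim, lower-case, SHA-256 the ASCII bytes, then letter-encode the digest."""
    normalized = email.strip().lower()
    digest = hashlib.sha256(normalized.encode("ascii")).digest()
    return encode_digest(digest)

# Bytes 0, 255 and 60 become "aa", "zf" and "ga"
print(encode_digest(bytes(bytearray([0, 255, 60]))))  # prints: aazfga
```

Using divmod keeps the division integral on every Python version, which avoids the float result that "/" produces in Python 3.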
Upvotes: 1