Lock

Reputation: 5522

How to increase the speed of a HashSet when used to ensure no duplicate records are generated

I need to generate roughly 500k unique codes for each of our customers.

On its own, the code runs in a few minutes. However, I need to make sure there are no clashes, so I added logic that checks each new code against a HashSet. With that check in place, generating just 200k codes takes over 8 hours.

Is there anything I can do to improve the performance here?

The performance bottleneck is the HashSet lookup (the Contains call, around line 8 of the first method below). What other options are there for making sure there are no duplicates without this slowdown?

public string GenerateUniqueReferralCode(CustomerObj customer, HashSet<string> assignedCodes)
{
    bool isUnique = false;
    string code = String.Empty;
    do
    {
        code = GenerateReferralCode(customer);
        // The code is unique if it has not been assigned yet.
        isUnique = !assignedCodes.Contains(code);
    } while (!isUnique);
    return code;
}

public string GenerateReferralCode(CustomerObj customer)
{
    var code = String.Empty;
    //replace special characters and only keep alpha

    var name = customer.Profile.FirstName + customer.Profile.LastName;
    name = new String(name.Where(Char.IsLetter).ToArray());

    if (name.Length > 3)
    {
        code += name.Substring(0, 4).ToUpperInvariant();
    }
    else
    {
        // Three or fewer letters in the name: fall back to the whole first name.
        code += customer.Profile.FirstName.ToUpperInvariant();
    }

    code += CreateMD5(customer.Profile.Email + DateTime.UtcNow.ToString());

    code = code.Substring(0, 7);

    return code;
}

Upvotes: 2

Views: 72

Answers (1)

Damien_The_Unbeliever

Reputation: 239684

DateTime.UtcNow changes at a glacial rate compared to the speed of a modern processor, and it appears to be your only source of randomness when generating your codes [1]. There's also no way to recover this value and validate the MD5 [2] hash anyway, so I'm not sure what value it's adding.

Instead, use a cryptographic random number generator to generate some real randomness and use that in your codes. But don't forget to include the raw value in the code if you'll need to validate the hash.
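
As a rough sketch of that suggestion, assuming a 7-character code built entirely from random bytes: the class name ReferralCodes, the 32-symbol alphabet and both method names below are made up for illustration and do not come from the question. RandomNumberGenerator and HashSet<T>.Add are the standard .NET APIs.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;

public static class ReferralCodes
{
    // Hypothetical alphabet: digits and upper-case letters without the look-alikes
    // 0/O and 1/I. That leaves exactly 32 symbols, so 5 random bits map onto one
    // character without bias.
    private const string Alphabet = "23456789ABCDEFGHJKLMNPQRSTUVWXYZ";

    // Builds a code of the given length from cryptographically strong random bytes.
    public static string NewRandomCode(int length)
    {
        var bytes = new byte[length];
        using (var rng = RandomNumberGenerator.Create())
        {
            rng.GetBytes(bytes);
        }

        var chars = new char[length];
        for (int i = 0; i < length; i++)
        {
            // Keep the low 5 bits (0..31) as an index into the alphabet.
            chars[i] = Alphabet[bytes[i] & 31];
        }
        return new string(chars);
    }

    // Draws codes until one is new. HashSet<T>.Add returns false for a duplicate,
    // so a single call both checks and records the code.
    public static string NewUniqueCode(HashSet<string> assignedCodes, int length = 7)
    {
        string code;
        do
        {
            code = NewRandomCode(length);
        } while (!assignedCodes.Add(code));
        return code;
    }
}
```

With 32 symbols and 7 characters there are 32^7 (about 3.4 × 10^10) possible codes, so for 500k codes the loop should rarely need a second draw. Keeping the name prefix from the question would shrink that space, so how many characters to randomise is a judgement call.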


[1] Which means you're highly likely to spend ages looping, creating "new" codes that precisely match the previous code, until the time changes.

[2] N.B. you should not be using MD5 in new work either...
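
To see the effect described in the first footnote for yourself, something like the following could be used. It is only an illustration: the class and method names are hypothetical, and it assumes the default DateTime.ToString() format for your culture prints whole seconds with no fractional part, which is the usual case.

```csharp
using System;
using System.Collections.Generic;

public static class TimestampEntropyDemo
{
    // Counts how many distinct strings DateTime.UtcNow.ToString() produces
    // over a number of back-to-back calls.
    public static int CountDistinctTimestamps(int calls)
    {
        var distinct = new HashSet<string>();
        for (int i = 0; i < calls; i++)
        {
            // Same expression the question feeds into CreateMD5.
            distinct.Add(DateTime.UtcNow.ToString());
        }
        return distinct.Count;
    }
}
```

Calling CountDistinctTimestamps(100000) should return a very small number, because the string only changes when the clock ticks over to the next second. Every call in between hashes the same input and so produces the same code.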

Upvotes: 4
