Reputation: 35196

What is the best way to determine duplicate credit card numbers without storing them?

I run a website where we mark certain accounts as scammers, and "flag" their account and all credit cards used as being bad. We don't store actual credit card values, but store an MD5 checksum of each one instead.

We are hitting collisions all the time now. What is the best way to store these values so that they are non-reversible but can still be compared against future values?

I thought MD5 would be the best, but we've got a debate going on here...

Upvotes: 6

Answers (11)

Timo

Reputation: 8680

As others have said, HMAC should be the way to go.

HMAC-SHA-256 with a proper key should:

Avoid collisions.
Avoid retrieval of the credit card number from the stored value.
Prevent an attacker from performing the same computation (on all possible credit card numbers, to find a matching value).

But there is one more very important thing:

It is with good reason that you are not storing the credit card numbers. Even if you could be 100% sure that you are using proper encryption, you probably still would not store credit card numbers. Why? For one thing, because the key could be leaked.

So you store hashes, so that the credit card number cannot be retrieved. ...Right?

Well, if you use a plain hash, a simple rainbow table with hashes of all possible credit card numbers gives away all the original data that you presumably did not store. Oops. But this you knew by now.

So we try to do better. Let's say using individual salts is better, and using HMAC is the best approach we know.

Consider the following scenario:

Take a 16-digit card number.
First 6 digits (Bank Identification Number) are guessed by trying a few common BINs.
Last 4 digits are visible in masked card number, which you are allowed to store. (You might not have this stored, which helps.)
1 digit is calculated (Luhn).

This leaves 5 digits to be brute-forced. That is a meager 100'000 attempts.
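To make the size of that search concrete, here is a minimal sketch that enumerates the candidates for one BIN and one masked ending. The function names and the example BIN and last-four values are assumptions chosen for illustration (they come from a widely published test card number), not anything from a real system.

```python
from itertools import product

def luhn_valid(number: str) -> bool:
    """Return True if the full card number passes the Luhn check."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:      # every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def candidates(bin6: str, last4: str):
    """Yield every Luhn-valid 16-digit number with the given BIN and last four digits."""
    for middle in product("0123456789", repeat=6):
        number = bin6 + "".join(middle) + last4
        if luhn_valid(number):
            yield number

# Six middle digits are unknown, but Luhn fixes one of them,
# so 10^5 of the 10^6 combinations survive.
count = sum(1 for _ in candidates("411111", "1111"))
print(count)  # 100000
```

Each surviving candidate costs one hash (or one HMAC, if the key is known) to test against a stored value.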

If we have used the individual salts, it's game over. We can simply brute-force each individual card number at an average of 50'000 attempts.

If we have used HMAC, we appear to be safe. But remember... we choose not to store encrypted card numbers, because even with perfect encryption, the key could be leaked. Guess what. Our HMAC key can be leaked just the same. With the key, again, we can brute-force each individual card number at an average of 50'000 attempts. So a leaked key gives us the credit card numbers, just as it would if we had stored encrypted card numbers.

As such, because of the low entropy of credit card numbers, storing hashes does not add much security compared to encrypted values (yet PCI limits the key rotation requirement to encryption).

A bit of perspective:

Ok, we're assuming a leaked key here. Extreme. But then again, so does PCI as part of their reasoning to forbid you from storing credit card numbers, so we should at least consider it.

True, I did not take into account the multiple guesses to find the BIN. It should be a small constant, though. Or we could limit ourselves to one BIN.

Definitely, a PCI auditor may be more forgiving than I am.

Yes, if you do not store the masked card number, you are a factor 10'000 safer. This helps a lot. Use it to your advantage. Still, if 50K attempts are doable, 500M may be doable, too. It's not enough to make me consider the data secure, in the context of a compromised key.

Conclusion:

Use HMAC-SHA-256. Understand the risk. Store as little as possible. Protect your keys vigilantly. Spend a fortune on a Hardware Security Module :-)

Upvotes: 3

Paul

Reputation: 31

People pointing out that a hash is "broken" are missing the point, perhaps regurgitating something they've heard without understanding what it means. When people talk about hashes being 'broken' they typically mean that it is possible to easily generate an alternate payload that computes to the same hash.

This 'breaks' the hash but only for the specific purpose of using a hash to verify data is what it's supposed to be.

That isn't the important point here, i.e. someone managing to create an alternate datastream that happens to hash down to the same value as one of the credit cards doesn't achieve anything meaningful or useful in terms of an attack vector.

The risk with hashes here is that the problem space for credit card numbers is pretty low and rainbow tables for them would be pretty cheap and easy to generate.

Adding a salt would add a bit of protection against already generated rainbow tables for pure card numbers but the extent to which it offers any real protection depends on how 'secret' the salt would remain in the case you are compromised. If the salt is exposed then new rainbow tables can then be cheaply generated and it's all over.

Given that the salt needs to be available to the application for it to perform checks against the blacklist there's a good chance someone compromising the blacklist data will also be able to get to the salt. If you have multiple servers you can mitigate that to some degree by ensuring both the salt and the data aren't in the same 'place' so an exposure of one server won't give someone all of the parts they need. (Similarly for backups don't store the data and the salt on the same media where someone can walk away with one tape and get everything). The salt only adds some protection while it is secret (in this type use).

If you have the resources to do it securely then I think that is the route to go. If you are getting a significant number of collisions on any reasonable hash function you must be doing a lot of volume. (In fact I'm highly surprised collisions would be a problem even then, any reasonable hash function should provide diverse results over a small problem space like this).

Upvotes: 3

Kris

Reputation: 1398

As Henri already mentioned above (+1), the right solution is to use a Message Authentication Code such as HMAC with a secret key. This is exactly the "secret salt" someone mentioned before. (BTW, salts are always public.)

Use standard construction such as HMAC-SHA-256 (RFC2104, FIPS-198a), keep the key secret and store the results (authentication tags) in a database.
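A minimal sketch of that construction using Python's standard `hmac` and `hashlib` modules. The names `card_fingerprint`, `is_flagged`, `EXAMPLE_KEY` and the in-memory `blacklist` set are hypothetical and stand in for whatever key storage and database you actually use.

```python
import hashlib
import hmac

def card_fingerprint(card_number: str, key: bytes) -> str:
    """Return the HMAC-SHA-256 tag of a card number as a 64-character hex string."""
    # Keep ASCII digits only, so "4111 1111 ..." and "4111-1111-..." compare equal.
    digits = "".join(ch for ch in card_number if ch in "0123456789")
    return hmac.new(key, digits.encode("ascii"), hashlib.sha256).hexdigest()

# Hypothetical key for illustration only; a real key would come from a secret store.
EXAMPLE_KEY = b"replace-with-random-bytes-from-a-secret-store"

# Stand-in for the database table of flagged cards: only tags are stored.
blacklist = {card_fingerprint("4111 1111 1111 1111", EXAMPLE_KEY)}

def is_flagged(card_number: str, key: bytes = EXAMPLE_KEY) -> bool:
    """Check a newly submitted card against the stored tags."""
    return card_fingerprint(card_number, key) in blacklist

print(is_flagged("4111-1111-1111-1111"))  # True
print(is_flagged("4012888888881881"))     # False
```

Because the same key is used for every entry, equal card numbers always produce equal tags, which is what makes duplicate detection possible.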

The larger digest size (256 bits) of SHA-256 should prevent any collisions from happening. SHA-256 is a fairly good hash function, and by the birthday bound you would need on the order of 2^128 distinct inputs before a random collision becomes likely, so if you ever encounter a collision in your system, please, let me know! :)

Upvotes: 2

Björn

Reputation: 29381

MD5 is NOT the way to go since it's broken. Quote Bruce Schneier: "[w]e already knew that MD5 is a broken hash function" and that "no one should be using MD5 anymore."

I.e. use SHA512 or SHA256 as someone already proposed.

Upvotes: 2

Henri

Reputation: 5113

Don't bother doing salts, just use HMACs. I know it's kind of an abuse, but then you get a decent keyed hash, so you can prevent collisions and rainbow table attacks.

The nice thing here is that even if the key leaks, nobody can decrypt it. The best thing that works for HMACs is brute force. Actually, the key here is a salt as mentioned earlier. The nice thing here is that the algorithm is a little better than the usual salting stuff done by most non-security programmers.

Upvotes: 1

John Gietzen

Reputation: 49544

A cryptographically secure hash would work. (SHA512 or SHA256 would be OK)

However, I would use a fairly secret salt that is not stored along with the cards (to prevent any sort of rainbow table attack).

PS:
Rainbow table attacks against credit cards could be particularly effective, since the total size of the plain-text-space is quite small due to the limited character set, the fixed size, and the check digits.

PPS:
You can't use a random salt for each entry, because you would never be able to feasibly check duplicates. Salts are used to make identical inputs hash to different values, whereas we are specifically looking for identical hashes in this instance.
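A small sketch of why per-entry random salts defeat duplicate detection. The helper `salted_hash` and the value `SITE_SECRET` are made up for this illustration, and the simple salt-then-hash construction is used only to show the effect.

```python
import hashlib
import os

def salted_hash(card_number: str, salt: bytes) -> str:
    """SHA-256 over salt followed by the card number, as a hex string."""
    return hashlib.sha256(salt + card_number.encode("ascii")).hexdigest()

card = "4111111111111111"  # widely published test number

# Per-entry random salts: the same card produces a different value each time,
# so a plain equality lookup can never find the earlier entry.
first = salted_hash(card, os.urandom(16))
second = salted_hash(card, os.urandom(16))
per_entry_match = first == second

# One site-wide secret value (hypothetical): the same card always maps to the same value.
SITE_SECRET = b"hypothetical-site-wide-secret"
fixed_match = salted_hash(card, SITE_SECRET) == salted_hash(card, SITE_SECRET)

print(per_entry_match, fixed_match)
```

With per-entry salts, the only way to check a new card would be to rehash it with every stored salt in turn, which is the infeasible part.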

Upvotes: 16

David R Tribble

Reputation: 12204

Perhaps you can store two different hashes of the card number. The chances that both hashes will result in collisions is practically zero.

Upvotes: 4

meklarian

Reputation: 6625

It isn't sufficiently safe to just use a good Hash algorithm. If your list is stolen, your stored hashes can be used to retrieve working card information. The actual schema-space for credit card numbers is small enough that a determined attacker can pre-calculate many of the possible hashes ahead of time as well, and this may have other implications for your system if there is an intrusion or an inside-job.

I recommend you use a salt and also calculate a 2nd value to be added to the salt based on a formula involving each digit of the card number and the first salt value. This assures that if you lose control of either part, you still have reasonable uniqueness that renders ownership of the list useless. The formula should not be heavily weighted toward the first 6 digits of the card (BIN number), though, and no trace of the formula should be stored in the same location as either the salt or the final hash.

Consider the anatomy of a 16-digit credit card number:

6 digit BIN (Bank Identification Number)
9 digit Account Number
1 digit Luhn Checksum

BIN lists are well known within the processing industry and are not too difficult to assemble for those with access to an illicit list of card numbers. The number of valid BINs is further diminished by the assigned space for each issuer.

Visa - Starts with 4
American Express - Starts with 34 / 37
MasterCard - Starts with 5
Discover/CUP - Starts with 6
Diners Club - Starts with 36 / 38
etc.

Note that some of the assigned BIN information within each issuer category is also sparse. If an attacker is aware of where most of your customers are located, then that will cut down the uniqueness considerably, as BIN information is assigned on a per-bank basis. An attacker that already has an account issued by a small bank in a wealthy neighborhood could just get an account and use the BIN as a starting point on his own card.

The checksum digit is calculated with a well-known formula, so that is immediately discardable as a source of unique data.
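For reference, a short sketch of that formula (the Luhn algorithm). The function name `luhn_check_digit` is my own, and the inputs are widely published test values rather than real card numbers.

```python
def luhn_check_digit(payload: str) -> int:
    """Compute the Luhn check digit for a number given without its final digit."""
    total = 0
    for i, ch in enumerate(reversed(payload)):
        d = int(ch)
        if i % 2 == 0:      # the rightmost payload digit is doubled, then every second one
            d *= 2
            if d > 9:
                d -= 9      # same as adding the two digits of the doubled value
        total += d
    return (10 - total % 10) % 10

print(luhn_check_digit("411111111111111"))  # 1, giving 4111111111111111
print(luhn_check_digit("7992739871"))       # 3
```

Since the last digit follows entirely from the first fifteen, it contributes nothing to the search space.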

Armed with a handful of BINs worth targeting, an attacker has to check 9 digits at a time for each BIN set. This is 1 Billion Checksums and Hash Operations per set. I don't have any benchmarks handy, but I'm pretty sure 1 Million Hash operations per minute is not unreasonable for MD5 or any flavor of SHA on a suitably powerful machine. This amounts to less than a day to crack all matches under a given BIN.
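The arithmetic behind that estimate, written out as a quick calculation. The rate is the rough figure assumed above, not a measured benchmark.

```python
candidates_per_bin = 10 ** 9   # every 9-digit account number under one BIN
hashes_per_minute = 10 ** 6    # assumed rate from the estimate above, not a benchmark

minutes = candidates_per_bin / hashes_per_minute
hours = minutes / 60

print(f"{minutes:.0f} minutes, about {hours:.1f} hours per BIN")
```

That is 1000 minutes, or a little under 17 hours, for each BIN at the assumed rate.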

Finally, you might consider storing a timestamp or visitor token (IP/subnet) with your hashes as well. It is nice to catch duplicate card numbers, but also consider the ramifications of someone stuffing your system with bogus card numbers. At some point you need to decide on a trade-off between blocking card numbers that you know are invalid, and also give yourself a mechanism to identify and repair misuse.

For example, a disgruntled employee could be stealing card information on his own and then use your hash mechanism against you by inserting valid hashes into your card number blacklist to block repeat business. It is quite expensive to undo this if you are just storing a hash- everything is opaque once it has been converted to a hash. With this in mind, give yourself a method to identify the source of the hash as well.

Upvotes: 4

Mark Snidovich

Reputation: 1055

Using the strongest hash possible is usually good. Speed is not of the essence and slowness actually works against anyone trying a brute force reversal of your hashed values.

I like whirlpool, personally - if you're using PHP check out the supported algorithms at the hash function docs

Whirlpool returns a string 128 characters long, but you don't have to store all of it necessarily. The first 32 or 64 chars would suffice. You could also consider sha512 or sha384.

Upvotes: 1

matt b

Reputation: 139931

If you are finding collisions with MD5, why not use a better algorithm such as SHA1 or SHA256?

Upvotes: 2

Niteriter

Reputation: 140

Use SHA1, hash collisions are yet to be found.

Upvotes: 3
