namezero

Reputation: 2293

Indexing strings and internationalization

I have recently come up with an indexing algorithm for finding duplicate customer records. Long story short, it works very well.

However, I'd like "Diviér" to match "Divier", and "Aether" to match "Æther". Removing the diacritics is not the hard part: it is possible with libicu or boost::locale, and the program uses wstring. My concern is this: normalizing/Latinizing a word changes its meaning in a way that matching on the result may no longer make sense. I would like some input on whether this would be acceptable for names...
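To make the kind of folding in question concrete, here is a minimal sketch using only the standard library. It is not the libicu or boost::locale approach mentioned above: `match_key` and its tiny hand-written table are hypothetical and cover only a few letters, and the sketch uses `std::u32string` rather than `wstring` so that one element is always one code point.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical helper: build a lowercase, diacritic-free key for comparison only.
// The table is illustrative and far from complete; a real system would rely on
// ICU or boost::locale instead of a hand-written mapping.
std::u32string match_key(const std::u32string& input) {
    static const std::unordered_map<char32_t, std::u32string> folds = {
        {U'\u00E9', U"e"},  {U'\u00C9', U"e"},   // e with acute, both cases
        {U'\u00E8', U"e"},  {U'\u00C8', U"e"},   // e with grave, both cases
        {U'\u00E6', U"ae"}, {U'\u00C6', U"ae"},  // ae ligature, both cases
    };

    std::u32string key;
    for (char32_t c : input) {
        auto it = folds.find(c);
        if (it != folds.end()) {
            key += it->second;                    // known letter: use its folded form
        } else if (c >= U'A' && c <= U'Z') {
            key += static_cast<char32_t>(c + 32); // ASCII upper case to lower case
        } else {
            key += c;                             // everything else passes through
        }
    }
    return key;
}
```

With this table, `match_key(U"Divi\u00E9r")` and `match_key(U"Divier")` produce the same key, as do `U"\u00C6ther"` and `U"Aether"`. Characters outside the table, such as Chinese ones, pass through unchanged, so such names would only match when they are written identically. The original string should be kept for display; the key is only meant for comparison.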

Also, what if someone has a Chinese name? This wouldn't be normalizable in this way, would it?

Do you have any recommendations on how to approach this?

Upvotes: 0

Views: 52

Answers (1)

Udo Klein

Reputation: 6882

You should look much more at the addresses and not too much at the names. In the end, names can be very misleading. For example, the transcription of Chinese, Russian or Japanese characters may vary depending on the country. Sometimes name fields are too short to capture the full name of a person (especially common with Indian names), which leads to all kinds of seemingly random abbreviations. Sometimes people omit middle names, sometimes they do not. And sometimes there are misspellings that yield proper but different names.

So in my opinion the name should be the least important criterion in finding duplicates.
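One possible way to express this in code is a weighted score in which the address dominates and the name only contributes a little. This is a sketch under assumptions: the `CustomerRecord` fields, the `duplicate_score` function, and the weights 0.75 and 0.25 are all made up for illustration, and both fields are assumed to be already normalized into comparable keys.

```cpp
#include <string>

// Hypothetical record layout; both fields are assumed to be pre-normalized keys.
struct CustomerRecord {
    std::string name_key;
    std::string address_key;
};

// Hypothetical scoring: the address carries most of the weight, the name the least.
// The weights are arbitrary illustration values, not tuned on real data.
double duplicate_score(const CustomerRecord& a, const CustomerRecord& b) {
    constexpr double kAddressWeight = 0.75;
    constexpr double kNameWeight = 0.25;

    double score = 0.0;
    // Empty fields are treated as unknown and never count as a match.
    if (!a.address_key.empty() && a.address_key == b.address_key) {
        score += kAddressWeight;
    }
    if (!a.name_key.empty() && a.name_key == b.name_key) {
        score += kNameWeight;
    }
    return score;
}
```

With these weights, two records that share an address but differ in name score higher than two records that share a name but differ in address. A real system would likely use fuzzy comparison per field rather than exact equality.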

Upvotes: 1
