Manish Patel

Reputation: 4491

Convert similar-sounding word parts

I'm having trouble searching for the right terms here to solve the below problem; I'm sure it's a done thing, I just can't find the right terms to express the problem!

I'm basically trying to create a classifier that will take word comparison outputs (e.g. Levenshtein distances) and decide whether two words are sufficiently different. An important input would probably be something like a Soundex comparison. The trouble I'm having is creating the training set for the algorithm (an SVM in this case). I have a long list of names and I need to mutate them a bit (based on similar sounds within the word).

E.g. John and Jon would be a mutation to make, and I could label this pair in the test set as being equivalent. John and Johann differ enough in both sound and letter distance to be considered different.
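For reference, here is a minimal sketch of how the letter-distance feature for such pairs could be computed in plain Java. The class name `NameDistance` and the `normalized` helper are made up for illustration; the normalization by the longer word's length is just one possible choice, not something the classifier requires.

```java
import java.util.Locale;

class NameDistance {

    /** Classic dynamic-programming Levenshtein distance: insert, delete and substitute each cost 1. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Case-insensitive distance divided by the longer length (an assumed normalization), in [0, 1]. */
    static double normalized(String a, String b) {
        String x = a.toLowerCase(Locale.ROOT);
        String y = b.toLowerCase(Locale.ROOT);
        int max = Math.max(x.length(), y.length());
        return max == 0 ? 0.0 : (double) levenshtein(x, y) / max;
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("John", "Jon"));    // prints 1
        System.out.println(levenshtein("John", "Johann")); // prints 2
    }
}
```

Values like these would be one column of the feature vector, next to whatever phonetic comparison is used.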

So what I'm kinda asking for is a way to build a phoneme variation generator, but one that retains the English lettering structure.

Even simple translation might suffice, like "f" could (sometimes) be replaced by "ph". I'm doing this in Java so any tips in that direction would be great too! Thanks.

EDIT

This is the closest I've come across so far: http://www.isi.edu/natural-language/people/hovy/papers/07IJCAI-spelling-variants.pdf

Upvotes: 0

Views: 119

Answers (1)

Debasis

Reputation: 3750

I'm just thinking aloud.

Rule-based: Apply a rule-based system with standard substitution rules, such as 'ph' for 'f', and insertion rules, such as inserting an 'h' between a vowel and a consonant.
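A rough Java sketch of the rule-based idea, under a few assumptions of mine: the rule table is a tiny made-up sample (only the 'ph'/'f' pair and the 'h' insertion come from the text above), each variant applies exactly one rule at one position, and everything is lower-cased to keep it short. The class name `VariantGenerator` is hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

class VariantGenerator {

    // Illustrative substitution rules {pattern, replacement}; a real table would be much larger.
    static final String[][] RULES = {
        {"ph", "f"}, {"f", "ph"}, {"ck", "k"}, {"hn", "n"}
    };

    static boolean isVowel(char c) {
        return "aeiou".indexOf(c) >= 0;
    }

    /** Returns lower-cased variants, each produced by a single substitution or a single 'h' insertion. */
    static Set<String> variants(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        Set<String> out = new LinkedHashSet<>();

        // Substitution rules: apply each rule at every position where its pattern occurs.
        for (String[] rule : RULES) {
            int idx = lower.indexOf(rule[0]);
            while (idx >= 0) {
                out.add(lower.substring(0, idx) + rule[1] + lower.substring(idx + rule[0].length()));
                idx = lower.indexOf(rule[0], idx + 1);
            }
        }

        // Insertion rule: put an 'h' between a vowel and a following consonant (skipping an existing 'h').
        for (int i = 1; i < lower.length(); i++) {
            char before = lower.charAt(i - 1);
            char after = lower.charAt(i);
            if (isVowel(before) && Character.isLetter(after) && !isVowel(after) && after != 'h') {
                out.add(lower.substring(0, i) + "h" + lower.substring(i));
            }
        }

        out.remove(lower); // never report the input itself as a variant
        return out;
    }

    public static void main(String[] args) {
        System.out.println(variants("Jon"));    // prints [john]
        System.out.println(variants("Stefan")); // prints [stephan, stehfan, stefahn]
    }
}
```

As the second output shows, blind rule application also produces implausible spellings, so some filtering (for example against a list of known names) would probably be needed before using the pairs as positive training examples.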

Character n-gram alignment: Use a word alignment tool such as GIZA++ to align character n-grams from parallel corpora such as Europarl. I guess you would be able to find interesting spelling variations such as "house" and "haus". You can play with various values of n.
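Word aligners generally work on whitespace-separated tokens, so one way to get character-level alignments is to rewrite every word as a line of character n-grams before feeding it to the tool. The sketch below only shows that preprocessing step in Java; the class name `CharNgrams` is made up, overlapping n-grams are my assumption, and the aligner's own input files and invocation are not covered here.

```java
import java.util.ArrayList;
import java.util.List;

class CharNgrams {

    /** Splits a word into overlapping character n-grams; words no longer than n are kept whole. */
    static List<String> ngrams(String word, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        List<String> out = new ArrayList<>();
        if (word.length() <= n) {
            if (!word.isEmpty()) {
                out.add(word);
            }
            return out;
        }
        for (int i = 0; i + n <= word.length(); i++) {
            out.add(word.substring(i, i + n));
        }
        return out;
    }

    /** Renders the n-grams as one space-separated line, so each n-gram looks like a "word" to an aligner. */
    static String toTokenLine(String word, int n) {
        return String.join(" ", ngrams(word, n));
    }

    public static void main(String[] args) {
        System.out.println(toTokenLine("house", 1)); // prints h o u s e
        System.out.println(toTokenLine("haus", 2));  // prints ha au us
    }
}
```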

Bootstrapping character n-gram alignment with rules: You might also want to combine the two, in which case you could, in principle, boost the probabilities of some alignments by using a set of external rules and heuristics.

Upvotes: 1
