Reputation: 2565
Suppose I've got a bunch of similar strings with noise, mainly words wrongly connected/disconnected. Like:
"Once more unto the breach, dear friends. Once more!"
"Once more unto the breach , dearfriends. Once more!"
"Once more unto the breach, de ar friends. Once more!"
"Once more unto the breach, dear friends. Once more!"
How can I normalize everyone into the same set of words? Namely
["once" "more" "unto" "the" "breach" "dear" "friends" "once" "more"]
Thanks!
Upvotes: 1
Views: 77
Reputation: 16039
A bit of a crazy idea, and I am only suggesting it because I was teaching the algorithm I am about to propose to my students this week.
Remove all whitespace from your sentences, e.g. de ar friends
becomes dearfriends
. There exists a quadratic time, linear space dynamic programming algorithm to split the no-whitespace string into the most likely sequence of words. That algorithms is discussed here and here--- the second solution is what I have in mind. The assumption here is you have a good model of what is a word and that it takes constant time to query that model.
Upvotes: 1
Reputation: 22496
Here are a few pointers. I think you will have to eventually end up writing a set of routines/functions to fix all the various types of irregularities that you encounter.
The good news is that you can incrementally add to your set of "fixes" and keep improving the parser.
I had to do something similar, and I found this post by Peter Norvig extremely useful. (Note that it is in Python.)
This function, in particular has the ideas that you need: splitting, deleting, transposing, and inserting the irregular words to 'correct' them.
def edits1(word):
splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
deletes = [a + b[1:] for a, b in splits if b]
transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
inserts = [a + c + b for a, b in splits for c in alphabet]
return set(deletes + transposes + replaces + inserts)
The above is one snippet from Norvig's spelling corrector
Even if you cannot use the code as-is, the core idea is applicable to your case: You take a token ("word"), which is the irregular word in your case, try different tweaks to see if it belongs to a big dictionary of known and accepted words.
Hope that helps.
Upvotes: 3