Curious

Reputation: 65

A good word splitter

I have a set of short strings (average length < 12). The strings are mostly sequences of English words (names, dictionary words, etc.), but there is no delimiter between the words. I want to split each string into its individual words. I searched Google but didn't find anything.

Is there a standard way to do that? Also, where can I get a dictionary that includes people's names along with other English words?

Please note: The strings might not adhere to grammatical rules of English.

Example strings:
dontdisturb
ilovejane
iamagoodperson

Upvotes: 1

Views: 386

Answers (1)

Alex Nevidomsky

Reputation: 698

This is a known problem for Twitter content and hashtags, though there is no standard or universally accepted way to solve it. (I would also suggest changing the topic to "hashtag splitter" if that is your problem, so that more people can find it.)

The algorithm I would suggest is the one typically used for segmenting Chinese text (which has a very similar issue, as you can imagine). Here is the idea:

1. Find all substrings that appear in a dictionary and give them the highest score.

2. Then add sequences accepted by some English heuristic, with a lower score.

3. Finally, throw in the individual letters or syllables found in the remainder, with the lowest score.

4. Use the Viterbi algorithm to find the non-overlapping coverage of the string with the highest total score.
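
The steps above can be sketched roughly as follows. This is only an illustration under my own assumptions, not a standard implementation: `TOY_DICT`, `MAX_WORD_LEN`, `piece_score` and `split_words` are hypothetical names, the dictionary is a tiny stand-in for a real word list merged with a list of names, and the weights (squared length for dictionary words, -1 for a stray letter) are arbitrary choices that make longer dictionary words win. The English heuristic from step 2 is left out. The search is a Viterbi-style dynamic program over split positions.

```python
from typing import List, Optional, Set

# Hypothetical toy dictionary; a real one would merge a word list with a list of names.
TOY_DICT: Set[str] = {"i", "a", "am", "dont", "disturb", "love", "jane",
                      "good", "person", "per", "son", "on"}

MAX_WORD_LEN = 20  # assumed upper bound on the length of a single piece


def piece_score(piece: str, dictionary: Set[str]) -> Optional[float]:
    """Score one candidate piece, or return None if it is not a candidate."""
    if piece in dictionary:
        # Assumed weighting: longer dictionary words score higher.
        return float(len(piece) ** 2)
    if len(piece) == 1:
        # Fallback: a stray single letter gets the lowest score.
        return -1.0
    return None


def split_words(text: str, dictionary: Set[str] = TOY_DICT) -> List[str]:
    """Return the highest-scoring non-overlapping split of text."""
    n = len(text)
    best = [float("-inf")] * (n + 1)  # best[i]: best total score for text[:i]
    back = [0] * (n + 1)              # back[i]: start of the last piece in that split
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            score = piece_score(text[start:end], dictionary)
            if score is None:
                continue
            total = best[start] + score
            if total > best[end]:
                best[end] = total
                back[end] = start
    # Walk the back-pointers from the end to recover the pieces in order.
    pieces: List[str] = []
    end = n
    while end > 0:
        start = back[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]


if __name__ == "__main__":
    for s in ("dontdisturb", "ilovejane", "iamagoodperson"):
        print(s, "->", " ".join(split_words(s)))
```

Because single letters are always allowed as a fallback, every string gets some split; text that is not in the dictionary simply comes back letter by letter. With a real dictionary you would probably want frequency-based scores instead of squared length.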

Upvotes: 1
