jack

Reputation: 17881

Detect most likely words from text without spaces / combined words

How could I detect and split words from a combined string?

Example:

"cdimage" -> ["cd", "image"]
"filesaveas" -> ["file", "save", "as"]

Upvotes: 13

Views: 6586

Answers (5)

Abhishek Sengupta

Reputation: 3291

You can have a look at this example, but it's written in Scala. It can split anything you want when the sentence contains no spaces in between:

Nonspaced-Sentence-Tokenizer

Upvotes: 0

koga73

Reputation: 1022

I know this question is marked for Python, but I needed a JavaScript implementation. Going off of the previous answers, I figured I'd share my code. It seems to work decently.

function findWords(input){
    input = input.toLowerCase().replace(/\s/g, ""); //Strip whitespace

    var index = 0;
    var validWords = [];
    for (var len = input.length; len > 0; len--){ //Go backwards so as to favor longer words
        var testWord = input.substr(index, len);
        var dictIndex = _dictionary.indexOf(testWord.replace(/[^a-z\']/g, "")); //Remove non-letters
        if (dictIndex != -1){
            validWords.push(testWord);
            if (len == input.length){
                break; //We are complete
            }
            var nextWords = findWords(input.substr(len, input.length - len)); //Recurse
            if (!nextWords.words.length){ //No further valid words
                validWords.pop();
            }
            validWords = validWords.concat(nextWords.words);
            if (nextWords.complete === true){
                break; //Cascade complete
            }
        }
    }
    return {
        complete:len > 0, //We broke which indicates completion
        words:validWords
    };
}

Note: "_dictionary" is expected to be an array of words sorted by frequency. I am using a wordlist from Project Gutenberg.

Upvotes: -2

user97370

Reputation:

Here's a dynamic programming solution (implemented as a memoized function). Given a dictionary of words with their frequencies, it splits the input text at the positions that give the overall most likely phrase. You'll have to find a real wordlist, but I included some made-up frequencies for a simple test.

WORD_FREQUENCIES = {
    'file': 0.00123,
    'files': 0.00124,
    'save': 0.002,
    'ave': 0.00001,
    'as': 0.00555
}

def split_text(text, word_frequencies, cache):
    if text in cache:
        return cache[text]
    if not text:
        return 1, []
    best_freq, best_split = 0, []
    for i in range(1, len(text) + 1):
        word, remainder = text[:i], text[i:]
        freq = word_frequencies.get(word, None)
        if freq:
            # Probability of this split is the word's frequency times the
            # probability of the best split of the remainder.
            remainder_freq, remainder_split = split_text(
                    remainder, word_frequencies, cache)
            freq *= remainder_freq
            if freq > best_freq:
                best_freq = freq
                best_split = [word] + remainder_split
    cache[text] = (best_freq, best_split)
    return cache[text]

print(split_text('filesaveas', WORD_FREQUENCIES, {}))

--> (1.3653e-08, ['file', 'save', 'as'])
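
The made-up frequencies above are only for testing; as a rough sketch (not part of the original answer), a real table could be built from a plain-text corpus along these lines, where "corpus.txt" is just a placeholder path:

from collections import Counter
import re

def load_word_frequencies(path):
    # Count word occurrences in a plain-text corpus and normalize them to
    # relative frequencies, usable as the word_frequencies argument above.
    with open(path) as f:
        counts = Counter(re.findall(r"[a-z']+", f.read().lower()))
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# word_frequencies = load_word_frequencies('corpus.txt')  # placeholder corpus file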

Upvotes: 12

interjay

Reputation: 110108

I don't know of a library that does this, but it's not too hard to write yourself if you have a list of words:

wordList = open('words.txt', 'r').read().split()
words = set(s.lower() for s in wordList)

def splitString(s):
    found = []

    def rec(stringLeft, wordsSoFar):
        if not stringLeft:
            # The whole string has been consumed: record this split.
            found.append(wordsSoFar)
            return
        for pos in range(1, len(stringLeft) + 1):
            if stringLeft[:pos] in words:
                rec(stringLeft[pos:], wordsSoFar + [stringLeft[:pos]])

    rec(s.lower(), [])
    return found

This will return all possible ways to split the string into the given words.

Example:

>>> splitString('filesaveas')
[['file', 'save', 'as'], ['files', 'ave', 'as']]

Upvotes: 2

Max Shawabkeh

Reputation: 38603

I don't know of any library for it, but it shouldn't be hard to implement basic functionality.

  1. Get a word list, such as UNIX's words file.
  2. Stuff the contents of your word list into a trie.
  3. Take the string you want to split and follow its path in the trie. Each time you reach a valid word, create a new branch that searches for a word starting from the point of the string you have reached so far. Once you finish your current branch, backtrack to the one you created, as in a depth-first search (a rough sketch of this search is included after the example below).
  4. Disambiguate the resulting lists manually, using heuristics or through a natural language parser.

Example:

  1. Word: "filesaveasstring"
  2. First valid word is "file". Try matching "saveas". First valid word is "save". Try matching "asstring". First valid word is "as". Try matching "string". First valid word is "string". Matched until end; put the [file save as string] into your results list.
  3. Backtrack to matching "string" - no other possibilities. Backtrack to "asstring". First unvisited valid word is "ass". Try matching "tring". No possible matches. Backtrack to "asstring". No possible matches. Backtrack to "filesaveasstring".
  4. First unvisited match is "files". Try to match "aveasstring". First match is "ave". Try matching "asstring" (same results as steps 2/3), adding [files ave as string] to your results list and backtrack to the start.
  5. Try matching "filesaveasstring". No unvisited matches. Done.
  6. Select the most likely from [[file save as string] [files ave as string]] using a heuristic or a natural language parser.
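
Not part of the original answer, but here is a minimal Python sketch of steps 2 and 3 above, using a plain dict as the trie; the tiny word list, the "_end" marker, and the function names are assumptions made for illustration:

def build_trie(words):
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node['_end'] = True  # a complete word ends at this node
    return trie

def split_all(s, trie):
    results = []

    def dfs(pos, parts):
        if pos == len(s):  # consumed the whole string: record this split
            results.append(parts)
            return
        node = trie
        for i in range(pos, len(s)):
            if s[i] not in node:  # no dictionary word continues this prefix
                return
            node = node[s[i]]
            if '_end' in node:  # a valid word ends here: branch, then keep scanning
                dfs(i + 1, parts + [s[pos:i + 1]])

    dfs(0, [])
    return results

trie = build_trie(['file', 'files', 'save', 'ave', 'as', 'ass', 'string'])
print(split_all('filesaveasstring', trie))
# [['file', 'save', 'as', 'string'], ['files', 'ave', 'as', 'string']]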

Upvotes: 8
