Ian

Reputation: 3908

Reconstruct input string given ngrams of that string

Given a string, e.g. i am a string.

I can generate the n-grams of this string like so, using the nltk package, where n varies over a specified range.

from nltk import ngrams 

s = 'i am a string'
for n in range(1, 3):
    for grams in ngrams(s.split(), n):
        print(grams)

Gives the output:

('i',)
('am',)
('a',)
('string',)
('i', 'am')
('am', 'a')
('a', 'string')

Is there a way to 'reconstruct' the original string using combinations of the generated ngrams? Or, as a commenter below put it, is there a way to divide the sentence into consecutive word sequences where each sequence has a maximum length of k (in this case k is 2)?

[('i',), ('am',), ('a',), ('string',)]
[('i', 'am'), ('a',), ('string',)]
[('i',), ('am', 'a'), ('string',)]
[('i',), ('am',), ('a', 'string')]
[('i', 'am'), ('a', 'string')]

The question is similar to this one, though with an additional layer of complexity.

Working solution - adapted from here.

I have a working solution, but it's really slow for longer strings.

import itertools
from nltk import ngrams

def get_ngrams(s, min_=1, max_=4):
    token_lst = []
    for n in range(min_, max_):
        for grams in ngrams(s.split(), n):
            token_lst.append(' '.join(grams))
    return token_lst

def to_sum_k(s):
    for len_ in range(1, len(s.split())+1):
        for i in itertools.permutations(get_ngrams(s), r=len_):
            if ' '.join(i) == s:
                print(i)

to_sum_k('a b c')
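
For the short test string this prints every decomposition of 'a b c' whose pieces are ngrams (here of length at most 3):

('a b c',)
('a', 'b c')
('a b', 'c')
('a', 'b', 'c')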

Upvotes: 0

Views: 400

Answers (2)

user4668606

Reputation:

EDIT:
This answer was based on the assumption that the question was to reconstruct an unknown, unique string based on its ngrams. I'll leave it active for anyone interested in that problem. The actual answer for the problem as clarified in the comments can be found here.
EDIT END

In general no. Consider e.g. the case n = 2 and s = "a b a b". Then your ngrams would be

[("a"), ("b"), ("a", "b"), ("b", "a")]

The set of strings that generate this set of ngrams in this case, however, would be everything matched by

(ab(a|(ab)*a?))|(ba(b|(ba)*b?))

Or n = 2, s = "a b c a b d a", where "c" and "d" may be arbitrarily ordered within the generating strings. E.g. "a b d a b c a" would also be a valid string. In addition, the same issue as above arises, so an arbitrary number of strings can generate the set of ngrams.
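
To see this ambiguity concretely, here is a minimal sketch (my addition, not part of the original answer) that reuses the nltk ngrams helper from the question:

from nltk import ngrams

def ngram_set(s, max_n=2):
    # Collect all 1..max_n grams of the tokenised string as a set.
    tokens = s.split()
    return {g for n in range(1, max_n + 1) for g in ngrams(tokens, n)}

# Two different strings produce exactly the same set of 1- and 2-grams,
# so the set of ngrams alone cannot identify the original string.
print(ngram_set("a b a b") == ngram_set("a b a b a b"))  # True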

That being said, there exists a way to test whether a set of ngrams uniquely identifies a string:
Consider your set of ngrams as a description of a non-deterministic state-machine. Each ngram can be defined as a chain of states where the individual tokens are the transitions. As an example, for the ngrams [("a", "b", "c"), ("c", "d"), ("a", "d", "b")] we would build the following state-machine:

0 ->(a) 1 ->(b) 2 ->(c) 3
0 ->(c) 3 ->(d) 4
0 ->(a) 1 ->(d) 5 ->(b) 6

Now perform a determinization of this state-machine. Iff there exists a unique string that can be reconstructed from the ngrams, the state-machine will have a longest transition-chain that doesn't contain any cycles and contains all ngrams we built the original state-machine from. In this case the original string is simply the transition labels along this path joined back together. Otherwise there exist multiple strings that can be built from the provided ngrams.
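
As a rough illustration of the construction step only (my own sketch, not code from this answer; it gives every ngram a fresh chain of states, and the determinization that would merge shared prefixes, such as the two "a" transitions out of state 0, is left out):

def build_nfa(ngram_list):
    # Nondeterministic transitions: each ngram becomes its own chain of
    # states starting at state 0, with one transition per token.
    transitions = []          # edges as (source_state, token, target_state)
    next_state = 1
    for ng in ngram_list:
        state = 0
        for token in ng:
            transitions.append((state, token, next_state))
            state = next_state
            next_state += 1
    return transitions

for edge in build_nfa([("a", "b", "c"), ("c", "d"), ("a", "d", "b")]):
    print(edge)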

Upvotes: 3

user4668606

Reputation:

While my previous answer assumed that the problem was to find an unknown string based on its ngrams, this answer will deal with the problem of finding all ways to construct a given string using its ngrams.

Assuming repetitions are allowed, the solution is fairly simple: generate all possible number sequences summing up to the length of the original string, with no number larger than n, and use these to create the ngram combinations:

import numpy

def generate_sums(l, n, intermediate):
    # Yield every sequence of segment lengths (each between 1 and n) summing to l.
    if l == 0:
        yield intermediate
    elif l < 0:
        return
    else:
        for i in range(1, n + 1):
            yield from generate_sums(l - i, n, intermediate + [i])

def combinations(s, n):
    words = s.split(' ')
    # The leading 0 makes the cumulative sums usable as slice boundaries.
    for c in generate_sums(len(words), n, [0]):
        cs = numpy.cumsum(c)
        yield [words[l:u] for (l, u) in zip(cs, cs[1:])]
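
Running this on the question's string reproduces the five splits listed above (as lists of word lists rather than tuples):

for combo in combinations('i am a string', 2):
    print(combo)

# [['i'], ['am'], ['a'], ['string']]
# [['i'], ['am'], ['a', 'string']]
# [['i'], ['am', 'a'], ['string']]
# [['i', 'am'], ['a'], ['string']]
# [['i', 'am'], ['a', 'string']]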

EDIT:
As pointed out by @norok2 in the comments (thanks for the work), it seems to be faster to use alternative cumsum implementations instead of the one provided by numpy for this use case.
END EDIT

If repetitions are not allowed, things become a bit trickier. In this case we can use a non-deterministic finite automaton as defined in my previous answer and build our sequences based on traversals of the automaton:

from nltk import ngrams

def build_state_machine(s, n):
    # Build the nondeterministic state machine: every ngram of s is a chain
    # of word transitions starting at state 0.
    next_state = 1
    transitions = {}
    for ng in ngrams(s.split(' '), n):
        state = 0
        for word in ng:
            if (state, word) not in transitions:
                transitions[(state, word)] = next_state
                next_state += 1

            state = transitions[(state, word)]

    return transitions

def combinations(s, n):
    transitions = build_state_machine(s, n)
    # Each search state: (automaton state, terminal states already used,
    #                     completed ngrams so far, words of the current ngram)
    states = [(0, set(), [], [])]

    for word in s.split(' '):
        new_states = []
        for state, term_visited, path, cur_elem in states:
            if state not in term_visited:
                # Close the current ngram and start a new one from state 0.
                new_states.append((0, term_visited | {state}, path + [tuple(cur_elem)], []))
            if (state, word) in transitions:
                # Extend the current ngram with this word.
                new_states.append((transitions[(state, word)], term_visited, path, cur_elem + [word]))

        states = new_states

    return [path + [tuple(cur_elem)] if state != 0 else path
            for (state, term_visited, path, cur_elem) in states
            if state not in term_visited]

As an example the following state machine would be generated for the string "a b a":

[figure: state machine for "a b a"]

Red connections indicate a switch to the next ngram and need to be handled separately (second if in the loop), since they can only be traversed once.
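
Since the diagram may not survive copying, this is the transition table that build_state_machine returns for "a b a" with n = 2 (states are numbered in creation order; dict shown with Python 3.7+ insertion ordering):

print(build_state_machine('a b a', 2))
# {(0, 'a'): 1, (1, 'b'): 2, (0, 'b'): 3, (3, 'a'): 4}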

Upvotes: 1
