D. Rad
D. Rad

Reputation: 21

Generating possible sentences from a scrambled list of n-grams (python)

Sample input stream: [ ('t','h'), ('h','e'), ('e', ' '), (' ','f') , ('f','o'), ('o','x'), ('x',' '), (' ','a'), ('a','t'), ('t','e'), ('e', <p>) ]


Suppose you have a sentence {ABCABA}, where each letter is either a character or word, depending on tokenization.

Then your bag-of-bigrams is {(AB), (BC), (CA), (AB), (BA)}.

From here, I need an algorithm to list all the possible permutations of sentences with the same length as the original sentence, given these bigrams. Here, {ABCABA} (the original sequence) and (ABABCA) are both valid, possible sentences, but {ACBABA} is not. This example is for bigrams, but I also need this to work for any $n$. Any ideas?

Upvotes: 2

Views: 841

Answers (2)

jferard
jferard

Reputation: 8180

Here's a very straightforward solution. First, compute all n-grams; second, get all possible sublists of these n-grams and get all permutations of those sublists.

The n-grams (may be optional, given your sample input stream)

You can use a comprehension list. Take n times the list with a start from indices 0 to n-1: [sentence[k:] for k in range(n)]. For ABCABA and 3, you get [ABCABA, BCABA, CABA]. You just have to zip the lists and join the resulting tuples (note the star to unpack the arguments):

def ngrams(sentence, n): return ["".join(t) for t in zip(*[sentence[k:] for k in range(n)])]

This gives:

>>> ng = ngrams("ABCABA", 2)
>>> ng
['AB', 'BC', 'CA', 'AB', 'BA']

List sentences

You can use itertools, specifically combinations and permutations. The combinations function gives the "r length subsequences of elements from the input iterable":

>>> list(itertools.combinations(ng, 2))
[('AB', 'BC'), ('AB', 'CA'), ('AB', 'AB'), ('AB', 'BA'), ('BC', 'CA'), ('BC', 'AB'), ('BC', 'BA'), ('CA', 'AB'), ('CA', 'BA'), ('AB', 'BA')]

You have to take the combinations for every possible length. The permutations function will permute all this subsequences:

def sentences(sentence, n):
    ng = ngrams(sentence, n)
    for k in range(len(ng)):
        for c in itertools.combinations(ng, k):
            for p in itertools.permutations(c):
                yield("".join(p))

Or with a generator comprehension:

def sentences(sentence, n):
    ng = ngrams(sentence, n)
    return ("".join(p) for k in range(len(ng)) for c in itertools.combinations(ng, k) for p in itertools.permutations(c))

This gives 206 possibilities:

>>> list(sentences("ABCABA", 2))
['', 'AB', 'BC', 'CA', 'AB', 'BA', 'ABBC', 'BCAB', ..., 'ABBACABC', 'BABCCAAB', 'BABCABCA', 'BACABCAB', 'BACAABBC', 'BAABBCCA', 'BAABCABC']

Upvotes: 0

David Eisenstat
David Eisenstat

Reputation: 65458

Build a directed graph and then use recursion to enumerate all possible paths of length k. To wit,

def buildgraph(input, n):
    # n-1-gram to tokens that follow it
    graph = {
        tuple(input[i:(i + n - 1)]): set()
        for i in range(len(input) - n + 1)
    }
    for i in range(len(input) - n + 1):
        graph[tuple(input[i:(i + n - 1)])].add(input[i + n - 1])
    return graph


def continuations(graph, n, k, pathsofar):
    if len(pathsofar) == k:
        yield pathsofar
    elif len(pathsofar) < k:
        for token in graph[pathsofar[-(n - 1):]]:
            yield from continuations(graph, n, k, pathsofar + (token, ))


def allsentences(input, n, k):
    graph = buildgraph(input, n)
    for ngram in graph:
        yield from continuations(graph, n, k, ngram)


for sent in allsentences('abcaba', 2, 6):
    print(''.join(sent))

Upvotes: 1

Related Questions