Reputation: 21
Sample input stream: [ ('t','h'), ('h','e'), ('e', ' '), (' ','f') , ('f','o'), ('o','x'), ('x',' '), (' ','a'), ('a','t'), ('t','e'), ('e', <p>) ]
Suppose you have a sentence {ABCABA}
, where each letter is either a character or word, depending on tokenization.
Then your bag-of-bigrams is {(AB), (BC), (CA), (AB), (BA)}
.
From here, I need an algorithm to list all the possible permutations of sentences with the same length as the original sentence, given these bigrams. Here, {ABCABA}
(the original sequence) and (ABABCA)
are both valid, possible sentences, but {ACBABA}
is not. This example is for bigrams, but I also need this to work for any $n$
. Any ideas?
Upvotes: 2
Views: 841
Reputation: 8180
Here's a very straightforward solution. First, compute all n-grams; second, get all possible sublists of these n-grams and get all permutations of those sublists.
You can use a comprehension list. Take n
times the list with a start from indices 0
to n-1
: [sentence[k:] for k in range(n)]
. For ABCABA
and 3
, you get [ABCABA, BCABA, CABA]
. You just have to zip the lists and join the resulting tuples (note the star to unpack the arguments):
def ngrams(sentence, n): return ["".join(t) for t in zip(*[sentence[k:] for k in range(n)])]
This gives:
>>> ng = ngrams("ABCABA", 2)
>>> ng
['AB', 'BC', 'CA', 'AB', 'BA']
You can use itertools
, specifically combinations
and permutations
. The combinations
function gives the "r length subsequences of elements from the input iterable":
>>> list(itertools.combinations(ng, 2))
[('AB', 'BC'), ('AB', 'CA'), ('AB', 'AB'), ('AB', 'BA'), ('BC', 'CA'), ('BC', 'AB'), ('BC', 'BA'), ('CA', 'AB'), ('CA', 'BA'), ('AB', 'BA')]
You have to take the combinations for every possible length. The permutations
function will permute all this subsequences:
def sentences(sentence, n):
ng = ngrams(sentence, n)
for k in range(len(ng)):
for c in itertools.combinations(ng, k):
for p in itertools.permutations(c):
yield("".join(p))
Or with a generator comprehension:
def sentences(sentence, n):
ng = ngrams(sentence, n)
return ("".join(p) for k in range(len(ng)) for c in itertools.combinations(ng, k) for p in itertools.permutations(c))
This gives 206 possibilities:
>>> list(sentences("ABCABA", 2))
['', 'AB', 'BC', 'CA', 'AB', 'BA', 'ABBC', 'BCAB', ..., 'ABBACABC', 'BABCCAAB', 'BABCABCA', 'BACABCAB', 'BACAABBC', 'BAABBCCA', 'BAABCABC']
Upvotes: 0
Reputation: 65458
Build a directed graph and then use recursion to enumerate all possible paths of length k. To wit,
def buildgraph(input, n):
# n-1-gram to tokens that follow it
graph = {
tuple(input[i:(i + n - 1)]): set()
for i in range(len(input) - n + 1)
}
for i in range(len(input) - n + 1):
graph[tuple(input[i:(i + n - 1)])].add(input[i + n - 1])
return graph
def continuations(graph, n, k, pathsofar):
if len(pathsofar) == k:
yield pathsofar
elif len(pathsofar) < k:
for token in graph[pathsofar[-(n - 1):]]:
yield from continuations(graph, n, k, pathsofar + (token, ))
def allsentences(input, n, k):
graph = buildgraph(input, n)
for ngram in graph:
yield from continuations(graph, n, k, ngram)
for sent in allsentences('abcaba', 2, 6):
print(''.join(sent))
Upvotes: 1