Reputation: 966
My problem looks quite simple but I can't figure out a clean (and efficient) solution.
I have a list of tuples corresponding to common groups of words:
ngrams = [("data", "scientist"),
          ("machine", "learning"),
          ("c", "+"),
          ("+", "+"),
          ("c", "+", "+"),
          ("research", "and", "development"),
          ("research", "and")]
And a sentence:
"i am a data scientist . i do machine learning and c + + but no deep learning . i like research and development"
I would like to merge the common groups of words in a single token like that:
"i am a data_scientist . i do machine_learning and c_+_+ but no deep_learning . i like research_and_development"
I am sure there is an elegant way to do so, but I haven't been able to find one.
If there were only 2-tuples, iterating on zip(tokens, tokens[1:])
would do it, but I have up to 8-tuples in ngrams
, so that approach is not tractable!
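For context, here is the bigram-only version I have in mind (a while loop rather than a plain zip, so the second token of a merged pair is consumed and not reused):

```python
# Bigram-only sketch: merge adjacent token pairs that appear in a set of 2-grams.
bigrams = {("data", "scientist"), ("machine", "learning")}

tokens = "i am a data scientist . i do machine learning".split()
merged = []
i = 0
while i < len(tokens):
    # If the current and next token form a known bigram, join them and skip both.
    if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigrams:
        merged.append(tokens[i] + "_" + tokens[i + 1])
        i += 2
    else:
        merged.append(tokens[i])
        i += 1

print(" ".join(merged))  # i am a data_scientist . i do machine_learning
```

This only handles pairs; generalizing it naively to 3-, 4-, ..., 8-tuples is exactly what gets unwieldy.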
Upvotes: 1
Views: 101
Reputation: 3836
While Haldean Brown's answer is simpler, I think this is a more structured approach:
ngrams = [("data", "scientist"),
          ("machine", "learning"),
          ("c", "+"),
          ("+", "+"),
          ("c", "+", "+"),
          ("research", "and", "development"),
          ("research", "and")]
sent = """
i am a data scientist . i do machine learning and c + + but no deep
learning . i like research and development
"""
ngrams.sort(key=lambda x: -len(x))
tokens = sent.split()
out_ngrams = []
i_token = 0
while i_token < len(tokens):
    for ngram in ngrams:
        if ngram == tuple(tokens[i_token : i_token + len(ngram)]):
            i_token += len(ngram)
            out_ngrams.append(ngram)
            break
    else:
        out_ngrams.append((tokens[i_token],))
        i_token += 1
print(' '.join('_'.join(ngram) for ngram in out_ngrams))
Output:
i am a data_scientist . i do machine_learning and c_+_+ but no deep learning . i like research_and_development
ngrams after sorting:
[('c', '+', '+'),
('research', 'and', 'development'),
('data', 'scientist'),
('machine', 'learning'),
('c', '+'),
('+', '+'),
('research', 'and')]
That's needed so that ("c", "+", "+")
is tried before ("c", "+")
(or, in general, so that a sequence is tried before its prefixes). Note that a non-greedy split like [('c', '+'), ('+', 'a')]
might sometimes be preferable to [('c', '+', '+'), ('a',)]
, but that's another story.
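To see the greedy behaviour concretely, here is a self-contained version of the matching loop on the tokens c + + a (with a reduced ngram list, already sorted longest-first):

```python
# Greedy longest-match demo: ('c', '+', '+') wins over ('c', '+'),
# so 'a' is left as a single token even if a non-greedy split
# [('c', '+'), ('+', 'a')] might also have been valid.
ngrams = [("c", "+", "+"), ("c", "+"), ("+", "+")]

tokens = "c + + a".split()
out = []
i = 0
while i < len(tokens):
    for ngram in ngrams:
        if tuple(tokens[i:i + len(ngram)]) == ngram:
            out.append("_".join(ngram))
            i += len(ngram)
            break
    else:
        out.append(tokens[i])
        i += 1

print(" ".join(out))  # c_+_+ a
```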
Upvotes: 1
Reputation: 1
s = ''
seq = ("c", "+", "+")
print(s.join(seq))
More on the join method: Python docs
https://docs.python.org/3/library/stdtypes.html?highlight=join#str.join
Upvotes: 0
Reputation: 12721
You can build a list of replacement strings from your words in ngrams:
replace = [" ".join(x) for x in ngrams]
And then, for each element in that list, use str.replace:
for r in replace:
    sentence = sentence.replace(r, r.replace(" ", "_"))
There might be a more one-liner-y way to do it, but that seems relatively terse and easy to understand to me.
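Putting it together on the question's sentence (one caveat: str.replace works on substrings, so the order of ngrams matters; with this particular list the replacements happen to compose into the right result, e.g. "c +" then "+ +" yields "c_+_+"):

```python
# Replace-based approach: turn each ngram into a space-joined search string
# and substitute its underscore-joined form directly in the sentence.
ngrams = [("data", "scientist"),
          ("machine", "learning"),
          ("c", "+"),
          ("+", "+"),
          ("c", "+", "+"),
          ("research", "and", "development"),
          ("research", "and")]

sentence = ("i am a data scientist . i do machine learning and c + + "
            "but no deep learning . i like research and development")

replace = [" ".join(x) for x in ngrams]
for r in replace:
    sentence = sentence.replace(r, r.replace(" ", "_"))

print(sentence)
```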
Upvotes: 1