Reputation: 966
My problem looks quite simple but I can't figure out a clean (and efficient) solution.
I have a list of tuples corresponding to common groups of words:
ngrams = [("data", "scientist"),
          ("machine", "learning"),
          ("c", "+"),
          ("+", "+"),
          ("c", "+", "+"),
          ("research", "and", "development"),
          ("research", "and")]
And a sentence:
"i am a data scientist . i do machine learning and c + + but no deep learning . i like research and development"
I would like to merge the common groups of words in a single token like that:
"i am a data_scientist . i do machine_learning and c_+_+ but no deep_learning . i like research_and_development"
I am sure there is an elegant way to do so, but I haven't been able to find one.
If there were only 2-tuples, iterating on zip(tokens, tokens[1:])
would do it, but I have up to 8-tuples in ngrams
, so that approach is not tractable!
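For context, here is the bigram-only version I have in mind (a while loop rather than a plain zip, so the second token of a merged pair is consumed and not reused):

```python
# Bigram-only sketch: merge adjacent token pairs that appear in a set of 2-grams.
bigrams = {("data", "scientist"), ("machine", "learning")}

tokens = "i am a data scientist . i do machine learning".split()
merged = []
i = 0
while i < len(tokens):
    # If the current and next token form a known bigram, join them and skip both.
    if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigrams:
        merged.append(tokens[i] + "_" + tokens[i + 1])
        i += 2
    else:
        merged.append(tokens[i])
        i += 1

print(" ".join(merged))  # i am a data_scientist . i do machine_learning
```

This only handles pairs; generalizing it naively to 3-, 4-, ..., 8-tuples is exactly what gets unwieldy.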
Upvotes: 1
Views: 101
Reputation: 3836
While Haldean Brown's answer is simpler, I think this is a more structured approach:
ngrams = [("data", "scientist"),
          ("machine", "learning"),
          ("c", "+"),
          ("+", "+"),
          ("c", "+", "+"),
          ("research", "and", "development"),
          ("research", "and")]
sent = """
i am a data scientist . i do machine learning and c + + but no deep
learning . i like research and development
"""
ngrams.sort(key=lambda x: -len(x))
tokens = sent.split()
out_ngrams = []
i_token = 0
while i_token < len(tokens):
    for ngram in ngrams:
        if ngram == tuple(tokens[i_token : i_token + len(ngram)]):
            i_token += len(ngram)
            out_ngrams.append(ngram)
            break
    else:
        out_ngrams.append((tokens[i_token],))
        i_token += 1
print(' '.join('_'.join(ngram) for ngram in out_ngrams))
Output:
i am a data_scientist . i do machine_learning and c_+_+ but no deep learning . i like research_and_development
ngrams after sorting:
[('c', '+', '+'),
('research', 'and', 'development'),
('data', 'scientist'),
('machine', 'learning'),
('c', '+'),
('+', '+'),
('research', 'and')]
That's needed so that ("c", "+", "+")
is tried before ("c", "+")
(or, in general, so that a sequence is tried before its prefixes). Note that a non-greedy split like [('c', '+'), ('+', 'a')]
might sometimes be preferable to [('c', '+', '+'), ('a',)]
, but that's another story.
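To see the greedy behaviour concretely, here is a self-contained version of the matching loop on the tokens c + + a (with a reduced ngram list, already sorted longest-first):

```python
# Greedy longest-match demo: ('c', '+', '+') wins over ('c', '+'),
# so 'a' is left as a single token even if a non-greedy split
# [('c', '+'), ('+', 'a')] might also have been valid.
ngrams = [("c", "+", "+"), ("c", "+"), ("+", "+")]

tokens = "c + + a".split()
out = []
i = 0
while i < len(tokens):
    for ngram in ngrams:
        if tuple(tokens[i:i + len(ngram)]) == ngram:
            out.append("_".join(ngram))
            i += len(ngram)
            break
    else:
        out.append(tokens[i])
        i += 1

print(" ".join(out))  # c_+_+ a
```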
Upvotes: 1
Reputation: 1
s = ''
seq = ("c", "+", "+")
print(s.join(seq))
More on the join method: Python docs
https://docs.python.org/3/library/stdtypes.html?highlight=join#str.join
Upvotes: 0
Reputation: 12721
You can build a list of replacement strings from your words in ngrams:
replace = [" ".join(x) for x in ngrams]
And then, for each element in that list, use str.replace:
for r in replace:
    sentence = sentence.replace(r, r.replace(" ", "_"))
There might be a more one-liner-y way to do it, but that seems relatively terse and easy to understand to me.
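Putting it together on the question's sentence (one caveat: str.replace works on substrings, so the order of ngrams matters; with this particular list the replacements happen to compose into the right result, e.g. "c +" then "+ +" yields "c_+_+"):

```python
# Replace-based approach: turn each ngram into a space-joined search string
# and substitute its underscore-joined form directly in the sentence.
ngrams = [("data", "scientist"),
          ("machine", "learning"),
          ("c", "+"),
          ("+", "+"),
          ("c", "+", "+"),
          ("research", "and", "development"),
          ("research", "and")]

sentence = ("i am a data scientist . i do machine learning and c + + "
            "but no deep learning . i like research and development")

replace = [" ".join(x) for x in ngrams]
for r in replace:
    sentence = sentence.replace(r, r.replace(" ", "_"))

print(sentence)
```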
Upvotes: 1