I have a dataset df with a column text:
text
the main goal is to develop a smart calendar
the main goal is to develop a smart calendar
the main goal is to develop a chat bot
it is clear that the main goal is to develop a product
ai products for department A
launching ai products for department B
As you can see, the texts share a lot of common phrases. How can I detect them and extract the most common ones (let's say those that appear 2 or more times)? The desired output is:
text                          cnt
the main goal is to develop   4
ai products for department    2
The reason "the main goal is to develop" was caught, but "the main goal is to" and other shorter sub-phrases were not, is that it is the longest of them.
How could I do that?
Upvotes: 0
Views: 377
Reputation: 106
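You can use n-grams to do this. The main idea is to split each sentence into word n-grams, with n ranging from 1 to len(sentence) (the number of words in the sentence), and then count how often each n-gram occurs. For example, the 2-grams of the first sentence are:
['the main', 'main goal', 'goal is', 'is to', 'to develop', 'develop a', 'a smart', 'smart calendar']
With Python, you can do it like this: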
from collections import Counter

text = [
    'the main goal is to develop a smart calendar',
    'the main goal is to develop a smart calendar',
    'the main goal is to develop a chat bot',
    'it is clear that the main goal is to develop a product',
    'ai products for department A',
    'launching ai products for department B',
]


def get_ngram(word_list, n):
    # Return every run of n consecutive words as a space-joined string.
    return [' '.join(word_list[i:i + n]) for i in range(len(word_list) - n + 1)]


def get_ngram_pieces(text):
    # Collect the n-grams of every sentence for n = 1 .. number of words.
    text_pieces = []
    for sentence in text:
        word_list = sentence.split()
        for n in range(1, len(word_list) + 1):
            text_pieces.extend(get_ngram(word_list, n))
    return text_pieces


def get_count(text_pieces):
    # Map each phrase to (number of occurrences, number of words).
    counts = Counter(text_pieces)
    return {phrase: (cnt, len(phrase.split())) for phrase, cnt in counts.items()}


all_pieces = get_ngram_pieces(text)
phrase_dict = get_count(all_pieces)
# Sort by count first, then by number of words, both descending.
phrase_dict_sorted = dict(sorted(phrase_dict.items(), key=lambda item: item[1], reverse=True))
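The top 10 entries of phrase_dict_sorted, shown as phrase,count,number of words (entries with the same count and length may come out in a different order):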
is,5,1
the main goal is to develop a,4,7
the main goal is to develop,4,6
main goal is to develop a,4,6
goal is to develop a,4,5
the main goal is to,4,5
main goal is to develop,4,5
the main goal is,4,4
goal is to develop,4,4
is to develop a,4,4
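The sorted list above still contains every shorter sub-phrase. To get closer to the output asked for in the question (only the longest phrases that appear 2 or more times), one possible sketch is to drop any phrase that is contained in a longer phrase with the same count. The helper names ngram_counts, is_subphrase and maximal_phrases, and the min_count and min_words parameters, are made up for this illustration, and the "same count" rule is an assumption about what "longest" should mean, not something stated in the question.

```python
from collections import Counter


def ngram_counts(sentences):
    """Count every word n-gram (n = 1 .. sentence length) over all sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for n in range(1, len(words) + 1):
            for i in range(len(words) - n + 1):
                counts[' '.join(words[i:i + n])] += 1
    return counts


def is_subphrase(short, long):
    """True if the words of `short` form a contiguous run inside `long`."""
    # Padding with spaces makes the substring test respect word boundaries.
    return short != long and f' {short} ' in f' {long} '


def maximal_phrases(counts, min_count=2, min_words=2):
    """Keep frequent phrases that are not inside a longer phrase with the same count."""
    frequent = {
        phrase: cnt for phrase, cnt in counts.items()
        if cnt >= min_count and len(phrase.split()) >= min_words
    }
    kept = {}
    for phrase, cnt in frequent.items():
        covered = any(
            cnt == other_cnt and is_subphrase(phrase, other)
            for other, other_cnt in frequent.items()
        )
        if not covered:
            kept[phrase] = cnt
    # Most frequent first, then longest first.
    return sorted(kept.items(), key=lambda item: (-item[1], -len(item[0].split())))


text = [
    'the main goal is to develop a smart calendar',
    'the main goal is to develop a smart calendar',
    'the main goal is to develop a chat bot',
    'it is clear that the main goal is to develop a product',
    'ai products for department A',
    'launching ai products for department B',
]

result = maximal_phrases(ngram_counts(text))
for phrase, cnt in result:
    print(phrase, cnt)
```

On the sample data this keeps "the main goal is to develop a" (4), the fully duplicated sentence "the main goal is to develop a smart calendar" (2) and "ai products for department" (2). Note that the first phrase ends in "a", because that word also follows "develop" in all four sentences; if the trailing article is unwanted, it would have to be stripped in a separate step.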
Upvotes: 1