user12314164

How to count phrases in text and extract most frequent ones?

I have a dataset df with column text:

text
the main goal is to develop a smart calendar
the main goal is to develop a smart calendar
the main goal is to develop a chat bot
it is clear that the main goal is to develop a product
ai products for department A
launching ai products for department B

As you can see, the texts share a lot of common phrases. How can I detect them and extract the most common ones (let's say those that appear 2 or more times)? The desired output is:

text                                cnt
the main goal is to develop          4
ai products for department           2

The phrase "the main goal is to develop" is captured, while shorter ones such as "the main goal is to" are not, because it is the longest of them.

How can I do that?

Upvotes: 0

Views: 377

Answers (1)

lijqhs

Reputation: 106

You can use n-grams to do this. The main idea is:

  1. For each sentence, build its n-grams. For example, the 2-grams (bigrams) of 'the main goal is to develop a smart calendar' are: ['the main', 'main goal', 'goal is', 'is to', 'to develop', 'develop a', 'a smart', 'smart calendar']
  2. Collect these phrase pieces for every n, with n ranging from 1 to the number of words in the sentence
  3. Count how often each piece occurs, and store its count and its length in words in a dictionary
  4. Sort the result by count, then by length, both in descending order

In Python, you could do it like this:

text=['the main goal is to develop a smart calendar',
        'the main goal is to develop a smart calendar',
        'the main goal is to develop a chat bot',
        'it is clear that the main goal is to develop a product',
        'ai products for department A',
        'launching ai products for department B']


def get_ngram(word_list, n):
    # Slide a window of n words over the sentence and join each window into a phrase
    return [' '.join(word_list[i:i+n]) for i in range(len(word_list) - n + 1)]


def get_ngram_pieces(text):
    # Collect every n-gram of every sentence, for n = 1 .. number of words
    text_pieces = []
    for sentence in text:
        word_list = sentence.split()
        for n in range(1, len(word_list) + 1):
            ngram_list = get_ngram(word_list, n)
            text_pieces.extend(ngram_list)

    return text_pieces


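from collections import Counter


def get_count(text_pieces):
    # Counter tallies all phrases in one pass (list.count per key would rescan the list each time)
    counts = Counter(text_pieces)
    # Map each phrase to (number of occurrences, length in words)
    return {phrase: (cnt, len(phrase.split())) for phrase, cnt in counts.items()}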

all_pieces = get_ngram_pieces(text)
phrase_dict = get_count(all_pieces)
# Values are (count, length) tuples, so this sorts by count first, then by length, both descending
phrase_dict_sorted = dict(sorted(phrase_dict.items(), key=lambda item: item[1], reverse=True))

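The top 10 entries of phrase_dict_sorted (phrase, count, length in words) are as follows; entries with the same count and length may come out in a different order: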

is,5,1
the main goal is to develop a,4,7
the main goal is to develop,4,6
main goal is to develop a,4,6
goal is to develop a,4,5
the main goal is to,4,5
main goal is to develop,4,5
the main goal is,4,4
goal is to develop,4,4
is to develop a,4,4
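
The sorted list still contains every shorter piece of each long phrase. To get closer to the output asked for in the question, one possible follow-up is to keep only phrases that appear at least twice and are not contained in a longer phrase that is just as frequent. The sketch below is only one way to do that; the function names and the min_count and min_words parameters are made up for illustration, and min_words=2 is an assumption added so that single words such as 'is' are left out.

```python
from collections import Counter

# Same sample sentences as in the question.
text = ['the main goal is to develop a smart calendar',
        'the main goal is to develop a smart calendar',
        'the main goal is to develop a chat bot',
        'it is clear that the main goal is to develop a product',
        'ai products for department A',
        'launching ai products for department B']


def count_ngrams(sentences):
    """Count every word n-gram (n = 1 .. sentence length) over all sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for n in range(1, len(words) + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts


def is_sub_ngram(short, long):
    """Return True if the word tuple `short` occurs contiguously inside `long`."""
    k = len(short)
    return any(long[i:i + k] == short for i in range(len(long) - k + 1))


def maximal_phrases(sentences, min_count=2, min_words=2):
    """Keep frequent phrases not covered by a longer, equally frequent phrase."""
    counts = count_ngrams(sentences)
    frequent = {p: c for p, c in counts.items()
                if c >= min_count and len(p) >= min_words}
    result = []
    for phrase, cnt in frequent.items():
        # A phrase is redundant when a longer frequent phrase contains it
        # and occurs at least as often.
        covered = any(
            len(other) > len(phrase) and other_cnt >= cnt
            and is_sub_ngram(phrase, other)
            for other, other_cnt in frequent.items()
        )
        if not covered:
            result.append((' '.join(phrase), cnt))
    # Most frequent first, longer phrases first among equal counts.
    return sorted(result, key=lambda item: (-item[1], -len(item[0].split())))


for phrase, cnt in maximal_phrases(text):
    print(phrase, cnt)
```

On the sample data this keeps 'the main goal is to develop a' (4), the full repeated sentence 'the main goal is to develop a smart calendar' (2) and 'ai products for department' (2). The first phrase ends in 'a' because that word follows 'develop' in all four sentences, so the result is close to the desired output but not identical to it. The pairwise containment check is quadratic in the number of frequent phrases, which is fine for a small dataset but may need a smarter approach for a large one.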

Upvotes: 1
