Stelios M

Reputation: 160

Find longest sequence of common words from list of words in python

I searched extensively for a solution and did find similar questions. One answer returns the longest sequence of CHARACTERS, which need NOT appear in every string of the input list. Another answer returns the longest common sequence of WORDS, which MUST appear in every string of the input list.

I am looking for a combination of the above solutions. That is, I want the longest sequence of WORDS that is shared by some, but NOT necessarily all, of the phrases in the input list.

Here are some examples of what is expected:

['exterior lighting', 'interior lighting'] --> 'lighting'

['ambient lighting', 'ambient light'] --> 'ambient'

['led turn signal lamp', 'turn signal lamp', 'signal and ambient lamp', 'turn signal light'] --> 'turn signal lamp'

['ambient lighting', 'infrared light'] --> ''

Thank you

Upvotes: 0

Views: 420

Answers (2)

Stelios M

Reputation: 160

A rather crude solution, but I think it works:

from nltk.util import everygrams
import pandas as pd

def get_word_sequence(phrases):

    ngrams = []

    for phrase in phrases:
        # all word n-grams of this phrase; dict.fromkeys drops repeats (keeping order)
        # so that an n-gram repeated inside one phrase is counted only once
        ngrams.extend(dict.fromkeys(everygrams(phrase.split())))

    # number of phrases in which each n-gram appears
    counts_per_ngram_series = pd.Series(ngrams).value_counts()

    counts_per_ngram_df = pd.DataFrame({'ngram': counts_per_ngram_series.index, 'count': counts_per_ngram_series.values})

    # keep only the n-grams that appear in more than one phrase
    # (.copy() avoids a SettingWithCopyWarning when the column is added below)
    counts_per_ngram_df = counts_per_ngram_df[counts_per_ngram_df['count'] > 1].copy()

    if not counts_per_ngram_df.empty:
        # populate the ngramsize column (number of words in each n-gram)
        counts_per_ngram_df['ngramsize'] = counts_per_ngram_df['ngram'].str.len()

        # sort by ngramsize and then by count, both descending
        counts_per_ngram_df = counts_per_ngram_df.sort_values(['ngramsize', 'count'], ascending=[False, False])

        # join the words of the top n-gram
        return " ".join(counts_per_ngram_df['ngram'].iloc[0])

    return ''
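
For comparison, here is a minimal sketch of the same idea using only the standard library. The function name `longest_shared_ngram` is made up for illustration. The sketch assumes that "common" means "appears in at least two phrases" and that each n-gram should be counted once per phrase. The alphabetical tie-break between equally good candidates is also an assumption, added only to make the result deterministic.

```python
from collections import Counter


def longest_shared_ngram(phrases):
    """Return the longest word n-gram found in at least two phrases, or ''."""
    phrase_counts = Counter()
    for phrase in phrases:
        words = phrase.split()
        # Every contiguous word n-gram of this phrase, collected in a set so
        # that each n-gram is counted at most once per phrase.
        ngrams = {
            tuple(words[start:end])
            for start in range(len(words))
            for end in range(start + 1, len(words) + 1)
        }
        phrase_counts.update(ngrams)

    # Keep the n-grams that occur in more than one phrase.
    shared = [ngram for ngram, n in phrase_counts.items() if n > 1]
    if not shared:
        return ''

    # Prefer more words, then more phrases; the final alphabetical key is an
    # assumed tie-break that only makes the result deterministic.
    best = min(shared, key=lambda g: (-len(g), -phrase_counts[g], g))
    return ' '.join(best)
```

Called on the four example lists from the question, this sketch should give 'lighting', 'ambient', 'turn signal lamp' and '' respectively.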

Upvotes: 0

marksman123

Reputation: 399

This code ranks the words in your list by how common they are. It counts the occurrences of every word across the list, then drops the words that appear only once and sorts the rest, most frequent first.

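lst = ['led turn signal lamp', 'turn signal lamp', 'signal and ambient lamp', 'turn signal light']
d = {}        # word -> number of occurrences across all phrases
d_words = {}  # only the words that occur more than once

# count every word
for i in lst:
    for j in i.split():
        if j in d:
            d[j] = d[j] + 1
        else:
            d[j] = 1

# drop the words that appeared only once
for k, v in d.items():
    if v != 1:
        d_words[k] = v

# sort the remaining words by count, most frequent first
sorted_words = sorted(d_words, key=d_words.get, reverse=True)
print(sorted_words)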
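
The same counting can be written more compactly with `collections.Counter`. This is only a sketch, and the function name `repeated_words` is hypothetical:

```python
from collections import Counter


def repeated_words(phrases):
    """Return the words that occur more than once, most frequent first."""
    # Count every whitespace-separated word across all phrases.
    counts = Counter(word for phrase in phrases for word in phrase.split())
    # most_common() yields (word, count) pairs in descending order of count.
    return [word for word, n in counts.most_common() if n > 1]


lst = ['led turn signal lamp', 'turn signal lamp', 'signal and ambient lamp', 'turn signal light']
print(repeated_words(lst))
```

Note that this ranks single words only, so on its own it does not return a multi-word sequence such as 'turn signal lamp'.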

Upvotes: 1
