Emma Lim

Reputation: 161

Identify repeated words and get the distance between their occurrences as a list in Python

A = "AEJXKWKJSSSJKZJLJLEJSSLKXMPPLSSKKDNEMSMLDMMEPPLETFMM"
# Desired output: Repeat_Letter: [PPL: 16, JSS: 10]

String A is a sequence of random characters that I made up. Some substrings in it repeat: "PPL" and "JSS" each occur twice. 16 is the distance between the two occurrences of "PPL", and 10 is the distance between the two occurrences of "JSS". The goal is to find the repeated words automatically and to report the distance between their occurrences as a list in Python.
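The distance convention implied by the expected output can be checked by hand. The following is a minimal sketch, assuming the distance runs from the last character of the first occurrence to the first character of the next one; the helper name `gap_distance` is hypothetical and not part of the question.

```python
A = "AEJXKWKJSSSJKZJLJLEJSSLKXMPPLSSKKDNEMSMLDMMEPPLETFMM"

def gap_distance(text, word):
    """Distance from the last character of the first occurrence of `word`
    to the first character of the next non-overlapping occurrence.
    Returns None if `word` does not occur twice."""
    first = text.find(word)
    if first == -1:
        return None
    second = text.find(word, first + len(word))
    if second == -1:
        return None
    return second - (first + len(word) - 1)

print(gap_distance(A, "JSS"))  # JSS starts at 7 and 19: 19 - 9 = 10
print(gap_distance(A, "PPL"))  # PPL starts at 26 and 44: 44 - 28 = 16
```

Under this convention the distance is the number of characters strictly between the two occurrences, plus one.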

import collections

# All overlapping 3-character windows of A, then those seen more than once.
trigrams = [A[i:i+3] for i in range(len(A)-2)]
counts = collections.Counter(trigrams)
repeated = [trigram for trigram, count in counts.items() if count > 1]

This tells me which words are repeated. However, I don't know how to get the distance between the occurrences of each repeated word, for example between the first "PPL" and the next "PPL".

Upvotes: 2

Views: 428

Answers (3)

sharath k

Reputation: 54

This uses str.find and string slicing:

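def find_distance_in_dups(string, length):
    """Map each repeated substring of `length` characters to the
    distance between its first two non-overlapping occurrences."""
    dict_words = {}
    for i in range(len(string) - length + 1):
        word = string[i:i + length]
        # Look for the word again in the text after its first occurrence.
        rest = string[string.find(word) + length:]
        distance = rest.find(word) + 1  # find() returns -1 if absent, giving 0
        if distance > 0:
            dict_words[word] = distance
    return dict_words

print(find_distance_in_dups("AEJXKWKJSSSJKZJLJLEJSSLKXMPPLSSKKDNEMSMLDMMEPPLETFMM", 3))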

Upvotes: 1

makes

Reputation: 6548

You can use regular expressions as described in this answer:

import collections, re

Code copied from question:

A = "AEJXKWKJSSSJKZJLJLEJSSLKXMPPLSSKKDNEMSMLDMMEPPLETFMM"

trigrams = [A[i:i+3] for i in range(len(A)-2)]
counts = collections.Counter(trigrams)
repeated = [trigram for trigram, count in counts.items() if count > 1]

Find distances using regex:

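dists = {}
for r in repeated:
    # re.escape keeps the trigram literal; finditer yields non-overlapping matches.
    matches = list(re.finditer(re.escape(r), A))
    if len(matches) > 1:  # overlapping repeats (e.g. "SSS" in "SSSS") give only one match
        dists[r] = matches[1].start() - (matches[0].end() - 1)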

print(dists)

Output:

{'JSS': 10, 'PPL': 16}

This of course finds only the distance between the first two matches. You didn't specify if further occurrences should count.
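If further occurrences should count, one possible extension is to record every start position and take the differences between neighbours. This is only a sketch under the same distance convention as above; the function name `all_gap_distances` is hypothetical, and it uses a plain sliding window instead of regex.

```python
import collections

def all_gap_distances(text, length=3):
    """Map each repeated substring of `length` characters to the list of
    distances between its consecutive occurrences (overlaps included)."""
    positions = collections.defaultdict(list)
    for i in range(len(text) - length + 1):
        positions[text[i:i + length]].append(i)
    return {
        word: [b - a - (length - 1) for a, b in zip(starts, starts[1:])]
        for word, starts in positions.items()
        if len(starts) > 1
    }

A = "AEJXKWKJSSSJKZJLJLEJSSLKXMPPLSSKKDNEMSMLDMMEPPLETFMM"
print(all_gap_distances(A))  # {'JSS': [10], 'PPL': [16]}
```

Because the window slides one character at a time, overlapping occurrences are also recorded, and for those the computed distance is zero or negative.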

Upvotes: 0

Just wrote this one.

To solve your problem, though, I think you have to pass the length of the words you are looking for as a parameter.

import numpy as np

def find_duplicates(string, length=3):
    # Every window of `length` characters (+1 so the last window is included).
    all_words = np.array([string[i:i + length] for i in range(len(string) - length + 1)])
    unique_words = np.unique(all_words)
    dico = {}
    for unique in unique_words:
        loc = np.where(all_words == unique)[0]  # start indices of this word
        if len(loc) > 1:
            # distances between consecutive duplicates, as plain Python ints
            dico[str(unique)] = (np.diff(loc) - (length - 1)).tolist()
    return dico

A = "AEJXKWKJSSSJKZJLJLEJSSLKXMPPLSSKKDNEMSMLDMMEPPLETFMM"

print(find_duplicates(A, length=3))  # {'JSS': [10], 'PPL': [16]}


Upvotes: 0
