Reputation: 161
A = "AEJXKWKJSSSJKZJLJLEJSSLKXMPPLSSKKDNEMSMLDMMEPPLETFMM"
print Repeat_Letter : [PPL:16, JSS:10] --> What I want
String A is a sequence of random characters that I made up. Some substrings occur in it more than once: in string A, the repeated substrings are "PPL" and "JSS". 16 is the distance between the two occurrences of "PPL", and 10 is the distance between the two occurrences of "JSS". The goal is to find the repeated substrings automatically and to express the distances between their occurrences as a list in Python.
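To make the distance convention concrete: the numbers 10 and 16 are consistent with counting from the last character of the first occurrence to the first character of the second one. A small check under that reading (the helper name `occurrence_starts` is only illustrative, not part of the question):

```python
A = "AEJXKWKJSSSJKZJLJLEJSSLKXMPPLSSKKDNEMSMLDMMEPPLETFMM"

def occurrence_starts(text, word):
    """Start index of every occurrence of `word` in `text` (overlaps included)."""
    return [i for i in range(len(text) - len(word) + 1) if text.startswith(word, i)]

for word in ("JSS", "PPL"):
    starts = occurrence_starts(A, word)
    # index of the last character of the first occurrence
    last_char_of_first = starts[0] + len(word) - 1
    print(word, starts, starts[1] - last_char_of_first)
# JSS [7, 19] 10
# PPL [26, 44] 16
```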
import collections
trigrams = [A[i:i+3] for i in range(len(A)-2)]
counts = collections.Counter(trigrams)
repeated = [trigram for trigram, count in counts.items() if count > 1]
With this, I can check which trigrams are repeated. However, I don't know how to get the distance between the occurrences of each repeated trigram, for example between one "PPL" and the next "PPL".
Upvotes: 2
Views: 428
Reputation: 54
This uses the find method and string slicing:
def find_distance_in_dups(string, length):
    dict_words = {}
    # slide a window of `length` characters over the whole string
    for i in range(len(string) - length + 1):
        word = string[i:length + i]
        # look for the word again after its first occurrence;
        # find() returns -1 if there is none, which makes distance 0
        distance = string[string.find(word) + length:].find(word) + 1
        if distance > 0:
            dict_words[word] = distance
    return dict_words
print(find_distance_in_dups("AEJXKWKJSSSJKZJLJLEJSSLKXMPPLSSKKDNEMSMLDMMEPPLETFMM", 3))
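Slicing the string builds a new string on every iteration. `str.find` also accepts a start index, so the same search can be done without the slice. A sketch under the same distance convention (the name `find_distance_with_start` is hypothetical):

```python
def find_distance_with_start(string, length):
    distances = {}
    for i in range(len(string) - length + 1):
        word = string[i:i + length]
        first = string.find(word)
        # search again, starting right after the first occurrence ends
        nxt = string.find(word, first + length)
        if nxt != -1:
            distances[word] = nxt - (first + length) + 1
    return distances

print(find_distance_with_start(
    "AEJXKWKJSSSJKZJLJLEJSSLKXMPPLSSKKDNEMSMLDMMEPPLETFMM", 3))
# {'JSS': 10, 'PPL': 16}
```

Like the slicing version, this only reports the distance between the first two occurrences and does not detect occurrences that overlap the first one.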
Upvotes: 1
Reputation: 6548
You can use regular expressions as described in this answer:
import collections, re
Code copied from question:
A = "AEJXKWKJSSSJKZJLJLEJSSLKXMPPLSSKKDNEMSMLDMMEPPLETFMM"
trigrams = [A[i:i+3] for i in range(len(A)-2)]
counts = collections.Counter(trigrams)
repeated = [trigram for trigram, count in counts.items() if count > 1]
Find distances using regex:
dists = {}
for r in repeated:
    matches = list(re.finditer(r, A))
    dists[r] = matches[1].start() - (matches[0].end() - 1)
print(dists)
Output:
{'JSS': 10, 'PPL': 16}
This of course finds only the distance between the first two matches. You didn't specify if further occurrences should count.
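If further occurrences should count, one possible extension is to collect the distance between every pair of consecutive matches. The sketch below is an assumption about what "further occurrences" would mean, not something the question specifies; the function name `all_distances` is made up for illustration. It wraps the pattern in a zero-width lookahead so that overlapping occurrences are also reported, and uses `re.escape` in case the text contains regex metacharacters.

```python
import collections
import re

def all_distances(text, length=3):
    """Map each repeated substring of `length` characters to the list of
    distances between its consecutive occurrences."""
    grams = [text[i:i + length] for i in range(len(text) - length + 1)]
    counts = collections.Counter(grams)
    dists = {}
    for gram, count in counts.items():
        if count < 2:
            continue
        # (?=...) matches without consuming characters, so finditer
        # also reports occurrences that overlap each other
        pattern = "(?=" + re.escape(gram) + ")"
        starts = [m.start() for m in re.finditer(pattern, text)]
        # same convention as above: next start minus last index of previous match
        dists[gram] = [b - (a + length - 1) for a, b in zip(starts, starts[1:])]
    return dists

A = "AEJXKWKJSSSJKZJLJLEJSSLKXMPPLSSKKDNEMSMLDMMEPPLETFMM"
print(all_distances(A))  # {'JSS': [10], 'PPL': [16]}
```

Under this convention, overlapping occurrences produce zero or negative distances, so they may need to be filtered out depending on the use case.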
Upvotes: 0
Reputation: 71
I just wrote this one.
To solve your problem, though, I think you have to pass the length of the word you are looking for as a parameter.
import numpy as np
def find_duplicates(string, length=3):
    # every window of `length` characters, including the last one
    all_words = np.array([string[i:length + i] for i in range(len(string) - length + 1)])
    unique_words = np.unique(all_words)
    dico = {}
    for unique in unique_words:
        loc = np.where(all_words == unique)[0]  # start indices of this word
        if len(loc) > 1:
            # distance between consecutive duplicates, as plain Python ints
            dico[str(unique)] = (np.diff(loc) - (length - 1)).tolist()
    return dico
A = "AEJXKWKJSSSJKZJLJLEJSSLKXMPPLSSKKDNEMSMLDMMEPPLETFMM"
print(find_duplicates(A, length=3))  # {'JSS': [10], 'PPL': [16]}
Upvotes: 0