shamp7598

Reputation: 39

comparing lists of strings

My code consists of 4 lists: splitinputString1, splitinputString2, splitinputString3, and mainlistsplit. The list mainlistsplit is much longer, as it contains every possible 4-letter string over the letters A, C, T, and G. The other 3 lists come from predetermined 10-letter input strings that have been split into overlapping 4-letter strings.

My goal is to find the 4-letter strings from mainlistsplit that occur in all 3 input strings at the same time. I also have to allow each occurrence to differ by at most 1 letter. For example: ACTG in main and ACTC in one of the input strings.

I have tried writing a function is_close_match(), but I am sure I am missing something small in my code; I am just not sure what it is.

My question is: how should I go about comparing each of these string lists, finding the strings that match with at most 1 mismatch, and then returning and printing them?

import itertools

# Creates 3 lists, one with each of the input strings
lst = ['A', 'C', 'T', 'G', 'A', 'C', 'G', 'C', 'A', 'G']
lst2 = ['T', 'C', 'A', 'C', 'A', 'A', 'C', 'G', 'G', 'G']
lst3 = ['G', 'A', 'G', 'T', 'C', 'C', 'A', 'G', 'T', 'T']

mainlist = ['A', 'C', 'T', 'G']
# Builds every possible length-4 string from the letters in mainlist
mainlistsplit = [''.join(i) for i in itertools.product(mainlist, repeat=4)]


# lists for the input strings when they are split
splitinputString1 = []
splitinputString2 = []
splitinputString3 = []

sequence_size = 4

# Takes the first 4 values of my lst, lst2, lst3, appends it to my split input strings, then increases the sequence by 1
for i in range(len(lst) - sequence_size + 1):
    sequence = ''.join(lst[i: i + 4])
    splitinputString1.append(sequence)

for i in range(len(lst2) - sequence_size + 1):
    sequence = ''.join(lst2[i: i + 4])
    splitinputString2.append(sequence)

for i in range(len(lst3) - sequence_size + 1):
    sequence = ''.join(lst3[i: i + 4])
    splitinputString3.append(sequence)

found = []


def is_close_match(mainlistsplit, s2):
    mismatches = 0
    for i in range(0, len(mainlistsplit)):
        if mainlistsplit[i] != s2[i]:
            mismatches += 1
        else:
            found = ''.join(s2)

    if mismatches > 1:
        return False
    else:
        return True

Upvotes: 0

Views: 77

Answers (2)

Arun Augustine

Reputation: 1766

Check this out.

import itertools
import difflib

# Creates 3 lists, one with each of the input strings
lst = ['A', 'C', 'T', 'G', 'A', 'C', 'G', 'C', 'A', 'G']
lst2 = ['T', 'C', 'A', 'C', 'A', 'A', 'C', 'G', 'G', 'G']
lst3 = ['G', 'A', 'G', 'T', 'C', 'C', 'A', 'G', 'T', 'T']

mainlist = ['A', 'C', 'T', 'G']
# Builds every possible length-4 string from the letters in mainlist
mainlistsplit = [''.join(i) for i in itertools.product(mainlist, repeat=4)]


# lists for the input strings when they are split
splitinputString1 = []
splitinputString2 = []
splitinputString3 = []

sequence_size = 4

# Takes the first 4 values of my lst, lst2, lst3, appends it to my split input strings, then increases the sequence by 1
for i in range(len(lst) - sequence_size + 1):
    sequence = ''.join(lst[i: i + 4])
    splitinputString1.append(sequence)

for i in range(len(lst2) - sequence_size + 1):
    sequence = ''.join(lst2[i: i + 4])
    splitinputString2.append(sequence)

for i in range(len(lst3) - sequence_size + 1):
    sequence = ''.join(lst3[i: i + 4])
    splitinputString3.append(sequence)


def is_close_match(mainlistitem, lists):
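    """
    Collects the fully matched and partially matched items from the sub lists.
    :param mainlistitem: list of all candidate 4-letter strings
    :param lists: list of lists, each holding the 4-letter pieces of one input string
    :return: set of strings that fully or approximately match a piece in any of the lists
    """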
    found = []
    partial_matched = []

    # Getting the partially matched items from a 4 letter string,
    # matching 75% (means 3 characters matches out of 4)
    for group in lists:
        partial_matched.extend(list(map(lambda x: difflib.get_close_matches(x, mainlistitem, cutoff=0.75), group)))
    found.extend(list(itertools.chain.from_iterable(partial_matched)))

    # Getting fully matched items from the 4 letter main string list.
    found.extend([i for group in lists for i in mainlistitem if i in group])
    return set(found)  # removing the duplicate matches in both cases


matching_list = is_close_match(mainlistsplit, [splitinputString1, splitinputString2, splitinputString3])
print(matching_list)
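
One caveat worth checking before relying on this: the difflib similarity ratio is not the same thing as a position-by-position mismatch count, so a cutoff of 0.75 can accept strings that are merely shifted by one letter. The sketch below illustrates the difference; positional_mismatches is a hypothetical helper that is not part of the code above. Note also that get_close_matches returns at most 3 results per call unless n is raised.

```python
import difflib


def positional_mismatches(s1, s2):
    # Hypothetical helper: counts positions at which two equal-length strings differ
    return sum(a != b for a, b in zip(s1, s2))


a, b = 'ACTG', 'CTGA'  # b is a shifted by one letter

# difflib scores the shared block 'CTG' regardless of where it sits
ratio = difflib.SequenceMatcher(None, a, b).ratio()
close = difflib.get_close_matches(a, [b], cutoff=0.75)

print(ratio)                        # similarity score used by get_close_matches
print(close)                        # non-empty, so difflib treats the pair as close
print(positional_mismatches(a, b))  # every position differs
```

If the requirement is strictly "at most 1 letter differs at the same position", a positional count like the helper above is probably the safer test.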

Upvotes: 0

dms

Reputation: 1380

If I've got the question right, you could check if two strings are close with something like this:

def is_close_match(string1, string2):
  # 'string1' and 'string2' are assumed to have same length.
  return [c1 == c2 for c1, c2 in zip(string1, string2)].count(False) <= 1

where you count the number of positions at which the characters are not equal.

# 1 difference
print(is_close_match('ACTG', 'ACTC'))
# True

# no differences
print(is_close_match('ACTG', 'ACTG'))
# True

# 2 differences
print(is_close_match('ACTG', 'AGTC'))
# False

Then you can use is_close_match to filter your input lists and check that every filtered list has at least one element:

allLists = (
  splitinputString1,
  splitinputString2,
  splitinputString3,
)

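for code in mainlistsplit:
  # Build real lists: in Python 3, filter() returns a lazy object that is
  # always truthy, so all() would never reject a candidate.
  matches = [[x for x in inputList if is_close_match(x, code)]
             for inputList in allLists]
  if all(matches):
    print('Found {}: {}'.format(code, matches))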
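
Putting the pieces together, here is a minimal self-contained sketch of the whole search, assuming the three input strings from the question. The names kmers, inputs, windows, candidates and common are made up for illustration, and the max_mismatches parameter is an addition that is not in the code above.

```python
import itertools


def kmers(seq, k=4):
    # All overlapping length-k pieces of seq (sliding window, step 1)
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def is_close_match(s1, s2, max_mismatches=1):
    # True when both strings have the same length and differ in at most
    # max_mismatches positions
    if len(s1) != len(s2):
        return False
    return sum(a != b for a, b in zip(s1, s2)) <= max_mismatches


# The three 10-letter input strings from the question
inputs = ['ACTGACGCAG', 'TCACAACGGG', 'GAGTCCAGTT']
windows = [kmers(s) for s in inputs]

# Every possible 4-letter string over A, C, T, G
candidates = [''.join(p) for p in itertools.product('ACTG', repeat=4)]

# Keep a candidate only if every input has at least one close piece
common = [c for c in candidates
          if all(any(is_close_match(c, w) for w in ws) for ws in windows)]

print(common)
```

The all(any(...)) form stops scanning an input as soon as one close piece is found, which should be enough if you only need the matching codes and not the pieces they matched.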

Upvotes: 1
