user3091644

Reputation: 439

A python function that discards molecules that are too similar based on Tanimoto coefficient?

I am trying to write a Python function that takes two lists as input: one containing the molecules' SMILES codes and another containing the molecule names.

It then calculates the Tanimoto coefficient between all pairs of molecules (I already have a function for this) and returns two new lists with the SMILES and names, respectively, of all molecules whose Tanimoto coefficient with any other molecule is not higher than a given threshold.

This is what I have done so far, but it gives wrong results (most of the molecules I get are almost the same...):

def TanimotoFilter(molist,namelist,threshold):
    # molist is the smiles list
    # namelist is the name list (SURPRISE!) is this some global variable name?
    # threshold is the tanimoto threshold (SURPRISE AGAIN!)
    smilesout=[]
    names=[]
    tans=[]
    exclude=[]
    for i in range(1,len(molist)):
        if i not in exclude:
            smilesout.append(molist[i])
            names.append(namelist[i])
            for j in range(i,len(molist)):
                if i==j:
                   tans.append('SAME')
                else:
                   tanimoto=tanimoto_calc(molist[i],molist[j])
                   if tanimoto>threshold:
                      exclude.append(j)
                      #print 'breaking for '+str(i)+' '+str(j)
                      break
                   else:
                      tans.append(tanimoto)

    return smilesout, names, tans

I'd be very thankful if the modifications you propose are as basic as possible, as this code is for people who have scarcely seen a terminal in their lives... It doesn't matter if it is full of loops that make it slow.

Thank you all!

Upvotes: 2

Views: 610

Answers (1)

Vivek Pabani

Reputation: 452

I have made some changes to the logic of the function. As requested in the question, it returns two lists, one with the SMILES and one with the names. I am not sure about the purpose of tans, since a Tanimoto value belongs to a pair of molecules and not to a single molecule, so I do not return it. I could not test the code without data, so let me know if this works.

def TanimotoFilter(molist, namelist, threshold):
    # molist is the list of SMILES strings
    # namelist is the list of molecule names, in the same order as molist
    # threshold is the Tanimoto value above which two molecules count as too similar
    smilesout = []
    names = []
    exclude = []  # indices of molecules already found to be too similar to another one

    for i in range(0, len(molist)):
        if i not in exclude:
            # collect every later molecule that is too similar to molecule i
            temp_exclude = []
            for j in range(i + 1, len(molist)):
                tanimoto = tanimoto_calc(molist[i], molist[j])
                if tanimoto > threshold:
                    temp_exclude.append(j)
            if temp_exclude:
                # molecule i has at least one close neighbour: drop it and those neighbours
                temp_exclude.append(i)
                exclude.extend(temp_exclude)
            else:
                smilesout.append(molist[i])
                names.append(namelist[i])

    return smilesout, names
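
Since the function above is untested, here is a small self-contained sketch that may help when checking the behaviour on toy data. Everything in it is an assumption made for illustration: the tanimoto_calc below is only a stand-in that treats each string as a set of characters (it is not a real fingerprint similarity, so swap in your own function), and filter_strict and filter_keep_one are hypothetical names. The first follows the wording of the question literally (keep a molecule only if it is not too similar to any other molecule), the second keeps the first molecule of each similar group and drops the later ones, in case that is what is actually wanted.

```python
def tanimoto_calc(smiles_a, smiles_b):
    # Toy stand-in (assumption): Tanimoto on the sets of characters.
    # Replace this with your real fingerprint-based function.
    a = set(smiles_a)
    b = set(smiles_b)
    if not a and not b:
        return 1.0
    return float(len(a & b)) / len(a | b)


def filter_strict(molist, namelist, threshold):
    # Keep molecule i only if NO other molecule is too similar to it.
    smilesout = []
    names = []
    for i in range(len(molist)):
        too_similar = False
        for j in range(len(molist)):
            if i != j and tanimoto_calc(molist[i], molist[j]) > threshold:
                too_similar = True
                break
        if not too_similar:
            smilesout.append(molist[i])
            names.append(namelist[i])
    return smilesout, names


def filter_keep_one(molist, namelist, threshold):
    # Keep molecule i unless it is too similar to a molecule already kept.
    smilesout = []
    names = []
    for i in range(len(molist)):
        similar_to_kept = False
        for kept in smilesout:
            if tanimoto_calc(molist[i], kept) > threshold:
                similar_to_kept = True
                break
        if not similar_to_kept:
            smilesout.append(molist[i])
            names.append(namelist[i])
    return smilesout, names


# Made-up test strings, not real SMILES: A~B and B~C are similar, A~C is not.
smiles = ["abcd", "bcde", "cdef", "wxyz"]
labels = ["A", "B", "C", "D"]
print(filter_strict(smiles, labels, 0.5))
print(filter_keep_one(smiles, labels, 0.5))
```

The chain case in the toy data (A similar to B, B similar to C, but A not similar to C) is worth trying with whichever version you use. As far as I can tell, a function that skips molecules already marked as excluded never compares C against B, so C would be kept even though it has a close neighbour; comparing every molecule against every other one, as filter_strict does, avoids that at the cost of more calls to tanimoto_calc.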

Upvotes: 0
