Reputation: 181
This wannabe bioinformatician needs your help. The code below finds the similarity of compounds' canonical SMILES, using RDKit. After some research I understand it must be O(n)! (or not?) because for a small file of 944 entries it took 20 minutes, while for the largest one, which has 330,000 entries, it has been running for over 30 hours. Now, I know that one of its problems is that it doesn't compare each pair of elements only once, so that is one factor which slows it down. I read here that you can use the itertools library to make the comparisons fast, but generally, how could this code be made better? Any help would be appreciated while I try to learn :)
from rdkit import Chem
from rdkit import DataStructs
from rdkit.Chem import AllChem
import pandas as pd

l = []
s1 = []
s2 = []
d1 = []
d2 = []
with open('input_file.csv', 'r') as f:
    df = pd.read_csv(f, delimiter=',', lineterminator='\n', header=0)
for i in range(0, df.shape[0]):
    l.append(df.iloc[i, 1])
for i in range(0, df.shape[0]):
    for j in range(0, df.shape[0]):
        m1 = Chem.MolFromSmiles(df.iloc[i, 1])
        fp1 = AllChem.GetMorganFingerprint(m1, 2)
        m2 = Chem.MolFromSmiles(df.iloc[j, 1])
        fp2 = AllChem.GetMorganFingerprint(m2, 2)
        sim = DataStructs.DiceSimilarity(fp1, fp2)
        if sim >= 0.99:
            s1.append(i)
            s2.append(j)
for k in range(0, len(s1)):
    if df.iloc[s1[k], 0] != df.iloc[s2[k], 0]:
        d1.append(df.iloc[s1[k], 0])
        d2.append(df.iloc[s2[k], 0])
if len(d1) != 0:
    with open('outputfile.tsv', 'a') as f2:
        for o in range(0, len(d1)):
            f2.write(str(d1[o]) + '\t' + str(d2[o]) + '\n')
Upvotes: 0
Views: 758
Reputation: 14502
I have no idea what the algorithm is supposed to do, so I am not going to comment on that. But you are stating that:
After some research I understand it must be O(n)!
What does the n represent? If the algorithm's time complexity is linear with respect to the number of rows in your dataset, then your implementation must be incorrect. You have two nested loops in your code, both of length n, which means that your algorithm is in O(n^2) at best (not considering what the other functions inside the loop are doing).
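To make the n^2 growth concrete, here is a quick, self-contained count (plain Python, no RDKit needed) of how many comparisons each strategy performs for the question's small input of 944 rows: the full double loop does n * n comparisons, while itertools.combinations, which the question already mentions, visits each unordered pair exactly once, n * (n - 1) / 2.

```python
from itertools import combinations

n = 944  # size of the small input file mentioned in the question

# Full double loop: every ordered pair, including i == j
full = sum(1 for i in range(n) for j in range(n))

# itertools.combinations: each unordered pair exactly once, no self-pairs
unique = sum(1 for _ in combinations(range(n), 2))

print(full)    # 891136 = n * n
print(unique)  # 445096 = n * (n - 1) / 2
```

Halving the pair count alone will not change the asymptotic class (it is still quadratic), but combined with not recomputing the fingerprints inside the loop it removes most of the wasted work.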
Here are some suggestions on how to speed up the code to a certain degree (these apply in general when working with pandas). You should avoid writing your own iteration loops, and you should avoid turning pandas data structures into Python lists. Here is an example:
for i in range(0, df.shape[0]):
    l.append(df.iloc[i, 1])
If you really need to store this in another variable then you can use
l = df.iloc[:, 1].copy()
This will be faster and it will not turn that series into a list (but I don't see l being used anywhere in your code, so you should probably drop it entirely).
Another example is when you are computing those functions inside of the nested loop (again, I don't know what they are doing but it doesn't really matter).
for i ...
    for j ...
        m1 = Chem.MolFromSmiles(df.iloc[i, 1])
        fp1 = AllChem.GetMorganFingerprint(m1, 2)
        m2 = Chem.MolFromSmiles(df.iloc[j, 1])
        fp2 = AllChem.GetMorganFingerprint(m2, 2)
First, you are computing the same values over and over again, which can be time-consuming, and you are doing it within your custom loop, which is not the best idea either.
Instead of those 4 lines (6 including the loop statements), you can create a new column of fp values (note: column 1 is the one holding the SMILES strings in your code):

df["fp"] = df.iloc[:, 1].apply(lambda x: AllChem.GetMorganFingerprint(Chem.MolFromSmiles(x), 2))
This way, you don't have to compute the values twice and you don't have to write your own loop (at least for this part).
At this point, you will need to figure out how the mentioned O(n) algorithm works, but I guess that it can be translated into pure vector operations, which is probably going to be the most efficient implementation.
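Putting the two ideas together (compute every fingerprint exactly once up front, then visit each pair exactly once with itertools.combinations), the restructured script could look like the sketch below. So that it runs without RDKit installed, the fingerprint and Dice similarity are stood in for by a toy set-based version; in your real code, fingerprint(s) would be AllChem.GetMorganFingerprint(Chem.MolFromSmiles(s), 2) and dice(a, b) would be DataStructs.DiceSimilarity(a, b). The column names and example rows here are made up for illustration.

```python
from itertools import combinations

import pandas as pd

# Toy stand-ins so this sketch runs without rdkit; swap in the real
# rdkit calls named in the lead-in for actual use.
def fingerprint(smiles):
    # Hypothetical placeholder: the set of characters as a fake fingerprint
    return frozenset(smiles)

def dice(a, b):
    # Dice coefficient on plain sets: 2|A & B| / (|A| + |B|)
    return 2 * len(a & b) / (len(a) + len(b))

df = pd.DataFrame({
    "id": ["c1", "c2", "c3"],                 # made-up identifiers
    "smiles": ["CCO", "OCC", "c1ccccc1"],     # made-up example entries
})

# Compute each fingerprint exactly once, up front: n fingerprint calls
# instead of 2 * n^2 inside the nested loop.
df["fp"] = df["smiles"].apply(fingerprint)

pairs = []
# combinations() yields each unordered pair (i < j) exactly once,
# so roughly half the iterations of the original double loop, and no self-pairs.
for (i, row_i), (j, row_j) in combinations(df.iterrows(), 2):
    if dice(row_i["fp"], row_j["fp"]) >= 0.99 and row_i["id"] != row_j["id"]:
        pairs.append((row_i["id"], row_j["id"]))

print(pairs)  # [('c1', 'c2')] -- "CCO" and "OCC" share the same character set
```

The same shape keeps the original thresholding and the "different id" check, but the expensive molecule parsing and fingerprinting now happen n times instead of 2 * n^2 times.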
Upvotes: 2