I am currently working on code that detects the chemical groups in a molecule and lists them, given the molecule's SMILES as input.
Overall the code works well, but it has trouble detecting rings, whether aromatic rings, heterocycles, or even the plain ring in cyclohexanol. Alkenes have a similar issue: the code detects that an alkene is present, but it cannot differentiate between cis and trans alkenes or aromatics.
Could someone tell me which SMARTS patterns I could use to find rings, differentiate them by ring size, and ideally also capture specifics such as whether heteroatoms are present and whether the ring is aromatic? I am also looking for a way to distinguish cis from trans alkenes.
My code has a very long list of functional groups, but I will include only a few here so you can see what it looks like:
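For the cis/trans part, here is a minimal sketch of what reading double-bond stereo from RDKit might look like. It assumes the input SMILES actually encodes the geometry with / and \ (a plain CC=CC carries no stereo information, so nothing can be recovered from it). The function name double_bond_stereo and the label strings are hypothetical and only for illustration.

```python
from rdkit import Chem

# Map RDKit bond-stereo flags to readable labels (labels are illustrative).
# STEREOCIS/STEREOTRANS are relative to the bond's stereo atoms; for simple
# 1,2-disubstituted alkenes they coincide with Z and E respectively.
STEREO_LABELS = {
    Chem.BondStereo.STEREOE: "E/trans",
    Chem.BondStereo.STEREOTRANS: "E/trans",
    Chem.BondStereo.STEREOZ: "Z/cis",
    Chem.BondStereo.STEREOCIS: "Z/cis",
}


def double_bond_stereo(smiles):
    """Return (begin_idx, end_idx, label) for every non-aromatic C=C bond."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    results = []
    for bond in mol.GetBonds():
        # After sanitization aromatic bonds have type AROMATIC, so testing
        # for DOUBLE already separates alkenes from aromatic rings.
        if bond.GetBondType() != Chem.BondType.DOUBLE:
            continue
        begin, end = bond.GetBeginAtom(), bond.GetEndAtom()
        if begin.GetAtomicNum() != 6 or end.GetAtomicNum() != 6:
            continue
        label = STEREO_LABELS.get(bond.GetStereo(), "unspecified")
        results.append((begin.GetIdx(), end.GetIdx(), label))
    return results


if __name__ == "__main__":
    for smi in ["C/C=C/C", "C/C=C\\C", "CC=CC", "c1ccccc1"]:
        print(smi, double_bond_stereo(smi))
```

Reading the stereo flag from the bond avoids maintaining separate cis and trans SMARTS patterns, at the cost of depending on the stereo being present in the input.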
from rdkit import Chem


def find_smiles_patterns(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return "Invalid SMILES string. Unable to parse molecule."

    # Define a list to store the chemical groups found in the SMILES
    chemical_groups = []

    # SMARTS patterns to recognize chemical groups
    smarts_patterns = {
        'C=C': 'Alkene',
        '[CX2]#[CX2]': 'Alkyne',
        '[CX3]=[CX2]=[CX3]': 'Allene',
        '[ClX1][CX4]': 'Alkylchloride',
        '[FX1][CX4]': 'Alkylfluoride',
        '[BrX1][CX4]': 'Alkylbromide',
        '[IX1][CX4]': 'Alkyliodide',
        '[OX2H][CX4H2;!$(C([OX2H])[O,S,#7,#15])]': 'Primary_alcohol',
        '[OX2H][CX4H;!$(C([OX2H])[O,S,#7,#15])]': 'Secondary_alcohol',
        '[OX2H][CX4D4;!$(C([OX2H])[O,S,#7,#15])]': 'Tertiary_alcohol',
        '[OX2]([CX4;!$(C([OX2])[O,S,#7,#15,F,Cl,Br,I])])[CX4;!$(C([OX2])[O,S,#7,#15])]': 'Dialkylether',
        '[SX2]([CX4;!$(C([OX2])[O,S,#7,#15,F,Cl,Br,I])])[CX4;!$(C([OX2])[O,S,#7,#15])]': 'Dialkylthioether',
        '[OX2](c)[CX4;!$(C([OX2])[O,S,#7,#15,F,Cl,Br,I])]': 'Alkylarylether',
        '[c][OX2][c]': 'Diarylether',
        '[SX2](c)[CX4;!$(C([OX2])[O,S,#7,#15,F,Cl,Br,I])]': 'Alkylarylthioether',
        '[c][SX2][c]': 'Diarylthioether',
        '[O+;!$([O]~[!#6]);!$([S]*~[#7,#8,#15,#16])]': 'Oxonium',
        '[NX3H2+0,NX4H3+;!$([N][!C]);!$([N]*~[#7,#8,#15,#16])]': 'Primary_aliph_amine',
        '[NX3H1+0,NX4H2+;!$([N][!C]);!$([N]*~[#7,#8,#15,#16])]': 'Secondary_aliph_amine',
        '[NX3H0+0,NX4H1+;!$([N][!C]);!$([N]*~[#7,#8,#15,#16])]': 'Tertiary_aliph_amine',
        '[NX4H0+;!$([N][!C]);!$([N]*~[#7,#8,#15,#16])]': 'Quaternary_aliph_ammonium',
        # The name must match the entry in priority_order, otherwise the
        # pattern is never looked up
        '[!#6;!R0]': 'Heterocycle'
        #etc....
    }

    # Define priority order for chemical groups based on IUPAC nomenclature
    priority_order = [
        'Carboxylic_acid',
        'Carboxylic_ester',
        'Lactone',
        'Carboxylic_anhydride',
        'Carbothioic_acid',
        'Aldehyde',
        'Ketone',
        'Alkylchloride',
        'Alkylfluoride',
        'Alkylbromide',
        'Alkyliodide',
        'Alcohol',
        'Primary_alcohol',
        'Secondary_alcohol',
        'Tertiary_alcohol',
        'Dialkylether',
        'Alkene',
        'Alkyne',
        'Allene',
        'Dialkylthioether',
        'Alkylarylether',
        'Diarylether',
        'Alkylarylthioether',
        'Diarylthioether',
        'Oxonium',
        'Primary_aliph_amine',
        'Secondary_aliph_amine',
        'Tertiary_aliph_amine',
        'Quaternary_aliph_ammonium',
        'Heterocycle'
        #etc......
    ]

    # Track the atom indices to avoid duplicates
    atom_indices = set()

    # Iterate over the priority order and check if each chemical group is present in the molecule
    for group in priority_order:
        if group in smarts_patterns.values():
            for smarts_pattern, chemical_group in smarts_patterns.items():
                if chemical_group == group:
                    pattern = Chem.MolFromSmarts(smarts_pattern)
                    if pattern:
                        matches = mol.GetSubstructMatches(pattern)
                        if len(matches) > 0:
                            print('matches !!! : ', smarts_pattern, smarts_patterns[smarts_pattern], pattern)
                            print(matches, '\n\n')
                        for match in matches:
                            match_set = set(match)
                            # Skip matches that overlap atoms already assigned to a higher-priority group
                            if not any(atom_index in match_set for atom_index in atom_indices):
                                chemical_groups.append(chemical_group)
                                atom_indices.update(match_set)
    return chemical_groups


smiles = "c1(cccc2c1ccc1c2cccc1CCl)CO"
print(find_smiles_patterns(smiles))
Used on an invented molecule:
SMILES: c1(cccc2c1ccc1c2cccc1CCl)CO
Results:
matches !!! : [ClX1][CX4] Alkylchloride <rdkit.Chem.rdchem.Mol object at .............>
((15, 14),)
matches !!! : [OX2H][CX4H2;!$(C([OX2H])[O,S,#7,#15])] Primary_alcohol <rdkit.Chem.rdchem.Mol object at .............>
((17, 16),)
['Alkylchloride', 'Primary_alcohol']
I tried changing the SMARTS, and I also tried a placeholder function for detecting rings that checks for a SMILES of the form C1[X]nX1, where n is 2-8 and X is one of the atoms C, N, O, S.
However, nothing has worked so far, and there does not seem to be a database for the SMARTS.
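As a point of comparison, here is a minimal sketch that relies on RDKit's ring perception (GetRingInfo) and a few generic SMARTS ring primitives rather than one SMILES template per ring. The names describe_rings, RING_SMARTS and count_ring_atoms are hypothetical and only for illustration. Calling a ring aromatic when all of its atoms are aromatic is a simplifying assumption that may misjudge some fused systems.

```python
from rdkit import Chem

# Generic single-atom SMARTS for ring membership (dictionary keys are illustrative).
RING_SMARTS = {
    "ring_atom": "[R]",                # any atom that is in a ring
    "five_membered_ring_atom": "[r5]", # atom whose smallest ring has size 5
    "six_membered_ring_atom": "[r6]",  # atom whose smallest ring has size 6
    "aromatic_atom": "[a]",            # any aromatic atom
    "ring_heteroatom": "[!#6;R]",      # non-carbon atom in a ring
}


def _parse(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    return mol


def describe_rings(smiles):
    """Return one dict per perceived ring: size, aromaticity, heteroatoms."""
    mol = _parse(smiles)
    rings = []
    for atom_ids in mol.GetRingInfo().AtomRings():
        atoms = [mol.GetAtomWithIdx(i) for i in atom_ids]
        rings.append({
            "size": len(atom_ids),
            # Assumption: a ring counts as aromatic if all its atoms are aromatic.
            "aromatic": all(a.GetIsAromatic() for a in atoms),
            "heteroatoms": sorted(
                a.GetSymbol() for a in atoms if a.GetAtomicNum() != 6
            ),
        })
    return rings


def count_ring_atoms(smiles):
    """Count the atoms matched by each single-atom pattern in RING_SMARTS."""
    mol = _parse(smiles)
    return {
        name: len(mol.GetSubstructMatches(Chem.MolFromSmarts(smarts)))
        for name, smarts in RING_SMARTS.items()
    }


if __name__ == "__main__":
    for smi in ["OC1CCCCC1", "c1ccncc1", "c1(cccc2c1ccc1c2cccc1CCl)CO"]:
        print(smi, describe_rings(smi), count_ring_atoms(smi))
```

Because a ring here is reported once as a whole rather than atom by atom, it would sit outside the overlap check on atom indices that the group-matching loop uses.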
ADDENDUM
The atom indices for the SMILES c1(cccc2c1ccc1c2cccc1CCl)CO
should be these, but please double-check.
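One way to double-check the numbering is to let RDKit list the atoms in index order. This is a minimal sketch; the function name list_atom_indices is made up for illustration. It relies on RDKit numbering atoms in the order they appear in the SMILES string.

```python
from rdkit import Chem


def list_atom_indices(smiles):
    """Return (index, element symbol, is_aromatic) for every heavy atom."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    return [
        (atom.GetIdx(), atom.GetSymbol(), atom.GetIsAromatic())
        for atom in mol.GetAtoms()
    ]


if __name__ == "__main__":
    for idx, symbol, aromatic in list_atom_indices("c1(cccc2c1ccc1c2cccc1CCl)CO"):
        print(idx, symbol, "aromatic" if aromatic else "aliphatic")
```

The indices printed this way can be compared directly with the match tuples in the results, for example (15, 14) for the alkyl chloride and (17, 16) for the primary alcohol.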