Reputation: 755
I am learning rdkit. At the moment I want to extract info from the ligand docked in the protein. The problem I face is that the bonds from the ligand are always returned as SINGLE, no matter there actual types, while the bond types of the protein are well returned.
Anybody know what I am missing?
For the below code example I am looking at the P38 kinase, 2GHL PDB file, and the ligand is names "LIB". But I observed the same issue with all the PDB files I tried.
My strategy is in 3 steps:
from rdkit import Chem
# Specify the path to your PDB file
pdb_file = "path\\2ghl.pdb"
# Load the PDB file using the Chem module
mol = Chem.MolFromPDBFile(pdb_file)
# Create empty lists to store ligand and protein atoms
ligand_atoms = []
for i in range(len(mol.GetAtoms())):
residue_name = mol.GetAtomWithIdx(i).GetPDBResidueInfo().GetResidueName()
if residue_name == "LIB":
ligand_atoms.append(mol.GetAtomWithIdx(i))
# Find index of first and last atom of the ligand
first_atom_idx = ligand_atoms[0].GetIdx()
last_atom_idx = ligand_atoms[len(ligand_atoms)-1].GetIdx()
# Return bond type between each atom of the ligand if there is a bond
for i in range(first_atom_idx, last_atom_idx+1):
for j in range(i+1, last_atom_idx+1):
if mol.GetBondBetweenAtoms(i, j):
print(mol.GetAtomWithIdx(i).GetSymbol(), " ", i, ", ",
mol.GetAtomWithIdx(j).GetSymbol(), " ", j, ": ",
mol.GetBondBetweenAtoms(i, j).GetBondType())
Here is the output:
While the molecule contains double bonds and aromatic bonds:
Upvotes: 3
Views: 592
Reputation: 3023
Don't know what you are missing , the PDB format probably works like this,
see a nice webpage here describing various file formats that states:
This is probably the most boring part of the course. However, getting an understanding of how chemical structure are encoded in different formats is very useful because different software will often have different ‘default’ file formats and it can be useful to know what these formats are capable of (or not).
and about PDB file format :
Note Generally speaking, pdb files are best avoided in computational chemistry and formats like mol2 are more reliable for storing your structures. This is particularly true for small molecules such as the ligands that you will be working with in protein-ligand docking calculations. One important reason for this mol2 files contain both the molecular connectivity and the bond type/order (the @BOND section in the example above) so that the way that the atoms are bonded is defined explicitly and does not need to be guessed at by the software. This is not always the case for the different versions of pdb file that exist and can lead to misinterpretation of the molecular structure by different software packages you may be using.
Concerning your question/problem using only data provided by RCSB Protein Data Bank (RCSB PDB) that relies on The Chemical Component Dictionary :
The wwPDB Chemical Component dictionary (CCD) defines each unique small-molecule ligand found in the PDB with a distinct identifier CCD ID and a detailed chemical description (Westbrook et al., 2015). Each uniquely defined CCD ligand may be present in multiple PDB structures and may have multiple copies within one PDB structure.
I ended up with two solution :
first one uses Chemical Component Dictionary : mmCIF (plain text) , I don't know how often is updated and the Gemmi CIF Parser
import rdkit
print('RDKit Version : ', rdkit.__version__)
from rdkit import Chem
from gemmi import cif
path ='components.cif'
try:
doc = cif.read_file(path) # copy all the data from mmCIF file
print(doc)
for block in doc :
if block.name == 'LIB':
print('\n\n')
print(block.name)
print('\n\n')
print(block)
descriptor = block.find('_pdbx_chem_comp_descriptor.',
['comp_id', 'type', 'program', 'program_version', 'descriptor'])
print(descriptor)
descriptor2 = block.get_mmcif_category('_pdbx_chem_comp_descriptor.')
print('\n\n',descriptor2)
for row in descriptor :
if row[1] == 'SMILES':
print(row)
smiles = row[4].strip('\"')
# mol = None
mol = Chem.MolFromSmiles(smiles)
break
except :
print('something went wrong')
if mol != None :
# Create empty lists to store ligand and protein atoms
ligand_atoms = []
for i in range(len(mol.GetAtoms())):
ligand_atoms.append(mol.GetAtomWithIdx(i))
# Find index of first and last atom of the ligand
first_atom_idx = ligand_atoms[0].GetIdx()
last_atom_idx = ligand_atoms[len(ligand_atoms)-1].GetIdx()
# Return bond type between each atom of the ligand if there is a bond
for i in range(first_atom_idx, last_atom_idx+1):
for j in range(i+1, last_atom_idx+1):
if mol.GetBondBetweenAtoms(i, j):
print(mol.GetAtomWithIdx(i).GetSymbol(), " ", i, ", ",
mol.GetAtomWithIdx(j).GetSymbol(), " ", j, ": ",
mol.GetBondBetweenAtoms(i, j).GetBondType())
Don't know if it is the best way of using the tool but gives the needed results. It is kind of slow on my machine because the 'components.cif' is about 405MB.
some of the ouyput :
....
.......
<gemmi.cif.Table.Row: LIB SMILES ACDLabs 10.04 "Clc1ccccc1NC(=O)N(c2nc(ncc2)NC(C(O)(C)C)C)c3ccc(OC)cc3">
Cl 0 , C 1 : SINGLE
C 1 , C 2 : AROMATIC
C 1 , C 6 : AROMATIC
C 2 , C 3 : AROMATIC
C 3 , C 4 : AROMATIC
C 4 , C 5 : AROMATIC
C 5 , C 6 : AROMATIC
C 6 , N 7 : SINGLE
N 7 , C 8 : SINGLE
C 8 , O 9 : DOUBLE
C 8 , N 10 : SINGLE
N 10 , C 11 : SINGLE
N 10 , C 24 : SINGLE
C 11 , N 12 : AROMATIC
C 11 , C 16 : AROMATIC
N 12 , C 13 : AROMATIC
C 13 , N 14 : AROMATIC
C 13 , N 17 : SINGLE
N 14 , C 15 : AROMATIC
C 15 , C 16 : AROMATIC
N 17 , C 18 : SINGLE
C 18 , C 19 : SINGLE
C 18 , C 23 : SINGLE
C 19 , O 20 : SINGLE
C 19 , C 21 : SINGLE
C 19 , C 22 : SINGLE
C 24 , C 25 : AROMATIC
C 24 , C 31 : AROMATIC
C 25 , C 26 : AROMATIC
C 26 , C 27 : AROMATIC
C 27 , O 28 : SINGLE
C 27 , C 30 : AROMATIC
O 28 , C 29 : SINGLE
C 30 , C 31 : AROMATIC
Second one uses the RCSB PDB Data API , so you need internet connection to use the Python Requests library :
import rdkit
print('RDKit Version : ', rdkit.__version__)
from rdkit import Chem
import requests
try :
risposta = requests.get('https://data.rcsb.org/rest/v1/core/chemcomp/LIB')
print(risposta.status_code)
print(type(risposta.json()))
smiles = risposta.json()['rcsb_chem_comp_descriptor']['smiles']
print('\nsmiles : ', smiles)
# print(risposta.json())
except:
print('something went wrong')
if smiles != None :
mol = Chem.MolFromSmiles(smiles)
# Create empty lists to store ligand and protein atoms
ligand_atoms = []
for i in range(len(mol.GetAtoms())):
ligand_atoms.append(mol.GetAtomWithIdx(i))
# Find index of first and last atom of the ligand
first_atom_idx = ligand_atoms[0].GetIdx()
last_atom_idx = ligand_atoms[len(ligand_atoms)-1].GetIdx()
# Return bond type between each atom of the ligand if there is a bond
for i in range(first_atom_idx, last_atom_idx+1):
for j in range(i+1, last_atom_idx+1):
if mol.GetBondBetweenAtoms(i, j):
print(mol.GetAtomWithIdx(i).GetSymbol(), " ", i, ", ",
mol.GetAtomWithIdx(j).GetSymbol(), " ", j, ": ",
mol.GetBondBetweenAtoms(i, j).GetBondType())
With output :
......
..........
smiles : CC(C(C)(C)O)Nc1nccc(n1)N(c2ccc(cc2)OC)C(=O)Nc3ccccc3Cl
C 0 , C 1 : SINGLE
C 1 , C 2 : SINGLE
C 1 , N 6 : SINGLE
C 2 , C 3 : SINGLE
C 2 , C 4 : SINGLE
C 2 , O 5 : SINGLE
N 6 , C 7 : SINGLE
C 7 , N 8 : AROMATIC
C 7 , N 12 : AROMATIC
N 8 , C 9 : AROMATIC
C 9 , C 10 : AROMATIC
C 10 , C 11 : AROMATIC
C 11 , N 12 : AROMATIC
C 11 , N 13 : SINGLE
N 13 , C 14 : SINGLE
N 13 , C 22 : SINGLE
C 14 , C 15 : AROMATIC
C 14 , C 19 : AROMATIC
C 15 , C 16 : AROMATIC
C 16 , C 17 : AROMATIC
C 17 , C 18 : AROMATIC
C 17 , O 20 : SINGLE
C 18 , C 19 : AROMATIC
O 20 , C 21 : SINGLE
C 22 , O 23 : DOUBLE
C 22 , N 24 : SINGLE
N 24 , C 25 : SINGLE
C 25 , C 26 : AROMATIC
C 25 , C 30 : AROMATIC
C 26 , C 27 : AROMATIC
C 27 , C 28 : AROMATIC
C 28 , C 29 : AROMATIC
C 29 , C 30 : AROMATIC
C 30 , Cl 31 : SINGLE
Pay attention to the fact that the two LIB molecule SMILES parsed and fetched
are different in term of atom ordering, both 'components.cif' and RCSB PDB Data API carries/exposes more than one kind (generated by different software SMILES) , feel free to explore the API and the dict for LIB ligand.
You can dowload all the ligand in different formats using the Ligand Expo see Ligand Expo Downloads.
Please by advised that I am not an expert on this matter i.e. I don't know if there is any API to download mol
formats of the Ligands in RCSB Protein Data Bank (RCSB PDB) and even if the Ligand SMILES are part of the Dictionary Index mmcif_pdbx.dic or will be in future.
Concernig why there are different SMILES for same molecule provided by different Software that would be another question too, probably better suited for Chemistry Stack Exchange
ADDENDUM
I just realized you probably need the LIB molecule with the atomic coordinates present in the model.
I know you can find it here :
LIB , you can dowload the .sdf format of it. Need to figure out if .mol format is available too and if there is a way to get it from the Data API query to generate the webpage or from some other FTP webserver, probably the approach will be like the one in my second answer, I am new to this I'll try to find out how it work.
Ok seems like solution was in front of my eyes :
https://www.rcsb.org/structure/2GHL has a link to :
https://models.rcsb.org/v1/2ghl/ligand?auth_seq_id=401&label_asym_id=B&encoding=mol2&filename=2ghl_B_LIB.mol2 that points to the ModelServer 0.9.11 OAS 3.0 API
You'll need to figure out how to play with the API parameters
this one works too:
Here an attempt using Gemmi to parse [https://www.rcsb.org/structure/2GHL] '2ghl.cif' model in .cif format and Requests to access ModelServer 0.9.11 OAS 3.0 (The ModelServer is a service for accessing subsets of macromolecular model data).
from gemmi import cif
import requests
path ='2ghl.cif'
pdb_id = path.split('.')[0]
encoding = 'mol2'
print('pdb_id : ', pdb_id)
try:
doc = cif.read_file(path)
for block in doc :
print(block)
descriptor2 = block.get_mmcif_category('_pdbx_entity_nonpoly.')
print('\n\n',descriptor2)
for row in descriptor2:
print('\nrow : ', row ,'\n')
descriptor = block.find('_pdbx_entity_nonpoly.' , ['entity_id' , 'name' , 'comp_id'])
print(descriptor)
for row in descriptor:
print('\nrow : ', row ,'\n')
if row[2] == 'LIB':
print('_pdbx_entity_nonpoly.entity_id : ', row[0])
label_entity_id = row[0]
filename = 'downloaded_'+row[2]+'_'+row[0]+'_.'+encoding
print('\n')
try :
string = f'https://models.rcsb.org/v1/{pdb_id}/ligand?label_entity_id={label_entity_id}&encoding={encoding}©_all_categories=false&download=false&filename={filename}'
print(string)
risposta = requests.get(string,
allow_redirects=True)
open(filename, 'wb').write(risposta.content)
print(risposta.status_code)
print(type(risposta))
print(risposta.content)
except :
print('something went wrong')
except :
print('something went wrong')
Upvotes: 1