vdlmrc
vdlmrc

Reputation: 755

Bond Type of ligands in PDB files all appear as SINGLE

I am learning rdkit. At the moment I want to extract info from the ligand docked in the protein. The problem I face is that the bonds from the ligand are always returned as SINGLE, no matter there actual types, while the bond types of the protein are well returned.

Anybody know what I am missing?

For the below code example I am looking at the P38 kinase, 2GHL PDB file, and the ligand is names "LIB". But I observed the same issue with all the PDB files I tried.

My strategy is in 3 steps:

  1. Iterate through the atoms of the PDB file
  2. Store all the atoms with "LIB" as residue name in a list called ligand_atoms
  3. Iterate again through the atoms in the PDB file, using an embedded loop, only between the index of first and last atom of the ligand and print the atoms symbols, indexes and bond types.
from rdkit import Chem

# Specify the path to your PDB file
pdb_file = "path\\2ghl.pdb"

# Load the PDB file using the Chem module
mol = Chem.MolFromPDBFile(pdb_file)
 
# Create empty lists to store ligand and protein atoms
ligand_atoms = []

for i in range(len(mol.GetAtoms())):
    residue_name = mol.GetAtomWithIdx(i).GetPDBResidueInfo().GetResidueName()

    if residue_name == "LIB":        
        ligand_atoms.append(mol.GetAtomWithIdx(i))
     
# Find index of first and last atom of the ligand
first_atom_idx = ligand_atoms[0].GetIdx()
last_atom_idx = ligand_atoms[len(ligand_atoms)-1].GetIdx()
     
# Return bond type between each atom of the ligand if there is a bond
for i in range(first_atom_idx, last_atom_idx+1):
    for j in range(i+1, last_atom_idx+1):
        if mol.GetBondBetweenAtoms(i, j):
            print(mol.GetAtomWithIdx(i).GetSymbol(), " ", i, ", ",
                  mol.GetAtomWithIdx(j).GetSymbol(), " ", j, ": ",
                  mol.GetBondBetweenAtoms(i, j).GetBondType())

Here is the output:

enter image description here

While the molecule contains double bonds and aromatic bonds:

enter image description here

Upvotes: 3

Views: 592

Answers (1)

pippo1980
pippo1980

Reputation: 3023

Don't know what you are missing , the PDB format probably works like this,

see a nice webpage here describing various file formats that states:

This is probably the most boring part of the course. However, getting an understanding of how chemical structure are encoded in different formats is very useful because different software will often have different ‘default’ file formats and it can be useful to know what these formats are capable of (or not).

and about PDB file format :

Note Generally speaking, pdb files are best avoided in computational chemistry and formats like mol2 are more reliable for storing your structures. This is particularly true for small molecules such as the ligands that you will be working with in protein-ligand docking calculations. One important reason for this mol2 files contain both the molecular connectivity and the bond type/order (the @BOND section in the example above) so that the way that the atoms are bonded is defined explicitly and does not need to be guessed at by the software. This is not always the case for the different versions of pdb file that exist and can lead to misinterpretation of the molecular structure by different software packages you may be using.

Concerning your question/problem using only data provided by RCSB Protein Data Bank (RCSB PDB) that relies on The Chemical Component Dictionary :

The wwPDB Chemical Component dictionary (CCD) defines each unique small-molecule ligand found in the PDB with a distinct identifier CCD ID and a detailed chemical description (Westbrook et al., 2015). Each uniquely defined CCD ligand may be present in multiple PDB structures and may have multiple copies within one PDB structure.

I ended up with two solution :

first one uses Chemical Component Dictionary : mmCIF (plain text) , I don't know how often is updated and the Gemmi CIF Parser

import rdkit

print('RDKit Version : ', rdkit.__version__)

from rdkit import Chem

from gemmi import cif


path ='components.cif'

try:
    doc = cif.read_file(path)  # copy all the data from mmCIF file
    
    print(doc)
    
    for block in doc :
            
        if block.name == 'LIB':
    
            print('\n\n')
        
            print(block.name)
        
            print('\n\n')
            
            
            print(block)
            
            descriptor = block.find('_pdbx_chem_comp_descriptor.', 
                                         
                                         ['comp_id', 'type', 'program', 'program_version', 'descriptor'])
                                         
            
            print(descriptor)
            
            descriptor2 =  block.get_mmcif_category('_pdbx_chem_comp_descriptor.')
           
            print('\n\n',descriptor2)
            
            
            for row in descriptor :
                
                if row[1] == 'SMILES':
                
                    
                    print(row)
                    
                    
                    smiles = row[4].strip('\"')
                    
                    # mol = None
            
                    mol = Chem.MolFromSmiles(smiles)
                    
                    break


except :
    
    print('something went wrong')
    
    
    
    
if mol != None :
    
    
    # Create empty lists to store ligand and protein atoms
    ligand_atoms = []

    for i in range(len(mol.GetAtoms())):
        
        ligand_atoms.append(mol.GetAtomWithIdx(i))
         
    
    
    # Find index of first and last atom of the ligand
    first_atom_idx = ligand_atoms[0].GetIdx()
    last_atom_idx = ligand_atoms[len(ligand_atoms)-1].GetIdx()
    
    
         
    # Return bond type between each atom of the ligand if there is a bond
    for i in range(first_atom_idx, last_atom_idx+1):
        for j in range(i+1, last_atom_idx+1):
            if mol.GetBondBetweenAtoms(i, j):
                print(mol.GetAtomWithIdx(i).GetSymbol(), " ", i, ", ",
                      mol.GetAtomWithIdx(j).GetSymbol(), " ", j, ": ",
                      mol.GetBondBetweenAtoms(i, j).GetBondType())

Don't know if it is the best way of using the tool but gives the needed results. It is kind of slow on my machine because the 'components.cif' is about 405MB.

some of the ouyput :

....
.......
<gemmi.cif.Table.Row: LIB SMILES ACDLabs 10.04 "Clc1ccccc1NC(=O)N(c2nc(ncc2)NC(C(O)(C)C)C)c3ccc(OC)cc3">
Cl   0 ,  C   1 :  SINGLE
C   1 ,  C   2 :  AROMATIC
C   1 ,  C   6 :  AROMATIC
C   2 ,  C   3 :  AROMATIC
C   3 ,  C   4 :  AROMATIC
C   4 ,  C   5 :  AROMATIC
C   5 ,  C   6 :  AROMATIC
C   6 ,  N   7 :  SINGLE
N   7 ,  C   8 :  SINGLE
C   8 ,  O   9 :  DOUBLE
C   8 ,  N   10 :  SINGLE
N   10 ,  C   11 :  SINGLE
N   10 ,  C   24 :  SINGLE
C   11 ,  N   12 :  AROMATIC
C   11 ,  C   16 :  AROMATIC
N   12 ,  C   13 :  AROMATIC
C   13 ,  N   14 :  AROMATIC
C   13 ,  N   17 :  SINGLE
N   14 ,  C   15 :  AROMATIC
C   15 ,  C   16 :  AROMATIC
N   17 ,  C   18 :  SINGLE
C   18 ,  C   19 :  SINGLE
C   18 ,  C   23 :  SINGLE
C   19 ,  O   20 :  SINGLE
C   19 ,  C   21 :  SINGLE
C   19 ,  C   22 :  SINGLE
C   24 ,  C   25 :  AROMATIC
C   24 ,  C   31 :  AROMATIC
C   25 ,  C   26 :  AROMATIC
C   26 ,  C   27 :  AROMATIC
C   27 ,  O   28 :  SINGLE
C   27 ,  C   30 :  AROMATIC
O   28 ,  C   29 :  SINGLE
C   30 ,  C   31 :  AROMATIC

Second one uses the RCSB PDB Data API , so you need internet connection to use the Python Requests library :

import rdkit


print('RDKit Version : ', rdkit.__version__)


from rdkit import Chem

import requests



try : 

    risposta = requests.get('https://data.rcsb.org/rest/v1/core/chemcomp/LIB')
    
    print(risposta.status_code)
    
    print(type(risposta.json()))
    
    smiles = risposta.json()['rcsb_chem_comp_descriptor']['smiles']
    
    print('\nsmiles : ', smiles)
    
    # print(risposta.json())
    
except:
    

    print('something went wrong')
        
    
    
    
    
if smiles != None :
    
    
    mol = Chem.MolFromSmiles(smiles)
    
    
    # Create empty lists to store ligand and protein atoms
    ligand_atoms = []

    for i in range(len(mol.GetAtoms())):
        
        ligand_atoms.append(mol.GetAtomWithIdx(i))
         
    
    
    # Find index of first and last atom of the ligand
    first_atom_idx = ligand_atoms[0].GetIdx()
    last_atom_idx = ligand_atoms[len(ligand_atoms)-1].GetIdx()
    
    
         
    # Return bond type between each atom of the ligand if there is a bond
    for i in range(first_atom_idx, last_atom_idx+1):
        for j in range(i+1, last_atom_idx+1):
            if mol.GetBondBetweenAtoms(i, j):
                print(mol.GetAtomWithIdx(i).GetSymbol(), " ", i, ", ",
                      mol.GetAtomWithIdx(j).GetSymbol(), " ", j, ": ",
                      mol.GetBondBetweenAtoms(i, j).GetBondType())

With output :

......
..........
smiles :  CC(C(C)(C)O)Nc1nccc(n1)N(c2ccc(cc2)OC)C(=O)Nc3ccccc3Cl
C   0 ,  C   1 :  SINGLE
C   1 ,  C   2 :  SINGLE
C   1 ,  N   6 :  SINGLE
C   2 ,  C   3 :  SINGLE
C   2 ,  C   4 :  SINGLE
C   2 ,  O   5 :  SINGLE
N   6 ,  C   7 :  SINGLE
C   7 ,  N   8 :  AROMATIC
C   7 ,  N   12 :  AROMATIC
N   8 ,  C   9 :  AROMATIC
C   9 ,  C   10 :  AROMATIC
C   10 ,  C   11 :  AROMATIC
C   11 ,  N   12 :  AROMATIC
C   11 ,  N   13 :  SINGLE
N   13 ,  C   14 :  SINGLE
N   13 ,  C   22 :  SINGLE
C   14 ,  C   15 :  AROMATIC
C   14 ,  C   19 :  AROMATIC
C   15 ,  C   16 :  AROMATIC
C   16 ,  C   17 :  AROMATIC
C   17 ,  C   18 :  AROMATIC
C   17 ,  O   20 :  SINGLE
C   18 ,  C   19 :  AROMATIC
O   20 ,  C   21 :  SINGLE
C   22 ,  O   23 :  DOUBLE
C   22 ,  N   24 :  SINGLE
N   24 ,  C   25 :  SINGLE
C   25 ,  C   26 :  AROMATIC
C   25 ,  C   30 :  AROMATIC
C   26 ,  C   27 :  AROMATIC
C   27 ,  C   28 :  AROMATIC
C   28 ,  C   29 :  AROMATIC
C   29 ,  C   30 :  AROMATIC
C   30 ,  Cl   31 :  SINGLE

Pay attention to the fact that the two LIB molecule SMILES parsed and fetched

are different in term of atom ordering, both 'components.cif' and RCSB PDB Data API carries/exposes more than one kind (generated by different software SMILES) , feel free to explore the API and the dict for LIB ligand.

You can dowload all the ligand in different formats using the Ligand Expo see Ligand Expo Downloads.

Please by advised that I am not an expert on this matter i.e. I don't know if there is any API to download mol formats of the Ligands in RCSB Protein Data Bank (RCSB PDB) and even if the Ligand SMILES are part of the Dictionary Index mmcif_pdbx.dic or will be in future.

Concernig why there are different SMILES for same molecule provided by different Software that would be another question too, probably better suited for Chemistry Stack Exchange

ADDENDUM

I just realized you probably need the LIB molecule with the atomic coordinates present in the model.

I know you can find it here :

LIB , you can dowload the .sdf format of it. Need to figure out if .mol format is available too and if there is a way to get it from the Data API query to generate the webpage or from some other FTP webserver, probably the approach will be like the one in my second answer, I am new to this I'll try to find out how it work.

Ok seems like solution was in front of my eyes :

https://www.rcsb.org/structure/2GHL has a link to :

enter image description here

https://models.rcsb.org/v1/2ghl/ligand?auth_seq_id=401&label_asym_id=B&encoding=mol2&filename=2ghl_B_LIB.mol2 that points to the ModelServer 0.9.11 OAS 3.0 API

You'll need to figure out how to play with the API parameters

this one works too:

https://models.rcsb.org/v1/2ghl/ligand?label_entity_id=2&encoding=mol2&copy_all_categories=true&download=true&filename=pippo

Here an attempt using Gemmi to parse [https://www.rcsb.org/structure/2GHL] '2ghl.cif' model in .cif format and Requests to access ModelServer 0.9.11 OAS 3.0 (The ModelServer is a service for accessing subsets of macromolecular model data).

from gemmi import cif

import requests


path ='2ghl.cif'

pdb_id = path.split('.')[0]

encoding = 'mol2'

print('pdb_id : ', pdb_id)

try:
    doc = cif.read_file(path)
    
    for block in doc :
        
        print(block)
        
        descriptor2 =  block.get_mmcif_category('_pdbx_entity_nonpoly.')
       
        print('\n\n',descriptor2)
        
        for row in descriptor2:
                
            print('\nrow : ', row ,'\n')
        
        descriptor = block.find('_pdbx_entity_nonpoly.' , ['entity_id' , 'name' , 'comp_id'])
        
        print(descriptor)
        
        for row in descriptor:
                
            print('\nrow : ', row ,'\n')
            
            if row[2] == 'LIB':
            
                print('_pdbx_entity_nonpoly.entity_id : ', row[0])
                
                label_entity_id = row[0]
                
                filename = 'downloaded_'+row[2]+'_'+row[0]+'_.'+encoding
            
                print('\n')
                
                try : 
                         
                    string = f'https://models.rcsb.org/v1/{pdb_id}/ligand?label_entity_id={label_entity_id}&encoding={encoding}&copy_all_categories=false&download=false&filename={filename}'
                    
                    print(string)
                    
                    risposta = requests.get(string,
                                            
                                            allow_redirects=True)
                    
                    open(filename, 'wb').write(risposta.content)
                    
                    print(risposta.status_code)
                    
                    print(type(risposta))
                    
                    print(risposta.content)
                
                except :
                    
                    print('something went wrong')
                           
except :
    
    print('something went wrong')

Upvotes: 1

Related Questions