Biopython: resseq doesn't match pdb file

Question

I have a PDB file, and I need to extract its residue sequence numbers (resseqs). From manual inspection of the first few lines of the file (pasted below), I would expect the resseqs to be [22, 23, ...]. However, Biopython's Bio.PDB module reports different values (output below as well). Is this a Biopython bug, or am I misunderstanding the PDB format?

ATOM      1  N   GLY A  22      78.171  89.858  59.231  1.00 21.24           N  
ATOM      2  CA  GLY A  22      79.174  88.827  58.999  1.00 20.87           C  
ATOM      3  C   GLY A  22      80.438  89.415  58.391  1.00 21.89           C  
ATOM      4  O   GLY A  22      80.362  90.202  57.440  1.00 23.18           O  
ATOM      5  N   LEU A  23      81.588  89.069  58.972  1.00 21.51           N  
ATOM      6  CA  LEU A  23      82.895  89.555  58.527  1.00 20.80           C  
ATOM      7  C   LEU A  23      83.288  89.020  57.162  1.00 22.41           C  
ATOM      8  O   LEU A  23      82.889  87.923  56.788  1.00 22.93           O  
ATOM      9  CB  LEU A  23      83.973  89.232  59.560  1.00 20.97           C  
ATOM     10  CG  LEU A  23      84.225  87.818  60.062  1.00 13.32           C  
ATOM     11  CD1 LEU A  23      85.448  87.888  60.939  1.00 15.24           C  
ATOM     12  CD2 LEU A  23      83.035  87.258  60.829  1.00 12.21           C

The code I am using to extract resseq:

...
for i in chain:
    print(i.get_full_id())

OUT:('pdb', 0, 'A', (' ', 2, ' '))
    ('pdb', 0, 'A', (' ', 3, ' '))
...

fsimkovic · Accepted Answer

From the documentation of Bio.PDB.Entity.get_full_id:

def get_full_id(self):
    """Return the full id.

    The full id is a tuple containing all id's starting from
    the top object (Structure) down to the current object. A full id for
    a Residue object e.g. is something like:

    ("1abc", 0, "A", (" ", 10, "A"))

    This corresponds to:

    Structure with id "1abc"
    Model with id 0
    Chain with id "A"
    Residue with id (" ", 10, "A")

    The Residue id indicates that the residue is not a hetero-residue
    (or a water) because it has a blank hetero field, that its sequence
    identifier is 10 and its insertion code "A".
    """
    # The function implementation below here ...

I assume that you are iterating over the atoms of your chain rather than over its residues, which gives you the full id of each Atom rather than of each Residue.
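To make the tuple layout from the docstring concrete, here is a small sketch that unpacks the docstring's own example value. It uses a plain tuple rather than a real Biopython object, and the variable names are only illustrative.

```python
# Full id copied from the docstring example; a plain tuple, not a Biopython object.
full_id = ("1abc", 0, "A", (" ", 10, "A"))

# Outer tuple: structure id, model id, chain id, residue id.
structure_id, model_id, chain_id, residue_id = full_id

# Residue id: hetero field, sequence number (resseq), insertion code.
hetfield, resseq, icode = residue_id

print(resseq)  # 10
print(icode)   # A
```

So for a Residue, the resseq is the middle element of the last item of the full id.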

If you save the example residues from your question in a file called struct.pdb and run the code below, you get the correct ids:

>>> from Bio.PDB import PDBParser
>>> structure = PDBParser().get_structure('test', 'struct.pdb')
>>> for residue in structure.get_residues():
...    print(residue.get_full_id())
('test', 0, 'A', (' ', 22, ' '))
('test', 0, 'A', (' ', 23, ' '))
>>> resseqs = [residue.id[1] for residue in structure.get_residues()]
>>> print(resseqs)
[22, 23]
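As a cross-check that does not depend on Biopython, the resseq can also be read straight from the fixed columns of the ATOM records. In the PDB format the chain ID is in column 22, the residue sequence number in columns 23-26, and the insertion code in column 27. The following is a minimal sketch under those assumptions. The helper name resseqs_from_pdb_lines is made up for illustration, and it ignores multi-model files.

```python
def resseqs_from_pdb_lines(lines):
    """Return resseqs of distinct residues, in order of first appearance."""
    seen = []
    for line in lines:
        # Only coordinate records carry residue numbering.
        if not line.startswith(("ATOM", "HETATM")):
            continue
        chain_id = line[21]        # column 22
        resseq = int(line[22:26])  # columns 23-26
        icode = line[26]           # column 27
        key = (chain_id, resseq, icode)
        if key not in seen:
            seen.append(key)
    return [resseq for _, resseq, _ in seen]


# A few records copied from the question.
sample = [
    "ATOM      1  N   GLY A  22      78.171  89.858  59.231  1.00 21.24           N  ",
    "ATOM      2  CA  GLY A  22      79.174  88.827  58.999  1.00 20.87           C  ",
    "ATOM      5  N   LEU A  23      81.588  89.069  58.972  1.00 21.51           N  ",
    "ATOM      6  CA  LEU A  23      82.895  89.555  58.527  1.00 20.80           C  ",
]

print(resseqs_from_pdb_lines(sample))  # [22, 23]
```

If this and Biopython disagree for your file, it is worth checking that the file you parse is really the one you inspected.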
