Reputation:
I want to separate the chain IDs that belong to specific biological assemblies in a PDB file. For example, PDB ID 1BRS has 3 biological assemblies:
Biological assembly 1: chains A and D
Biological assembly 2: chains B and E
Biological assembly 3: chains C and F
Is there a way (e.g. a Python script) to get the chain IDs that belong to each biological assembly, as follows?
1BRS_A:D
1BRS_B:E
1BRS_C:F
There is no need to extract the chain coordinates; the chain names are enough. Thanks in advance.
Upvotes: 0
Views: 772
Reputation: 11
Thanks for the code! It works nicely with several assemblies, but if the entry has only a single assembly, it does not recognize it properly. I updated it to this:
import biotite.database.rcsb as rcsb
import biotite.structure as struc
import biotite.structure.io.pdbx as pdbx
import json
ID = "3AV2"
#ID= "1BRS"
ID="2k6d"
#ID="1TBE"
ID="1HT2"
#ID="1HTI"
# Download structure
file_name = rcsb.fetch(ID, "pdbx", target_path=".")
# Read file
file = pdbx.PDBxFile()
file.read(file_name)
# Get 'entity_poly' category as dictionary
# to find out which chains are polymers
poly_chains = []
if isinstance(file["entity_poly"]["pdbx_strand_id"], str):
    # Only one polymer entity: the column is a plain string
    poly_chains = file["entity_poly"]["pdbx_strand_id"].split(",")
else:
    for chain_list in file["entity_poly"]["pdbx_strand_id"]:
        poly_chains += chain_list.split(",")
biolAssemblyDict = {}
if isinstance(file["pdbx_struct_assembly_gen"]["asym_id_list"], str):
    # Only one assembly: the column is a plain string
    index = 0
    asym_id_list = file["pdbx_struct_assembly_gen"]["asym_id_list"]
    chain_ids = asym_id_list.split(",")
    chain_ids = [chain_id for chain_id in chain_ids if chain_id in poly_chains]
    biolAssemblyDict[index + 1] = ','.join(chain_ids)
else:
    # Get 'pdbx_struct_assembly_gen' category as dictionary
    for index, asym_id_list in enumerate(file["pdbx_struct_assembly_gen"]["asym_id_list"]):
        chain_ids = asym_id_list.split(",")
        # print(chain_ids)
        # Filter chains that belong to a polymer
        chain_ids = [chain_id for chain_id in chain_ids if chain_id in poly_chains]
        biolAssemblyDict[index + 1] = ','.join(chain_ids)
print(json.dumps(biolAssemblyDict,indent=4, sort_keys=True))
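The two isinstance branches above could probably be folded into one small helper. The following is only a sketch on plain Python values, not on Biotite objects: as_list and chains_per_assembly are names I made up, and the second test entry is hypothetical. It also keys the result by the assembly_id column instead of the row index, since one assembly can be described by more than one row of pdbx_struct_assembly_gen.

```python
from collections import defaultdict


def as_list(value):
    """Return a list whether the column holds one string or many."""
    return [value] if isinstance(value, str) else list(value)


def chains_per_assembly(assembly_ids, asym_id_lists, strand_ids):
    """Map each assembly ID to its polymer chain IDs, in file order."""
    # Collect all polymer chain IDs from the comma-separated strand lists
    poly_chains = set()
    for strand in as_list(strand_ids):
        poly_chains.update(strand.split(","))

    # Merge rows that share an assembly ID, keeping polymer chains only
    grouped = defaultdict(list)
    for assembly_id, asym_ids in zip(as_list(assembly_ids), as_list(asym_id_lists)):
        for chain_id in asym_ids.split(","):
            if chain_id in poly_chains and chain_id not in grouped[assembly_id]:
                grouped[assembly_id].append(chain_id)
    return dict(grouped)


# Single-assembly entry: every column arrives as a plain string
single = chains_per_assembly("1", "A,B,C", "A,B")
# Hypothetical entry whose assembly "1" is spread over two rows
split_rows = chains_per_assembly(["1", "1", "2"], ["A,C", "B", "D"], ["A,B", "D"])
print(single)
print(split_rows)
```

With the real file, the three arguments would be the assembly_id and asym_id_list columns of pdbx_struct_assembly_gen and the pdbx_strand_id column of entity_poly.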
Upvotes: 0
Reputation: 970
The PDBx/mmCIF file format contains the information in the _pdbx_struct_assembly_gen category.
loop_
_pdbx_struct_assembly_gen.assembly_id
_pdbx_struct_assembly_gen.oper_expression
_pdbx_struct_assembly_gen.asym_id_list
1 1 A,D,G,J
2 1 B,E,H,K
3 1 C,F,I,L
These files can be read e.g. with Biotite (https://www.biotite-python.org/), a package I am developing. The categories can be read in a dictionary-like manner:
import biotite.database.rcsb as rcsb
import biotite.structure as struc
import biotite.structure.io.pdbx as pdbx
ID = "1BRS"
# Download structure
file_name = rcsb.fetch(ID, "pdbx", target_path=".")
# Read file
file = pdbx.PDBxFile()
file.read(file_name)
# Get 'pdbx_struct_assembly_gen' category as dictionary
assembly_dict = file["pdbx_struct_assembly_gen"]
for asym_id_list in assembly_dict["asym_id_list"]:
    chain_ids = asym_id_list.split(",")
    print(f"{ID}_{':'.join(chain_ids)}")
The output is
1BRS_A:D:G:J
1BRS_B:E:H:K
1BRS_C:F:I:L
The chains G-L contain only water molecules.
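To make the structure of that category more concrete, here is a rough library-free sketch that splits the excerpt shown above by hand. parse_simple_loop is a made-up helper, and it assumes a loop with whitespace-separated values only (no quoted strings, no multi-line text fields), so it is not a general mmCIF parser; for real files a proper reader is the safer choice.

```python
def parse_simple_loop(text):
    """Parse a whitespace-delimited mmCIF loop_ block into column lists.

    Assumes no quoted values and no multi-line (semicolon) text fields.
    """
    columns = []
    rows = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line or line == "loop_":
            continue
        if line.startswith("_"):
            # "_category.item" -> keep only the item name
            columns.append(line.split(".", 1)[1])
        else:
            rows.append(line.split())
    return {name: [row[i] for row in rows] for i, name in enumerate(columns)}


excerpt = """
loop_
_pdbx_struct_assembly_gen.assembly_id
_pdbx_struct_assembly_gen.oper_expression
_pdbx_struct_assembly_gen.asym_id_list
1 1 A,D,G,J
2 1 B,E,H,K
3 1 C,F,I,L
"""

table = parse_simple_loop(excerpt)
labels = ["1BRS_" + ":".join(ids.split(",")) for ids in table["asym_id_list"]]
print(labels)
```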
EDIT:
To include only chain IDs that belong to a polymer, e.g. a protein or a nucleic acid, you can use the entity_poly category:
loop_
_entity_poly.entity_id
_entity_poly.type
_entity_poly.nstd_linkage
_entity_poly.nstd_monomer
_entity_poly.pdbx_seq_one_letter_code
_entity_poly.pdbx_seq_one_letter_code_can
_entity_poly.pdbx_strand_id
_entity_poly.pdbx_target_identifier
1 'polypeptide(L)' no no
;AQVINTFDGVADYLQTYHKLPDNYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNREGKLPGKSGRTWREADINYTS
GFRNSDRILYSSDWLIYKTTDHYQTFTKIR
;
;AQVINTFDGVADYLQTYHKLPDNYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNREGKLPGKSGRTWREADINYTS
GFRNSDRILYSSDWLIYKTTDHYQTFTKIR
;
A,B,C ?
2 'polypeptide(L)' no no
;KKAVINGEQIRSISDLHQTLKKELALPEYYGENLDALWDALTGWVEYPLVLEWRQFEQSKQLTENGAESVLQVFREAKAE
GADITIILS
;
;KKAVINGEQIRSISDLHQTLKKELALPEYYGENLDALWDALTGWVEYPLVLEWRQFEQSKQLTENGAESVLQVFREAKAE
GADITIILS
;
D,E,F ?
This is the updated Python code:
import biotite.database.rcsb as rcsb
import biotite.structure as struc
import biotite.structure.io.pdbx as pdbx
ID = "1BRS"
# Download structure
file_name = rcsb.fetch(ID, "pdbx", target_path=".")
# Read file
file = pdbx.PDBxFile()
file.read(file_name)
# Get 'entity_poly' category as dictionary
# to find out which chains are polymers
poly_chains = []
for chain_list in file["entity_poly"]["pdbx_strand_id"]:
    poly_chains += chain_list.split(",")
# Get 'pdbx_struct_assembly_gen' category as dictionary
for asym_id_list in file["pdbx_struct_assembly_gen"]["asym_id_list"]:
    chain_ids = asym_id_list.split(",")
    # Filter chains that belong to a polymer
    chain_ids = [chain_id for chain_id in chain_ids if chain_id in poly_chains]
    print(f"{ID}_{':'.join(chain_ids)}")
And this is the output:
1BRS_A:D
1BRS_B:E
1BRS_C:F
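If you want to check the filtering step without downloading anything, it can be pulled out into a plain function. This is just a sketch: assembly_labels is a name I made up, and the inputs are the column values copied from the two excerpts above.

```python
def assembly_labels(pdb_id, asym_id_lists, strand_ids):
    """Build one 'ID_chain:chain' label per assembly, polymer chains only."""
    # All chain IDs that belong to a polymer entity
    poly_chains = {
        chain_id for strand in strand_ids for chain_id in strand.split(",")
    }
    labels = []
    for asym_ids in asym_id_lists:
        # Keep the file order, drop water/ligand chains
        chain_ids = [c for c in asym_ids.split(",") if c in poly_chains]
        labels.append(f"{pdb_id}_{':'.join(chain_ids)}")
    return labels


labels_1brs = assembly_labels(
    "1BRS",
    ["A,D,G,J", "B,E,H,K", "C,F,I,L"],  # asym_id_list column
    ["A,B,C", "D,E,F"],                 # pdbx_strand_id column
)
print("\n".join(labels_1brs))
```

One caveat, as far as I understand the format: asym_id_list refers to the label chain IDs (struct_asym.id), while pdbx_strand_id lists the author chain IDs. For 1BRS the two agree for the polymer chains, but for other entries it may be worth checking that they match.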
Upvotes: 2