Implementing equivalence in Biopython's PDB module

Background

In the PDB module of Biopython, PDB structures are parsed into Structure objects, which store the components of the structure in an SMCRA architecture (Structure/Model/Chain/Residue/Atom). Each level of this hierarchy is represented by an object that inherits from the Entity container class.

Equivalence

My problem is that at no point can any two Entity objects be equal.

Structures built from the same file are not equal:

>>> from Bio import PDB
>>> parser = PDB.PDBParser()
>>> struct1 = parser.get_structure("1hgg", "pdb1hgg.ent")
>>> struct2 = parser.get_structure("1hgg", "pdb1hgg.ent")
>>> struct1 == struct2
False

Residues within that structure are not equal:

>>> first_res1 = next(struct1.get_residues())
>>> first_res2 = next(struct2.get_residues())
>>> first_res1 == first_res2
False

And so on.

If we were to parse the same PDB file separately, at no point could any of the Entity objects within the structures be equal.

Solution

The obvious solution to this problem is to never parse the same PDB file twice. Then, we have object identity and thus, equivalence. However, this answer seems incomplete to me.

Each Entity object can return an identification tuple with get_full_id(). This method gives all ids from the top object down; it should be unique for each Entity within a structure, and unique across all structures if the proper PDB id was supplied when constructing the Structure object.

My solution for testing Entity equivalence is merely to compare this full id. That is:

def __eq__(self, other):
    return self.get_full_id() == other.get_full_id()

Question

At this point, I'm asking if my implementation of Entity equivalence is sensible.

Are false positives (e.g. differing structures that were supplied the same PDB id) a worry?
Is it better to simply compare the full ids manually whenever we need to test equivalence?
And is there any reason why __eq__ was left unimplemented within the PDB module?

weronika · Accepted Answer

One common reason for not defining __eq__ is that it makes objects unhashable (so you can't use them as dictionary keys or put them in sets) unless you also define a consistent __hash__, and that is only safe if the values the hash depends on never change.

By default, __hash__ for objects is based on the object's identity, which works even for mutable objects, since the identity never changes. But if you define a custom __eq__, you can't keep hashing by identity, or you'll get a situation where two objects compare as equal but have different hashes, which is inconsistent with how hashing is supposed to work. So you have to define a custom __hash__ as well (which you can do). If your object is mutable, though, you shouldn't really do that either, so you'll just end up with an unhashable object. Which may be all right for you.
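To make the point above concrete, here is a minimal sketch assuming Python 3 semantics. PlainEntity and EqOnlyEntity are hypothetical stand-ins, not Biopython's actual Entity class; the tuples just imitate the shape of a full id.

```python
class PlainEntity:
    """Hypothetical stand-in: no __eq__, so equality and hashing use identity."""

    def __init__(self, full_id):
        self.full_id = full_id


class EqOnlyEntity:
    """Hypothetical stand-in: defines __eq__ but no __hash__."""

    def __init__(self, full_id):
        self.full_id = full_id

    def __eq__(self, other):
        if not isinstance(other, EqOnlyEntity):
            return NotImplemented
        return self.full_id == other.full_id


def is_hashable(obj):
    """Return True if hash(obj) works, False if it raises TypeError."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


a = PlainEntity(("1hgg", 0, "A"))
b = PlainEntity(("1hgg", 0, "A"))
c = EqOnlyEntity(("1hgg", 0, "A"))
d = EqOnlyEntity(("1hgg", 0, "A"))

plain_equal = (a == b)           # falls back to identity comparison
plain_hashable = is_hashable(a)  # default identity-based hash
eq_equal = (c == d)              # custom __eq__ compares the stored ids
eq_hashable = is_hashable(c)     # Python 3 sets __hash__ to None here
```

In this sketch the class with a custom __eq__ can no longer go into a set or serve as a dictionary key until a matching __hash__ is added.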

See the Python docs on object.__hash__ for more info.

So you can use a custom __eq__ as long as you don't need your objects to be hashable, or if they're immutable; otherwise things get more complicated. Or you could just leave __eq__ alone and name your full ID comparison function something else, so as to not break hashability.
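The second option could look roughly like the sketch below. FakeEntity and same_full_id are hypothetical names introduced only for illustration; the one thing borrowed from Bio.PDB is the get_full_id() method name from the question, and the example tuple merely imitates a residue's full id.

```python
class FakeEntity:
    """Hypothetical stand-in for a Bio.PDB Entity; only get_full_id() is mimicked."""

    def __init__(self, full_id):
        self._full_id = full_id

    def get_full_id(self):
        return self._full_id


def same_full_id(a, b):
    """Hypothetical helper: compare two entities by full id, leaving == untouched."""
    return a.get_full_id() == b.get_full_id()


# Two separately created objects that describe the same residue.
res1 = FakeEntity(("1hgg", 0, "A", (" ", 1, " ")))
res2 = FakeEntity(("1hgg", 0, "A", (" ", 1, " ")))

matches = same_full_id(res1, res2)   # equivalent by full id
still_distinct = res1 != res2        # == and != still compare by identity
as_set = {res1, res2}                # objects stay hashable
by_id = {e.get_full_id(): e for e in (res1, res2)}  # key on the id tuple instead
```

Because == is left alone, the objects keep their default hash, and code that wants id-based lookup can key a dictionary on the full id tuple itself.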

I don't know enough about what PDB IDs mean (in particular, whether false positives are possible) to tell whether your __eq__ implementation is reasonable from that standpoint.
