Reputation: 1
I'm a biologist and quite new to programming, but I'm trying to improve; my background is not in computer science.
I'm stuck on a problem.
We've some information about molecules; each line that begins with ATOM represents one atom of the entire molecule. For example, the first two lines:
ATOM 1 N ARG A 1 0.609 18.920 11.647 1.00 18.79 N
ATOM 2 CA ARG A 1 0.149 17.722 10.984 1.00 13.68 C
We are supposed to count the number of distinct atoms; more precisely, the last item of every line (C or N in the example).
We already have a function that extracts the last item, but I'm stuck at this point, because we should write the code as if we didn't already know which atoms we will find (though we do know, because we have the entire list, and it contains N, C, O and S).
Code we have:
import os

def count_atoms(molecule):
    number_atoms = dict()
    lines = molecule.split(os.linesep)
    for line in lines:
        if line.startswith('ATOM'):
            atom = line[77].strip()
            print(atom)
    return number_atoms

results = count_atoms(molecule)
molecule represents the entire list.
Upvotes: 0
Views: 1395
Reputation: 31679
Although all the answers are correct in terms of Python, we have lines from a PDB file:
Record Format
COLUMNS    DATA TYPE      FIELD      DEFINITION
-------------------------------------------------------------------------------------
 1 -  6    Record name    "ATOM "
[...]
77 - 78    LString(2)     element    Element symbol, right-justified.
[...]
For elements like SElenium, which occurs in plenty of protein structures, both characters [77-78] need to be taken into account; otherwise it will be read as Sulfur or as E.
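As a minimal sketch of reading both element columns by hand: this assumes the lines are in the real fixed-width PDB layout (not the whitespace-collapsed lines shown in the question). The names element_of, count_elements and make_atom_line are hypothetical, and make_atom_line only builds simplified test records.

```python
from collections import Counter


def element_of(line):
    """Return the element symbol from columns 77-78 of an ATOM record."""
    # The format description counts columns from 1, so columns 77-78
    # correspond to the 0-based slice [76:78].
    return line[76:78].strip()


def count_elements(pdb_text):
    """Count element symbols over all ATOM records in the text."""
    counts = Counter()
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            element = element_of(line)
            if element:  # skip records that are too short to hold the field
                counts[element] += 1
    return counts


def make_atom_line(serial, element):
    """Hypothetical helper: build a simplified fixed-width ATOM record."""
    # Pad the record to 76 characters, then right-justify the element
    # symbol in the two-character field at columns 77-78.
    return ("ATOM  %5d" % serial).ljust(76) + element.rjust(2)


sample = "\n".join([
    make_atom_line(1, "N"),
    make_atom_line(2, "C"),
    make_atom_line(3, "SE"),
    make_atom_line(4, "C"),
])
counts = count_elements(sample)
print(counts)       # counts per element
print(len(counts))  # number of distinct elements
```

Slicing two columns and stripping keeps SE intact while still handling one-letter symbols, which are right-justified in the same field.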
If you don't want to deal with the whole parsing issue yourself, you can use BioPython's PDB module in combination with any of the solutions above.
from Bio.PDB import PDBParser
from collections import Counter

parser = PDBParser()
structure = parser.get_structure('PHA-L', '1fat.pdb')
atoms = list()
# walk the structure hierarchy: model -> chain -> residue -> atom
for model in structure:
    for chain in model:
        for residue in chain:
            for atom in residue:
                atoms.append(atom.element)
print(Counter(atoms))
Counter({'C': 4570, 'O': 1463, 'N': 1207, 'MN': 4, 'CA': 4})
Upvotes: 1
Reputation: 17506
Welcome to Python!
Python has lots of useful modules that take care of common problems.
To solve your problem you can import Counter from collections:
>>> from collections import Counter
>>> molecule = '''ATOM 1 N ARG A 1 0.609 18.920 11.647 1.00 18.79 N
ATOM 2 CA ARG A 1 0.149 17.722 10.984 1.00 13.68 C
ATOM 2 CA ARG A 1 0.149 17.722 10.984 1.00 13.68 C'''
>>> Counter(line.split()[-1] for line in molecule.splitlines())
Counter({'C': 2, 'N': 1})
line.split()[-1] gets the last word of the line, which also works for elements with longer chemical symbols; splitlines() separates the lines from each other.
Counters can be added to and subtracted from each other, which might be useful for you:
>>> mycount = Counter(line.split()[-1] for line in molecule.splitlines())
>>> mycount + mycount
Counter({'C': 4, 'N': 2})
This will give you not only the number of distinct atoms, but also the number of appearances throughout the entire molecule.
The number of distinct atoms can be retrieved by taking the len of the Counter:
>>> len(Counter(line.split()[-1] for line in molecule.splitlines()))
2
More elaborate example:
molecule = '''ATOM 1 N ARG A 1 0.609 18.920 11.647 1.00 18.79 N
ATOM 2 CA ARG A 1 0.149 17.722 10.984 1.00 13.68 C
ATOM 3 CA ARG A 1 0.149 17.722 10.984 1.00 13.68 Se
ATOM 4 CA ARG A 1 0.149 17.722 10.984 1.00 13.68 Pu
ATOM 5 CA ARG A 1 0.149 17.722 10.984 1.00 13.68 Pu
ATOM 6 CA ARG A 1 0.149 17.722 10.984 1.00 13.68 C'''
>>> Counter(line.split()[-1] for line in molecule.splitlines())
Counter({'C': 2, 'Pu': 2, 'N': 1, 'Se': 1})
>>> len(Counter(line.split()[-1] for line in molecule.splitlines()))
4
Upvotes: 1
Reputation: 380
I hope I understand you correctly: you want to count the occurrences of the last item of each line?
molecule = '''ATOM 1 N ARG A 1 0.609 18.920 11.647 1.00 18.79 N
ATOM 2 CA ARG A 1 0.149 17.722 10.984 1.00 13.68 C
ATOM 2 CA ARG A 1 0.149 17.722 10.984 1.00 13.68 Se
ATOM 2 CA ARG A 1 0.149 17.722 10.984 1.00 13.68 Pu
ATOM 2 CA ARG A 1 0.149 17.722 10.984 1.00 13.68 Pu
ATOM 2 CA ARG A 1 0.149 17.722 10.984 1.00 13.68 C'''
def count_atoms(molecule):
    number_atoms = dict()
    # splitlines() handles any newline style and needs no os import
    for line in molecule.splitlines():
        if line.startswith('ATOM'):
            # the element symbol is the last whitespace-separated field
            atom = line.split()[-1]
            if atom in number_atoms:
                number_atoms[atom] += 1
            else:
                number_atoms[atom] = 1
    return number_atoms

print(count_atoms(molecule))
Output:
{'Se': 1, 'Pu': 2, 'N': 1, 'C': 2}
Upvotes: 2
Reputation: 502
lines = ['ATOM 1 N ARG A 1 0.609 18.920 11.647 1.00 18.79 N', 'ATOM 1 N ARG A 1 0.609 18.920 11.647 1.00 18.79 C', 'ATOM 1 N ARG A 1 0.609 18.920 11.647 1.00 18.79 N']
all_elements = {l.split()[-1] for l in lines}
counts = {element: 0 for element in all_elements}
for line in lines:
    counts[line.split()[-1]] += 1
counts
{'C': 1, 'N': 2}
This is how you count the number of atoms of each element. If you only need the number of distinct elements, you can use len(counts).
Upvotes: 0
Reputation: 6575
Since the lines in your example don't all have the same length, accessing the data by a fixed index, as you do in atom = line[77].strip(), is a bad idea.
As you said, the information that distinguishes the atoms is the last character, so you can access just that character with a negative index, as you would for the last item of a list.
>>> data = "ATOM 1 N ARG A 1 0.609 18.920 11.647 1.00 18.79 N"
>>> print(data[-1])
N
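One caveat, shown as a small sketch: a single trailing character only works for one-letter symbols. The record below is a made-up selenium line in the same whitespace-separated style as the question's data, not taken from a real file.

```python
# Hypothetical record with a two-letter element symbol (selenium)
line = "ATOM 3 SE MSE A 1 0.149 17.722 10.984 1.00 13.68 SE"

last_char = line[-1]           # only the final character
last_field = line.split()[-1]  # the whole last whitespace-separated field

print(last_char)   # E
print(last_field)  # SE
```

Taking the last field rather than the last character keeps two-letter symbols intact.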
Upvotes: 0