Reputation: 205
I have a text file (CA-Final.txt) full of amino acids as well as some other data. Here is a snippet of the text file:
ATOM 109 CA ASER A 48 10.832 19.066 -2.324 0.50 61.96 C
ATOM 121 CA AALA A 49 12.327 22.569 -2.163 0.50 60.22 C
ATOM 131 CA AGLN A 50 8.976 24.342 -1.742 0.50 56.71 C
ATOM 145 CA APRO A 51 7.689 25.565 1.689 0.50 51.89 C
ATOM 158 CA GLN A 52 5.174 23.336 3.467 1.00 43.45 C
ATOM 167 CA HIS A 53 2.339 24.135 5.889 1.00 38.39 C
ATOM 177 CA PHE A 54 0.900 22.203 8.827 1.00 33.79 C
ATOM 188 CA TYR A 55 -1.217 22.065 11.975 1.00 34.89 C
ATOM 200 CA ALA A 56 0.334 20.465 15.090 1.00 31.84 C
ATOM 205 CA VAL A 57 0.000 20.066 18.885 1.00 30.46 C
ATOM 212 CA VAL A 58 2.738 21.762 20.915 1.00 27.28 C
Essentially, my problem is that a few of the amino acids have the letter A in front of them where it is not supposed to be. Amino acid abbreviations are supposed to be 3 letters long. I have tried to use regular expressions to remove the extra A wherever it appears in front of an amino acid abbreviation. Here is my code so far:
def Trimmer(txtFileName):
    i = open('CA-final.txt', 'w')
    j = open(txtFileName, 'r')
    for record in j:
        with open(txtFileName, 'r') as j:
            content= j.read()
            content_new = re.sub('^ATOM\s+\d+\s+CA\s+A[ADTSEPGCVMILYFHKRWQN]', r'^ATOM\s+\d+\s+CA\s+[ADTSEPGCVMILYFHKRWQN]', content, flags = re.M)
When I run the function, it raises an error:
File "C:\Users\UserName\AppData\Local\conda\conda\envs\biopython\lib\sre_parse.py", line 1024, in parse_template
raise s.error('bad escape %s' % this, len(this))
error: bad escape \s
My idea is that this function will find every instance of an A in front of a string of 3 characters and replace it with just the 3 other characters. Why exactly am I getting this error?
Upvotes: 1
Views: 372
Reputation: 744
As far as I know, the easiest way to achieve your goal right now is to parse the file with Biopython, since it is a PDB file.
Let's analyze the following script:
#!/usr/bin/env python3
import Bio
print("Biopython v" + Bio.__version__)
from Bio.PDB import PDBParser
from Bio.PDB import PDBIO

# Parse and get basic information
parser = PDBParser()
protein_1p49 = parser.get_structure('STS', '1p49.pdb')
protein_1p49_resolution = protein_1p49.header["resolution"]
protein_1p49_keywords = protein_1p49.header["keywords"]
print("Sample name: " + str(protein_1p49))
print("Resolution: " + str(protein_1p49_resolution))
print("Keywords: " + str(protein_1p49_keywords))
print("Model: " + str(protein_1p49[0]))

# Initialize IO
io = PDBIO()

# Custom select: PDBIO calls these methods to decide what gets written
class Select():
    def accept_model(self, model):
        return True

    def accept_chain(self, chain):
        return True

    def accept_residue(self, residue):
        # print("residue id:" + str(residue.get_id()))
        print("residue name:" + str(residue.get_resname()))
        # Stop if a residue name is longer than the expected 3 letters
        if len(str(residue.get_resname())) > 3:
            print("Alert! abbr longer than 3 letters" + residue.get_resname())
            exit(1)
        return True

    def accept_atom(self, atom):
        # print("atom id:" + atom.get_id())
        # print("atom name:" + atom.get_name())
        # Keep only the alpha-carbon atoms
        if atom.get_name() == 'CA':
            return True
        else:
            return False

# Write to output file
io.set_structure(protein_1p49)
io.save("1p49_out.pdb", Select())
exit(0)
It parses a PDB structure and uses the built-in Biopython class PDBIO to save custom parts of the protein structure. Notice that you can put custom logic within the Select class.
In this example, I used the accept_residue method to report abnormally named residues in my protein structure. You can easily extend this and perform a simple string trimming inside this method.
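The string trimming itself could be sketched as a small helper that does not depend on Biopython. The name trim_resname is hypothetical, and the sketch assumes that the extra characters always sit in front of an otherwise valid three-letter abbreviation:

```python
def trim_resname(resname):
    """Return the residue name reduced to its last three characters."""
    name = resname.strip()
    # Assumption: a name longer than 3 characters carries extra leading
    # characters (such as the stray 'A'), so only the final three are kept.
    return name[-3:] if len(name) > 3 else name


print(trim_resname("ASER"))  # SER
print(trim_resname("ALA"))   # ALA
```

A helper like this could be called from accept_residue in place of the alert, although how the trimmed name is written back to the structure is left open here.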
Upvotes: 1
Reputation: 10930
Your regex will fail if the first of the three letters is an 'A'. Try this instead:
(^ATOM\s+\d+\s+CA\s+)A(\w\w\w)
It creates two groups: one with what comes before the extra 'A' and one with what comes after it.
Then replace the match with the two groups:
\1\2
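As for the error itself: the second argument of re.sub is a replacement template, not a pattern, so regex classes such as \s are not valid there, and Python 3.7 and later reject them with "bad escape". Only group references such as \1 belong in the replacement. A minimal sketch of the pattern and replacement above, applied to two lines from the question (the function name strip_extra_a is hypothetical):

```python
import re

# Group 1 captures everything up to the residue name; group 2 captures the
# three characters that follow the extra 'A'.
PATTERN = re.compile(r'(^ATOM\s+\d+\s+CA\s+)A(\w\w\w)', flags=re.M)


def strip_extra_a(text):
    """Remove the extra leading 'A' from four-letter residue names."""
    # The replacement only refers back to the two groups; it contains no \s.
    return PATTERN.sub(r'\1\2', text)


sample = (
    "ATOM 109 CA ASER A 48 10.832 19.066 -2.324 0.50 61.96 C\n"
    "ATOM 200 CA ALA A 56 0.334 20.465 15.090 1.00 31.84 C\n"
)
print(strip_extra_a(sample))
```

In this sample, ASER becomes SER while the ordinary ALA line is left unchanged, because 'A' must be followed by three more word characters for the pattern to match. The returned string can then be written to the output file.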
Upvotes: 0