Reputation: 149
I want to extract the amino acid abbreviations from a piece of .pdb-format data, giving a list like ['GLU','PHE',...,'ASN']:
ATOM 296 OE2 GLU A 43 18.414 12.323 8.758 1.00 32.23 O
ATOM 297 N PHE A 50 18.072 10.668 14.644 1.00 34.68 N
ATOM 298 CA PHE A 50 18.038 10.228 16.039 1.00 35.61 C
ATOM 299 C PHE A 50 18.501 11.321 17.019 1.00 35.86 C
ATOM 300 O PHE A 50 18.018 11.413 18.091 1.00 36.21 O
ATOM 301 CB PHE A 50 18.844 8.936 16.226 1.00 35.43 C
ATOM 302 CG PHE A 50 18.811 8.386 17.623 1.00 37.33 C
ATOM 303 CD1 PHE A 50 17.924 7.416 17.982 1.00 36.31 C
ATOM 304 CD2 PHE A 50 19.659 8.840 18.557 1.00 39.84 C
ATOM 305 CE1 PHE A 50 17.875 6.922 19.220 1.00 37.80 C
ATOM 306 CE2 PHE A 50 19.591 8.330 19.833 1.00 40.97 C
ATOM 307 CZ PHE A 50 18.709 7.368 20.144 1.00 37.91 C
ATOM 308 N ASN A 51 19.462 12.125 16.616 1.00 36.20 N ...
I used this line in my Python script:
residue=re.compile(r"(?<=ATOM...............)+?(?=..............\.)").findall(fpdb)
hoping to extract the target strings by matching on the text before and after them, based on the format of the file. But I only get an empty list, so I'm confused and would really appreciate some help. Thanks!
Upvotes: 0
Views: 44
Reputation: 12015
Assuming there are no missing cell values and every record has 12 whitespace-separated columns, you can extract the column at index 3 (counting from 0) by splitting on whitespace and taking every 12th token:
import re
re.split(r'\s+', fpdb)[3::12]
# ['GLU', 'PHE', 'PHE', 'PHE', 'PHE', 'PHE', 'PHE', 'PHE', 'PHE', 'PHE', 'PHE', 'PHE', 'ASN']
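If the input is an actual PDB file with its original spacing, slicing by column may be more robust than splitting on whitespace, because the PDB format is fixed-width and the residue name occupies columns 18-20 (line[17:20] in Python). A minimal sketch under that assumption; the helper name residue_names and the reconstructed sample line are only for illustration, and this will not work on text whose runs of spaces have been collapsed:

```python
def residue_names(pdb_text):
    """Return the residue name of every ATOM record, one entry per atom."""
    names = []
    for line in pdb_text.splitlines():
        # Fixed-width layout: record type in columns 1-6, residue name in 18-20.
        if line.startswith("ATOM"):
            names.append(line[17:20].strip())
    return names


# Build one correctly spaced record instead of typing the spaces by hand.
# (Illustrative only; the values come from the first line in the question.)
sample = (
    "{:<6}{:>5} {:<4}{:1}{:>3} {:1}{:>4}{:1}   "
    "{:>8.3f}{:>8.3f}{:>8.3f}{:>6.2f}{:>6.2f}          {:>2}"
).format("ATOM", 296, " OE2", "", "GLU", "A", 43, "",
         18.414, 12.323, 8.758, 1.00, 32.23, "O")

print(residue_names(sample))
# ['GLU']
```

Unlike the split-based version, this does not depend on every record having the same number of fields, so blank optional columns or adjacent numbers that run together should not shift the result.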
Upvotes: 0
Reputation: 82765
Using str.split()
Ex:
s = """ATOM 296 OE2 GLU A 43 18.414 12.323 8.758 1.00 32.23 O
ATOM 297 N PHE A 50 18.072 10.668 14.644 1.00 34.68 N
ATOM 298 CA PHE A 50 18.038 10.228 16.039 1.00 35.61 C
ATOM 299 C PHE A 50 18.501 11.321 17.019 1.00 35.86 C
ATOM 300 O PHE A 50 18.018 11.413 18.091 1.00 36.21 O
ATOM 301 CB PHE A 50 18.844 8.936 16.226 1.00 35.43 C
ATOM 302 CG PHE A 50 18.811 8.386 17.623 1.00 37.33 C
ATOM 303 CD1 PHE A 50 17.924 7.416 17.982 1.00 36.31 C
ATOM 304 CD2 PHE A 50 19.659 8.840 18.557 1.00 39.84 C
ATOM 305 CE1 PHE A 50 17.875 6.922 19.220 1.00 37.80 C
ATOM 306 CE2 PHE A 50 19.591 8.330 19.833 1.00 40.97 C
ATOM 307 CZ PHE A 50 18.709 7.368 20.144 1.00 37.91 C
ATOM 308 N ASN A 51 19.462 12.125 16.616 1.00 36.20 N"""
for line in s.splitlines():
    print(line.split()[3])  # field 3 (0-based) is the residue name
Output:
GLU
PHE
PHE
PHE
PHE
PHE
PHE
PHE
PHE
PHE
PHE
PHE
ASN
Using a list comprehension.
Ex:
data = [i.split()[3] for i in s.split("\n")]
print(data)
#['GLU', 'PHE', 'PHE', 'PHE', 'PHE', 'PHE', 'PHE', 'PHE', 'PHE', 'PHE', 'PHE', 'PHE', 'ASN']
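If the goal is one entry per residue rather than one per atom (an assumption on my part, since the expected output in the question can be read either way), consecutive lines that share the same residue name, chain, and residue number can be collapsed with itertools.groupby. A rough sketch, using a shortened copy of the sample so that it runs on its own:

```python
from itertools import groupby

sample = """ATOM 296 OE2 GLU A 43 18.414 12.323 8.758 1.00 32.23 O
ATOM 297 N PHE A 50 18.072 10.668 14.644 1.00 34.68 N
ATOM 298 CA PHE A 50 18.038 10.228 16.039 1.00 35.61 C
ATOM 308 N ASN A 51 19.462 12.125 16.616 1.00 36.20 N"""

# Split each ATOM line into whitespace-separated fields.
rows = [line.split() for line in sample.splitlines() if line.startswith("ATOM")]

# Consecutive atoms sharing (residue name, chain, residue number) form one residue.
residues = [key[0] for key, _ in groupby(rows, key=lambda f: (f[3], f[4], f[5]))]
print(residues)
# ['GLU', 'PHE', 'ASN']
```

groupby only merges adjacent rows, which fits here because the atoms of a residue are listed together in the file.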
Using regex:
import re
print(re.findall(r"ATOM\s+\d+\s+\w+\s+([A-Z]+)", s))
#['GLU', 'PHE', 'PHE', 'PHE', 'PHE', 'PHE', 'PHE', 'PHE', 'PHE', 'PHE', 'PHE', 'PHE', 'ASN']
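A slightly stricter variant of the regex approach, in case it helps: anchoring with re.MULTILINE means "ATOM" is only matched at the start of a line, and using \S+ for the atom name also accepts names with characters outside \w (a prime, for example). This is only a sketch with the same whitespace-separated assumption as above, run on a shortened copy of the sample:

```python
import re

sample = """ATOM 296 OE2 GLU A 43 18.414 12.323 8.758 1.00 32.23 O
ATOM 297 N PHE A 50 18.072 10.668 14.644 1.00 34.68 N
ATOM 308 N ASN A 51 19.462 12.125 16.616 1.00 36.20 N"""

# ^ATOM    record type at the start of a line (re.MULTILINE)
# \d+      atom serial number
# \S+      atom name
# (\S+)    residue name, captured
pattern = re.compile(r"^ATOM\s+\d+\s+\S+\s+(\S+)", re.MULTILINE)
print(pattern.findall(sample))
# ['GLU', 'PHE', 'ASN']
```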
Upvotes: 1