Reputation: 29
I have a file with about 16,000 lines in it. All of them have the same format. Here is a simple example of it:
ATOM 139 C1 DPPC 18 17.250 58.420 10.850 1.00 0.00
<...>
ATOM 189 C1 DPPC 19 23.050 20.800 11.000 1.00 0.00
I need to check whether the lines that contain the string DPPC and have identifier 18 form a 50-line block before the identifier switches to 19, and so on.
So for now, I have the following code:
cnt = 0
with open('test_file.pdb') as f1:
    with open('out','a') as f2:
        lines = f1.readlines()
        for i, line in enumerate(lines):
            if "DPPC" in line:
                A = line.strip()[22:26]
                if A[i] == A[i+1]:
                    cnt = cnt + 1
                elif A[i] != A[i+1]:
                    cnt = 0
And here I am stuck. I found some examples of how to compare subsequent lines, but a similar approach did not work here. I still cannot figure out how to compare the value of A in line[i] with the value of A in line[i+1].
Upvotes: 2
Views: 124
Reputation: 123463
Since your data appears to be fixed-width fields in fixed-width records, you can use the struct module to quickly break each line up into individual fields.
Parsing all the fields of each line may be overkill when you only need to process one of them, but I'm doing it the way shown to illustrate how it's done in case you need to do other processing, and using the struct module makes it relatively fast in any case.
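To make the idea concrete before the full solution, here is a minimal sketch of struct splitting a single fixed-width record. The 12-byte layout and the sample record are made up for illustration and are not the real column widths of your file.

```python
import struct

# Made-up layout for illustration only: 4-char name, 2 pad bytes,
# 3-char serial number, 1 pad byte, 2-char identifier.
layout = struct.Struct("4s 2x 3s 1x 2s")

# struct operates on bytes, so the record is a bytes literal.
record = b"ATOM  139 18"

# Pad bytes ('x') are skipped; only the 's' fields are returned.
fields = [f.decode("ascii") for f in layout.unpack_from(record)]
print(fields)  # ['ATOM', '139', '18']
```

Note that the padding entries do not appear in the result, which is why the index of a field in the returned tuple counts only the non-padding fields.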
Let's say the input file consisted of only the following lines of data:
ATOM 139 C1 DPPC 18 17.250 58.420 10.850 1.00 0.00
ATOM 139 C1 DPPC 18 17.250 58.420 10.850 1.00 0.00
ATOM 139 C1 DPPC 18 17.250 58.420 10.850 1.00 0.00
ATOM 139 C1 DPPC 18 17.250 58.420 10.850 1.00 0.00
ATOM 139 C1 DPPC 18 17.250 58.420 10.850 1.00 0.00
ATOM 139 C1 DPPC 18 17.250 58.420 10.850 1.00 0.00
ATOM 139 C1 DPPC 18 17.250 58.420 10.850 1.00 0.00
ATOM 139 C1 DPPC 18 17.250 58.420 10.850 1.00 0.00
ATOM 139 C1 DPPC 18 17.250 58.420 10.850 1.00 0.00
ATOM 189 C1 DPPC 19 23.050 20.800 11.000 1.00 0.00
ATOM 189 C1 DPPC 19 23.050 20.800 11.000 1.00 0.00
ATOM 189 C1 DPPC 19 23.050 20.800 11.000 1.00 0.00
ATOM 189 C1 DPPC 19 23.050 20.800 11.000 1.00 0.00
ATOM 189 C1 DPPC 19 23.050 20.800 11.000 1.00 0.00
ATOM 189 C1 DPPC 19 23.050 20.800 11.000 1.00 0.00
ATOM 189 C1 DPPC 19 23.050 20.800 11.000 1.00 0.00
ATOM 189 C1 DPPC 20 23.050 20.800 11.000 1.00 0.00
ATOM 189 C1 DPPC 20 23.050 20.800 11.000 1.00 0.00
ATOM 189 C1 DPPC 20 23.050 20.800 11.000 1.00 0.00
All you need to do is remember what the value of the field was on the previous line, so that it can be compared with the current one. To start the process, the first line has to be read and parsed separately, so there's a prev value to compare with on subsequent lines. Also note that the 5th field is the one indexed by [4] because the first starts at [0].
import struct
# negative widths represent ignored padding fields
fieldwidths = 4, -4, 3, -2, 2, -2, 4, -3, 2, -6, 6, -2, 6, -2, 6, -2, 4, -2, 4
fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                     for fw in fieldwidths)
fieldstruct = struct.Struct(fmtstring)
parse = fieldstruct.unpack_from  # a function to split line up into fields
# struct works on bytes, so the file is opened in binary mode (needed on Python 3)
with open('test_file.pdb', 'rb') as f1:
    prev = parse(next(f1))[4].decode()  # remember value of fifth field
    cnt = 1
    for line in f1:
        curr = parse(line)[4].decode()  # get value of fifth field
        if curr == prev:  # same as last one?
            cnt += 1
        else:
            print('{} occurred {} times'.format(prev, cnt))
            prev = curr
            cnt = 1
    print('{} occurred {} times'.format(prev, cnt))  # for last line
Output:
18 occurred 9 times
19 occurred 7 times
20 occurred 3 times
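If you don't need the other fields, the same "compare with the previous line" logic can also be sketched with itertools.groupby, which groups consecutive items sharing a key. This is only a sketch: the helper name count_runs is made up, and it assumes the columns are separated by whitespace, which may not hold for every fixed-width file (adjacent columns can run together).

```python
from itertools import groupby


def count_runs(lines, column=4):
    """Yield (identifier, run_length) for each run of consecutive lines
    that share the same value in the given whitespace-separated column."""
    # Skip blank lines and split the rest into fields.
    rows = (line.split() for line in lines if line.strip())
    for ident, group in groupby(rows, key=lambda fields: fields[column]):
        # Consume the group to count how many lines it holds.
        yield ident, sum(1 for _ in group)


# Shortened sample in the same shape as the data above.
sample = (["ATOM 139 C1 DPPC 18 17.250 58.420 10.850 1.00 0.00"] * 3
          + ["ATOM 189 C1 DPPC 19 23.050 20.800 11.000 1.00 0.00"] * 2)
runs = list(count_runs(sample))
print(runs)  # [('18', 3), ('19', 2)]
```

Because groupby only merges neighbouring lines, an identifier that reappears later in the file would be reported as a separate run rather than added to the earlier one.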
Upvotes: 1
Reputation: 1557
You can also easily solve this with a parallel list:
data = []
with open('data.txt', 'r') as datafile:
    for line in datafile:
        line = line.strip()
        if line:
            data.append(line)
keywordList = []
for line in data:
    line = line.split()
    if line[4] not in keywordList:
        keywordList.append(line[4])
counterList = []
for item in keywordList:
    counter = 0
    for line in data:
        line = line.split()
        if line[4] == item:
            counter += 1
    counterList.append(counter)
for i in range(len(keywordList)):
    print("%s: %d" % (keywordList[i], counterList[i]))
But if you are familiar with dict, I'd go with Lutz's answer.
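For comparison, a dict-based version of the same counting could look roughly like the sketch below. It is not taken from any other answer; the function name count_totals is made up, and it assumes whitespace-separated columns. Like the parallel lists, it counts the total number of lines per identifier, not consecutive blocks.

```python
def count_totals(lines, column=4):
    """Return a dict mapping each identifier to its total number of lines."""
    counts = {}
    for line in lines:
        fields = line.split()
        if not fields:  # skip blank lines
            continue
        key = fields[column]
        counts[key] = counts.get(key, 0) + 1
    return counts


# Identifier 18 appears in two separate places; the totals are merged.
sample = [
    "ATOM 139 C1 DPPC 18 17.250 58.420 10.850 1.00 0.00",
    "ATOM 139 C1 DPPC 18 17.250 58.420 10.850 1.00 0.00",
    "ATOM 189 C1 DPPC 19 23.050 20.800 11.000 1.00 0.00",
    "ATOM 139 C1 DPPC 18 17.250 58.420 10.850 1.00 0.00",
]
totals = count_totals(sample)
print(totals)  # {'18': 3, '19': 1}
```

This does a single pass over the data instead of one pass per identifier, but it cannot tell you whether the lines for one identifier were contiguous.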
Upvotes: 0
Reputation:
Try this (explanations in the comments).
data = """ATOM 139 C1 DPPC 18 17.250 58.420 10.850 1.00 0.00
ATOM 139 C1 DPPC 18 17.250 58.420 10.850 1.00 0.00
ATOM 139 C1 DPPC 18 17.250 58.420 10.850 1.00 0.00
ATOM 139 C1 DPPC 18 17.250 58.420 10.850 1.00 0.00
ATOM 139 C1 DPPC 18 17.250 58.420 10.850 1.00 0.00
ATOM 139 C1 DPPC 18 17.250 58.420 10.850 1.00 0.00
ATOM 139 C1 DPPC 18 17.250 58.420 10.850 1.00 0.00
ATOM 139 C1 DPPC 18 17.250 58.420 10.850 1.00 0.00
ATOM 139 C1 DPPC 18 17.250 58.420 10.850 1.00 0.00
ATOM 189 C1 DPPC 19 23.050 20.800 11.000 1.00 0.00
ATOM 189 C1 DPPC 19 23.050 20.800 11.000 1.00 0.00
ATOM 189 C1 DPPC 19 23.050 20.800 11.000 1.00 0.00
ATOM 189 C1 DPPC 19 23.050 20.800 11.000 1.00 0.00
ATOM 189 C1 DPPC 19 23.050 20.800 11.000 1.00 0.00
ATOM 189 C1 DPPC 19 23.050 20.800 11.000 1.00 0.00
ATOM 189 C1 DPPC 19 23.050 20.800 11.000 1.00 0.00
ATOM 189 C1 DPPC 19 23.050 20.800 11.000 1.00 0.00
ATOM 189 C1 DPPC 19 23.050 20.800 11.000 1.00 0.00
ATOM 189 C1 DPPC 19 23.050 20.800 11.000 1.00 0.00"""
# The last code seen in the 5th column.
code = None
# The count of lines of the current code.
count = 0
for line in data.split("\n"):
    # Get the 5th column.
    c = line.split()[4]
    # The code in the 5th column changed.
    if c != code:
        # If we aren't at the start of the file, print the count
        # for the code that just ended.
        if code:
            print("{}: {}".format(code, count))
        # Remember the new code and restart the count.
        code = c
        count = 0
    # Count the line.
    count = count + 1
# Print the count for the last code.
print("{}: {}".format(code, count))
Output:
18: 9
19: 10
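Since the question is really whether every block is exactly 50 lines long, the same loop can be turned into a check that reports only the blocks with an unexpected length. This is a sketch under a few assumptions: the function name find_bad_blocks and its parameters are made up, columns are split on whitespace, and lines that don't contain the residue name are simply skipped.

```python
def find_bad_blocks(lines, expected=50, residue="DPPC", column=4):
    """Return (identifier, length) for every consecutive block of
    residue lines whose length differs from the expected one."""
    bad = []
    code = None   # identifier of the block being counted
    count = 0     # number of lines seen in that block
    for line in lines:
        if residue not in line:
            continue  # ignore lines for other residues and blank lines
        c = line.split()[column]
        if c != code:
            # A block just ended; record it if its length is wrong.
            if code is not None and count != expected:
                bad.append((code, count))
            code = c
            count = 0
        count += 1
    # Check the final block as well.
    if code is not None and count != expected:
        bad.append((code, count))
    return bad


# Small demonstration with an expected block length of 3 instead of 50.
template = "ATOM 1 C1 DPPC {} 0.000 0.000 0.000 1.00 0.00"
sample = [template.format(i) for i in (18, 18, 18, 19, 19, 20, 20, 20)]
bad_blocks = find_bad_blocks(sample, expected=3)
print(bad_blocks)  # [('19', 2)]
```

An empty result means every block had the expected length.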
Upvotes: 2