Reputation: 3173
I have a data file with many lines, for example:
line1
line2
line3
line4
line5
line6
No D
line7
line8
line9
line10
line11
line12
...
Whenever the program sees a line that is 'No D', I want it to write the 4 lines before the 'No D' line, the 'No D' line itself, and the 2 lines after it to a file called "NoDregion.txt", and write what's left to a file called "WithDregion.txt".
My code:
lines = open("file.txt", "r").read().splitlines()
manualintervf = open("NoDregion.txt", "w")
goodlines = open("withDregion.txt", "w")
for i, line in enumerate(lines):
    if ">No D" in line:
        manualintervf.write(lines[i-4]+"\n"+lines[i-3]+"\n"+lines[i-2]+"\n"+lines[i-1]+"\n"+lines[i]+"\n"+lines[i+1]+"\n"+lines[i+2]+"\n")
    else:
        goodlines.write(line+"\n")
The "NoDregion.txt" output (no problem with this):
line3
line4
line5
line6
No D
line7
line8
The "withDregion.txt" output (not the desired output: it only removes the 'No D' line and keeps everything else, including the 4 lines before the 'No D' line and the 2 lines after it):
line1
line2
line3
line4
line5
line6
line7
line8
line9
line10
line11
line12
...
The desired output for "WithDregion.txt" would be:
line1
line2
line9
line10
line11
line12
...
I am not sure how to write it so that once lines are written to "NoDregion.txt" they are removed and are not also written to "withDregion.txt".
Upvotes: 0
Views: 208
Reputation: 104032
Since you are reading the entire file into memory, you may as well use a regex on the string from the file.
Try:
import re
with open(fn) as f:  # fn is the path to your input file
    txt = f.read()
txt = re.sub(r'(?:^.*\s){4}^No D.*$\s(?:^.*\s){2}', '', txt, flags=re.M)
# now just write txt to the output file...
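Or you can find each 'No D' entry in the list of file lines and slice its region out, like so:
with open(fn) as f:
    lines = f.readlines()
while True:
    try:
        idx = lines.index('No D\n')
        # drop the 4 lines before, the 'No D' line, and the 2 lines after;
        # max() keeps the start index from going negative near the top of the file
        lines = lines[:max(idx-4, 0)] + lines[idx+3:]
    except ValueError:
        # no 'No D' lines left
        txt = ''.join(lines)
        break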
# again, just write txt out to the output file....
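If you also need the removed regions for the other file, the same idea can produce both outputs. This is only a sketch: the function name split_with_regex and the compiled PATTERN are names I made up for illustration, and it assumes every line ends with a newline and that each 'No D' line has at least 4 lines before it and 2 after it.

```python
import re

# Hypothetical pattern mirroring the regex above:
# 4 full lines, the 'No D' line, then 2 full lines.
PATTERN = re.compile(r'(?:^.*\n){4}^No D.*\n(?:^.*\n){2}', flags=re.M)

def split_with_regex(txt):
    """Return (removed, kept): the matched regions and the remaining text."""
    # With no capturing groups, findall returns the full text of each match.
    removed = "".join(PATTERN.findall(txt))
    kept = PATTERN.sub("", txt)
    return removed, kept
```

You would then write removed to the NoDregion file and kept to the withDregion file.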
Upvotes: 0
Reputation: 5115
Use this instead of your if/else:
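for i, line in enumerate(lines):
    if ">No D" in line:
        manualintervf.write(lines[i-4]+"\n"+lines[i-3]+"\n"+lines[i-2]+"\n"+lines[i-1]+"\n"+lines[i]+"\n"+lines[i+1]+"\n"+lines[i+2]+"\n")
        # keep everything outside the region; assumes a single "No D" line
        # that is at least 4 lines into the file
        goodlines.write("\n".join(lines[:i-4] + lines[i+3:]))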
This uses the join method for strings. Try "separator".join(str(i) for i in range(10)) to get a feel for how it works. Basically, it takes a list (or any iterable) of strings and joins it together with separator (or whatever string you want) in between each element.
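A quick demonstration of join, just to illustrate the behavior (the variable names here are made up for the example):

```python
# join puts the separator between elements, not after the last one.
digits = "-".join(str(i) for i in range(10))
print(digits)

# Joining lines with "\n" therefore leaves no trailing newline.
joined = "\n".join(["line1", "line2", "line9"])
print(repr(joined))
```

Note that the elements must already be strings, which is why the first example converts each number with str().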
Upvotes: 0
Reputation: 1004
The reason you're seeing the behavior you are is that the lines that are near "No D" do not themselves contain "No D", so they're caught by your "else" clause.
The most straightforward way to do this is with two passes through the lines. Rather than immediately writing each line to one file or the other, first flag every line that falls inside a "No D" region (the window around each "No D" line), then write the flagged lines to NoDregion.txt and the unflagged lines to withDregion.txt. Something like this:
lines = open("S011_PCR_ABCDEF-AnnotatedVDJAlignments.fasta", "r").read().splitlines()
manualintervf = open("NoDregion.txt", "w")
goodlines = open("withDregion.txt", "w")
d_before = 4
d_after = 3  # exclusive end offset: flags the "No D" line plus the 2 lines after it
d_flag = [0] * len(lines)
# first pass: flag every line inside a "No D" region
for i, line in enumerate(lines):
    if "No D" in line:
        dstart = max(i-d_before, 0)
        dend = min(i+d_after, len(lines))
        for j in range(dstart, dend):
            d_flag[j] = 1
# second pass: route each line to the right file
for i, line in enumerate(lines):
    if d_flag[i] == 1:
        manualintervf.write(line+"\n")
    else:
        goodlines.write(line+"\n")
Whatever method you use, you'll also want to make sure your code doesn't fail if "No D" is the first or last line of your file. If "No D" is the first line of the file, lines[i-4] through lines[i-1] become negative indices such as lines[-3], so you'll be writing lines from the end of the file to the NoDregion file, which is almost certainly not what you want. If it's the last line, lines[i+1] refers to a nonexistent line and raises an IndexError, which is absolutely not what you want.
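To make those edge cases easy to check, the flagging logic can be pulled into a small function that works on a list of lines and touches no files. This is a sketch only: split_regions and its parameter names are hypothetical, and here after counts the lines after the marker directly (2) rather than using an exclusive end offset.

```python
def split_regions(lines, marker="No D", before=4, after=2):
    """Return (region, rest): lines near a marker line, and all other lines."""
    in_region = [False] * len(lines)
    for i, line in enumerate(lines):
        if marker in line:
            # Clamp the window so it never runs off either end of the list.
            start = max(i - before, 0)
            stop = min(i + after + 1, len(lines))
            for j in range(start, stop):
                in_region[j] = True
    region = [l for l, flagged in zip(lines, in_region) if flagged]
    rest = [l for l, flagged in zip(lines, in_region) if not flagged]
    return region, rest
```

Calling split_regions(["No D", "a", "b", "c"]) puts the first three lines in region and leaves only "c" in rest, with no wrap-around to the end of the list.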
Upvotes: 1