Reputation: 43
NOTE: The solution needs to be something I can embed in python.
I have a file with 800,000+ lines. The lines are grouped. Each group begins with a row that starts with "IMAGE", followed by one row that starts with "HISTO" and then at least one, but usually several, rows that start with "FRAG".
I need to:
1. Delete/discard any row that starts with "HISTO".
2. For each "FRAG" line I need to join it with the previous "IMAGE" row.
Here is an example.
IMAGE ...data1...
HISTO usually numbers 0 0 1 1 0 1 0
FRAG ...data1...
FRAG ...data2...
IMAGE ...data2...
HISTO usually numbers 0 0 1 1 0 1 0
FRAG ...data1...
FRAG ...data2...
FRAG ...data3...
FRAG ...data4...
The result needs to look like this:
IMAGE ...data1... FRAG ...data1...
IMAGE ...data1... FRAG ...data2...
IMAGE ...data2... FRAG ...data1...
IMAGE ...data2... FRAG ...data2...
IMAGE ...data2... FRAG ...data3...
IMAGE ...data2... FRAG ...data4...
It is possible to have many FRAG lines before it starts over with an IMAGE line.
This is based on a previous question but now I need to use python for some consistency. Here is the code I was using that works.
> sed 's/>//' Input.txt|awk '/^IMAGE/{a=$0;next;} /^FRAG/{print ">"a,$0}'
Credit to AwkMan for the previous solution.
Upvotes: 0
Views: 110
Reputation: 2562
Try this solution:
with open('in.txt', 'r') as fin, open('out.txt', 'w') as fout:
    for line in fin:
        if line.startswith('HISTO'):
            continue
        elif line.startswith('IMAGE'):
            prefix = line.strip()
        elif line.startswith('FRAG'):
            fout.write(prefix + ' ' + line)
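If you want the same logic in a reusable form, here is a minimal sketch as a generator. The function name join_frag_lines and the choice to raise an error when a FRAG line appears before any IMAGE line are my assumptions, not something the question specifies; adjust as needed.

```python
def join_frag_lines(lines):
    """Yield 'IMAGE ... FRAG ...' strings from an iterable of raw lines."""
    prefix = None  # most recent IMAGE line, without its trailing newline
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("IMAGE"):
            prefix = line
        elif line.startswith("FRAG"):
            if prefix is None:
                # Assumption: a FRAG before any IMAGE is treated as bad input.
                raise ValueError("FRAG line found before any IMAGE line")
            yield prefix + " " + line
        # HISTO lines (and anything else) are silently dropped.


sample = [
    "IMAGE ...data1...\n",
    "HISTO usually numbers 0 0 1 1 0 1 0\n",
    "FRAG ...data1...\n",
    "FRAG ...data2...\n",
]
for joined in join_frag_lines(sample):
    print(joined)
```

Because it accepts any iterable, you can pass an open file object directly, so the 800,000+ lines are processed one at a time rather than loaded into memory.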
Also, if you already have a working command line, such as John1024's awk command, you can run it from Python with subprocess:
import subprocess

# awk reads input.txt itself, so only the output file needs to be opened here.
with open('out.txt', 'w') as fout:
    subprocess.run(["awk", "/^IMAGE/{img=$0;next} /^HISTO/{next} {print img,substr($0,1)}", "input.txt"], stdout=fout)
Upvotes: 0
Reputation: 113824
with open('Input.txt') as f:
    for line in f:
        line = line.rstrip()
        if line.startswith('>IMAGE'):
            img = line
            continue
        if line.startswith('>HIST'):
            continue
        print('%s %s' % (img, line.lstrip('>')))
This produces:
>IMAGE ...data1... FRAG ...data1...
>IMAGE ...data1... FRAG ...data2...
>IMAGE ...data2... FRAG ...data1...
>IMAGE ...data2... FRAG ...data2...
>IMAGE ...data2... FRAG ...data3...
>IMAGE ...data2... FRAG ...data4...
Try:
awk '/^>IMAGE/{img=$0;next} /^>HISTO/{next} {print img,substr($0,2)}' Input.txt
With this as the input file:
$ cat Input.txt
>IMAGE ...data1...
>HISTO usually numbers 0 0 1 1 0 1 0
>FRAG ...data1...
>FRAG ...data2...
>IMAGE ...data2...
>HISTO usually numbers 0 0 1 1 0 1 0
>FRAG ...data1...
>FRAG ...data2...
>FRAG ...data3...
>FRAG ...data4...
Our code produces:
$ awk '/^>IMAGE/{img=$0;next} /^>HISTO/{next} {print img,substr($0,2)}' Input.txt
>IMAGE ...data1... FRAG ...data1...
>IMAGE ...data1... FRAG ...data2...
>IMAGE ...data2... FRAG ...data1...
>IMAGE ...data2... FRAG ...data2...
>IMAGE ...data2... FRAG ...data3...
>IMAGE ...data2... FRAG ...data4...
Awk implicitly reads through a file line by line. We save the IMAGE line in the variable img and print out FRAG lines as they occur.
In more detail:
/^>IMAGE/{img=$0;next}
For any line that begins with >IMAGE, we save the line in the variable img and then skip the rest of the commands and jump to start over on the next line.
/^>HISTO/{next}
For any line that begins with >HISTO, we skip the rest of the commands and jump to start over on the next line.
print img,substr($0,2)
For all other lines, we print img followed by the current line minus its first character (which is > in the sample input).
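For comparison, the three awk rules can be mirrored one-to-one in Python. This is only a sketch: the function name awk_like is made up for illustration, and it collects results in a list for easy inspection rather than printing them.

```python
def awk_like(lines):
    """Apply the equivalent of the three awk rules to an iterable of lines."""
    img = ""  # awk treats an unset variable as an empty string
    out = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(">IMAGE"):      # /^>IMAGE/{img=$0;next}
            img = line
            continue
        if line.startswith(">HISTO"):      # /^>HISTO/{next}
            continue
        out.append(img + " " + line[1:])   # {print img,substr($0,2)}
    return out


sample = [
    ">IMAGE ...data1...\n",
    ">HISTO usually numbers 0 0 1 1 0 1 0\n",
    ">FRAG ...data1...\n",
    ">FRAG ...data2...\n",
]
for row in awk_like(sample):
    print(row)
```

Note that line[1:] drops exactly one leading character, like substr($0,2), whereas lstrip('>') would remove every leading > on the line.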
Upvotes: 1