Reputation: 43

Join and delete lines

NOTE: The solution needs to be something I can embed in Python.

I have a file with 800,000+ lines. The lines are grouped. Each group begins with a row that starts with "IMAGE", followed by one row that starts with "HISTO" and then at least one, but usually several, rows that start with "FRAG".

I need to:
1. Delete/discard any row that starts with "HISTO".
2. Join each "FRAG" line with the preceding "IMAGE" row. Here is an example:

IMAGE ...data1...  
HISTO usually numbers 0 0 1 1 0 1 0  
FRAG ...data1...  
FRAG ...data2...  
IMAGE ...data2...  
HISTO usually numbers 0 0 1 1 0 1 0   
FRAG ...data1...  
FRAG ...data2...  
FRAG ...data3...  
FRAG ...data4...

The result needs to look like this:

IMAGE ...data1... FRAG ...data1...  
IMAGE ...data1... FRAG ...data2...  
IMAGE ...data2... FRAG ...data1...  
IMAGE ...data2... FRAG ...data2...  
IMAGE ...data2... FRAG ...data3...  
IMAGE ...data2... FRAG ...data4...

There can be many FRAG lines before the next IMAGE line starts a new group.

This is based on a previous question, but I now need to use Python for consistency. Here is the working command I was using:

> sed 's/>//' Input.txt|awk '/^IMAGE/{a=$0;next;} /^FRAG/{print ">"a,$0}'

Credit to AwkMan for the previous solution.

Upvotes: 0

Answers (2)

chapelo

Reputation: 2562

Try this solution:

with open('in.txt', 'r') as fin, open('out.txt', 'w') as fout:
    for line in fin:
        if line.startswith('HISTO'):
            continue  # discard HISTO rows
        elif line.startswith('IMAGE'):
            prefix = line.strip()  # remember the current IMAGE row without its newline
        elif line.startswith('FRAG'):
            fout.write(prefix + ' ' + line)  # line keeps its own trailing newline
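
The same loop can be wrapped in a generator so that it can be embedded in a larger Python program and checked without touching the disk. This is a minimal sketch, not part of the original answer: the name `join_frags` is hypothetical, and skipping a FRAG row that appears before any IMAGE row is an assumption about how malformed input should be handled.

```python
import io


def join_frags(lines):
    """Yield 'IMAGE ... FRAG ...' rows from an iterable of input lines.

    Hypothetical helper: HISTO rows are dropped, and a FRAG row seen
    before any IMAGE row is skipped (an assumption, not from the question).
    """
    prefix = None
    for line in lines:
        if line.startswith('IMAGE'):
            prefix = line.strip()
        elif line.startswith('FRAG') and prefix is not None:
            yield prefix + ' ' + line.strip()
        # HISTO rows and anything else fall through and are ignored.


# Small in-memory sample standing in for the real file.
sample = io.StringIO(
    "IMAGE a\n"
    "HISTO 0 0 1\n"
    "FRAG x\n"
    "FRAG y\n"
    "IMAGE b\n"
    "HISTO 1 0 1\n"
    "FRAG z\n"
)
for row in join_frags(sample):
    print(row)
```

Because the generator yields one row at a time, passing an open file object instead of `sample` should process an 800,000-line file without loading it all into memory.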

Also consider that if you already have a working command line, such as John1024's awk command, you can run it from Python with subprocess:

import subprocess

# awk opens input.txt itself; check=True raises if awk exits with an error.
awk_program = "/^IMAGE/{img=$0;next} /^HISTO/{next} {print img,$0}"
with open('out.txt', 'w') as fout:
    subprocess.run(["awk", awk_program, "input.txt"], stdout=fout, check=True)

Upvotes: 0

John1024

Reputation: 113824

Python solution

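with open('Input.txt') as f:
    for line in f:
        line = line.rstrip()
        if line.startswith('>IMAGE'):
            img = line  # remember the current IMAGE row
            continue
        if line.startswith('>HISTO'):
            continue  # discard HISTO rows
        # Any other row (the FRAG rows) is printed after the saved IMAGE row.
        print('%s %s' % (img, line.lstrip('>')))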

This produces:

>IMAGE ...data1... FRAG ...data1...
>IMAGE ...data1... FRAG ...data2...
>IMAGE ...data2... FRAG ...data1...
>IMAGE ...data2... FRAG ...data2...
>IMAGE ...data2... FRAG ...data3...
>IMAGE ...data2... FRAG ...data4...

Awk solution

Try:

awk '/^>IMAGE/{img=$0;next} /^>HISTO/{next} {print img,substr($0,2)}' Input.txt

Example

With this as the input file:

$ cat Input.txt 
>IMAGE ...data1...
>HISTO usually numbers 0 0 1 1 0 1 0
>FRAG ...data1...
>FRAG ...data2...
>IMAGE ...data2...
>HISTO usually numbers 0 0 1 1 0 1 0
>FRAG ...data1...
>FRAG ...data2...
>FRAG ...data3...
>FRAG ...data4...

Our code produces:

$ awk '/^>IMAGE/{img=$0;next} /^>HISTO/{next} {print img,substr($0,2)}' Input.txt
>IMAGE ...data1... FRAG ...data1...
>IMAGE ...data1... FRAG ...data2...
>IMAGE ...data2... FRAG ...data1...
>IMAGE ...data2... FRAG ...data2...
>IMAGE ...data2... FRAG ...data3...
>IMAGE ...data2... FRAG ...data4...

How it works

Awk implicitly reads through a file line by line. We save the IMAGE line in the variable img and print out FRAG lines as they occur.

In more detail:

/^>IMAGE/{img=$0;next}

For any line that begins with >IMAGE, we save the line in the variable img, then skip the remaining commands and start over with the next line.

/^>HISTO/{next}

For any line that begins with >HISTO, we skip the remaining commands and start over with the next line.

{print img,substr($0,2)}

For all other lines, we print img followed by the current line minus its first character (which is > in the sample input).
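
One difference between the awk and Python versions above is worth a quick check: awk's `substr($0,2)` always drops exactly one character, while Python's `lstrip('>')` drops every leading `>` and nothing else. The sketch below compares the two. `strip_marker` is a hypothetical helper, and treating the `>` marker as optional is an assumption based on the question's sample, which has no `>`.

```python
def strip_marker(line):
    """Remove exactly one leading '>' if present (hypothetical helper)."""
    return line[1:] if line.startswith('>') else line


# Slicing mirrors awk's substr($0,2): the first character always goes.
print('>FRAG x'[1:])             # FRAG x
print('FRAG x'[1:])              # RAG x  (no marker, so the F is lost)

# lstrip('>') removes every leading '>' and leaves other text alone.
print('>>FRAG x'.lstrip('>'))    # FRAG x
print('FRAG x'.lstrip('>'))      # FRAG x

# strip_marker removes at most one marker.
print(strip_marker('>>FRAG x'))  # >FRAG x
print(strip_marker('FRAG x'))    # FRAG x
```

If the input may or may not carry the `>` marker, a check like `strip_marker` avoids silently eating the first letter of an unmarked line.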

Upvotes: 1