Reputation: 1150

Shell script: How to partition a file into columns?

I have a file that looks like:

t1   ATGCGTCCGTAGCAG
t2   ATGCCTAGCTAGGCT

i.e. a name followed by a (DNA) sequence. I would like to partition the sequence. For example, the sequences above have length 15, and I'd like to partition each into 3 parts of length 5. I want three new files such that:

file1

t1   ATGCG
t2   ATGCC

file2

t1   TCCGT
t2   TAGCT

file3

t1   AGCAG
t2   AGGCT

I am trying to write a shell script to accomplish this. One way would be a for-loop that gets the Nth line of the file with sed "${N}q;d", cuts it with cut -c, and saves the result in a variable; then, using cut, head, tail and one more variable, I could build each output file. However, I am wondering whether there is a better way (in terms of neatness and speed) to do this.

PS: The actual files will contain 1-10 thousand lines, each sequence is 10-50k characters long, and I will partition the sequences into pieces of length 1-2k.

Upvotes: 3

Answers (3)

Shravan Yadav

Reputation: 1317

awk can help

awk '{for(i=1;i<=3;i++)print $1" "substr($2,5*(i-1)+1,5) >> ("file"i".txt")}' inputfilename

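Expanded for readability (the parentheses around the file name keep the redirection target unambiguous across awk implementations):

awk '{
        # chunk i starts at character 5*(i-1)+1 of the sequence and is 5 long
        for(i=1;i<=3;i++)
          print $1" "substr($2,5*(i-1)+1,5) >> ("file"i".txt")
     }' inputfilename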
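For the real data (chunks of 1-2k rather than 5), the same idea can be parameterized. This is only a sketch: the variable `w` (chunk width), the `part` file prefix, and `input.txt` are names chosen here for illustration, and the number of chunks is derived from each sequence's length instead of being hard-coded.

```shell
# Sample input taken from the question: name, whitespace, sequence.
printf 't1   ATGCGTCCGTAGCAG\nt2   ATGCCTAGCTAGGCT\n' > input.txt

# w is the chunk width (assumed name, set with -v).
# Chunk i of every line goes to part<i>.txt (hypothetical file names).
awk -v w=5 '{
    n = int((length($2) + w - 1) / w)      # number of chunks, rounded up
    for (i = 1; i <= n; i++) {
        out = "part" i ".txt"              # build the name first, then redirect
        print $1 "   " substr($2, (i - 1) * w + 1, w) > out
    }
}' input.txt
```

Within one awk run, `>` truncates each file only when it is first opened, so rerunning the command replaces the old output rather than appending to it.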

Upvotes: 1

Nir Alfasi

Reputation: 53525

The following uses Bash substring expansion (i.e. ${string:start:length}) to extract the requested output:

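#!/bin/bash
# Assumes the name column (name plus padding) is exactly 5 characters wide.
while IFS='' read -r line || [[ -n "$line" ]]; do
    echo "${line:0:10}" >> file1                # name + bases 1-5
    echo "${line:0:5}${line:10:5}" >> file2     # name + bases 6-10
    echo "${line:0:5}${line:15:5}" >> file3     # name + bases 11-15
done < "$1"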

Save it into myscript.sh and run it with: ./myscript.sh <input-file>

Upvotes: 2

Amadan

Reputation: 198314

One-liner solution, uses a single loop:

for i in $(seq 3); do cut -c1-5,$((i * 5 + 1))-$(((i + 1) * 5)) < source.txt > file$i.txt ; done

Adjust the calculation for your own widths (here 1-5 is the name column and each part is 5 characters wide). There is no need to process the file line by line in the shell; that would be very slow.
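To make "adjust the calculation" concrete, here is a sketch of the same cut loop with the widths pulled out into variables. The names `prefix`, `width` and `parts` are made up for this example, and it still assumes the name column has a fixed character width.

```shell
# Sample input taken from the question.
printf 't1   ATGCGTCCGTAGCAG\nt2   ATGCCTAGCTAGGCT\n' > source.txt

prefix=5   # characters taken by the name column, padding included (assumed fixed)
width=5    # length of each part
parts=3    # number of parts to produce

for i in $(seq "$parts"); do
    start=$((prefix + (i - 1) * width + 1))   # first character of part i
    end=$((prefix + i * width))               # last character of part i
    cut -c "1-${prefix},${start}-${end}" source.txt > "file$i.txt"
done
```

Each iteration reads the whole file once, so the cost grows with the number of parts rather than with the number of lines handled one at a time in the shell.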