Reputation: 1150
I have a file that looks like:
t1 ATGCGTCCGTAGCAG
t2 ATGCCTAGCTAGGCT
i.e. a name followed by a (DNA) sequence. I would like to partition the sequences. For example, the sequences above have a length of 15, and I'd like to partition each of them into 3 parts of length 5. I want three new files such that:
t1 ATGCG
t2 ATGCC

t1 TCCGT
t2 TAGCT

t1 AGCAG
t2 AGGCT
I am trying to write a shell script to accomplish this. One way would be to write a for loop that gets the Nth line of the file using sed '$Nq;d', cuts it with cut -c, and saves the result into a variable. Then, using cut, head, tail, and one more variable, I could achieve it. But I am wondering whether there is a better way (in terms of neatness and speed) to do this.
PS: The actual files will contain 1,000-10,000 lines, each sequence will be 10-50k characters long, and I will partition the sequences into pieces of length 1-2k.
Upvotes: 3
Views: 156
Reputation: 1317
awk can help
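awk '{for(i=1;i<=3;i++) print $1" "substr($2,5*(i-1)+1,5) > ("file"i".txt")}' inputfilename
The same command, expanded:
awk '{
    for(i=1;i<=3;i++)
        print $1" "substr($2,5*(i-1)+1,5) > ("file"i".txt")
}' inputfilename
Inside awk, > truncates each output file only when it is first opened, so later lines are appended within the same run, and rerunning the command does not pile up old output. The parentheses around the file name keep the concatenation unambiguous across awk implementations.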
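The question mentions sequences of 10-50k characters split into pieces of 1-2k, so the number of parts and the piece length are unlikely to stay at 3 and 5. Below is a minimal sketch of the same awk idea with the piece length passed in as a variable. The function name split_seqs, its arguments, and the demo directory are hypothetical names chosen for illustration; they are not part of the answer above.

```shell
#!/bin/bash
# split_seqs INPUT LEN [PREFIX]
# Hypothetical helper: writes piece i of every sequence to PREFIX<i>.txt.
split_seqs() {
    awk -v len="$2" -v prefix="${3:-file}" '
    {
        # Number of pieces for this line, rounded up so a short tail is kept.
        n = int((length($2) + len - 1) / len)
        for (i = 1; i <= n; i++)
            print $1, substr($2, (i - 1) * len + 1, len) > (prefix i ".txt")
    }' "$1"
}

# Demo on the two sample lines from the question, in a scratch directory.
demo_dir=$(mktemp -d)
printf 't1 ATGCGTCCGTAGCAG\nt2 ATGCCTAGCTAGGCT\n' > "$demo_dir/input.txt"
split_seqs "$demo_dir/input.txt" 5 "$demo_dir/file"
cat "$demo_dir/file2.txt"
```

With 50k-long sequences and 1k pieces this keeps about 50 output files open at once, which some awk implementations may limit, so it is worth checking on the awk you actually use.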
Upvotes: 1
Reputation: 53525
The following uses bash substring notation (i.e. ${string:start:length}) to extract the requested output:
#!/bin/bash
# Read the name and the sequence as separate fields, then slice the sequence.
while read -r name seq || [[ -n "$name" ]]; do
    echo "$name ${seq:0:5}" >> file1
    echo "$name ${seq:5:5}" >> file2
    echo "$name ${seq:10:5}" >> file3
done < "$1"
Save it into myscript.sh and run it with: ./myscript.sh <input-file>
Upvotes: 2
Reputation: 198314
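One-liner solution that uses a single loop:
for i in $(seq 3); do cut -c1-3,$((3 + (i - 1) * 5 + 1))-$((3 + i * 5)) < source.txt > file$i.txt ; done
The first range (1-3) keeps the name and the space after it, which assumes every name has the same width as in the sample. Adjust the calculation for your own widths. You really don't need to do this line by line; that would be very slow.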
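The width adjustment can also be derived from the file itself instead of being hard-coded. The sketch below assumes that every line has a name of the same width, a single space as separator, and a sequence of the same length as the first line; the function name split_fixed and the demo directory are hypothetical.

```shell
#!/bin/bash
# split_fixed INPUT LEN [PREFIX]
# Hypothetical helper: one cut call per piece, widths measured on the first line.
split_fixed() {
    local input=$1 len=$2 prefix=${3:-file}
    local first name name_w seq_len parts i start
    IFS= read -r first < "$input"
    name=${first%% *}                       # text before the first space
    name_w=$(( ${#name} + 1 ))              # name plus the separating space
    seq_len=$(( ${#first} - name_w ))
    parts=$(( (seq_len + len - 1) / len ))  # round up to keep a short tail
    for (( i = 1; i <= parts; i++ )); do
        start=$(( name_w + (i - 1) * len + 1 ))
        cut -c "1-${name_w},${start}-$(( start + len - 1 ))" < "$input" > "${prefix}${i}.txt"
    done
}

# Demo on the two sample lines from the question, in a scratch directory.
cut_demo=$(mktemp -d)
printf 't1 ATGCGTCCGTAGCAG\nt2 ATGCCTAGCTAGGCT\n' > "$cut_demo/source.txt"
split_fixed "$cut_demo/source.txt" 5 "$cut_demo/file"
cat "$cut_demo/file3.txt"
```

This reads the whole input once per piece, so it trades extra passes over the file for not having to process it line by line in the shell.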
Upvotes: 1