Reputation: 623
I'm trying to write a script that loops over one file and extracts substrings from it, taking the cut positions from a second file. I'm working in bash on MobaXterm. The file cut_positions.txt is tab delimited and lists name, start point, end point, length and comment:
k141_20066 103484 104617 1133 phnW
k141_20841 13200 14324 1124 phnW
k141_23852 69 452 383 phnW
k141_32328 1 180 179 phnW
The second file, string_file.txt, holds the name (removing or adding the ">" in one of the files would be no problem) and the string (the original strings are much longer, up to 1,000,000 characters):
>k141_10671 CCTTCCCCCACACGCCGCTCTTCCGCTCTTGCTGGCC
>k141_10707 AGGCGGTATCAGACCTTGCCGCAACACTAAGCCCAGTAACGCTGTCGCCCTTATATCTGA
>k141_11190 CTTTTGTGACAGTGCAGGGCAATGGTGGATTTATCAGTATCGGGCAGAA
>k141_1479 AGCCGACAGCAGCGCCGAGGGCACATAATCCGATGACACGATGTCCAAAAGATCCGCCTCGGC
Now I want to use cut_positions.txt as input: the first column to match the right line, the second column as the start point of the substring and the fourth column as its length. This should be done for every line in cut_positions.txt, with the results written to a new out.txt. To get closer I tried (with my original data):
➤ grep ">k141_28027\b" test_out_one_line.txt | awk '{print substr($2,57251,69)}'
TCACTTGAGCGCAATTATTCGCTCTCCGGCGGCGTCAGCATCAGCCTGATCATGCGTCACCAAAAGTGT
which worked well as a manual approach. I also figured out how to access individual elements in cut_positions.txt (here the second column of the first row):
awk -F '\t' 'NR==1{print $2}' cut_positions.txt
but I can't figure out how to turn this into a loop, as I don't know how to connect the redirections and piping steps that I used for the small steps. Any help is very much appreciated (and tell me if you need more sample data).
Thanks, crazysantaclaus
Upvotes: 0
Views: 145
Reputation: 158020
The following script should work for you:
cut.awk
# We are reading two files: pos.txt and strings.txt
# NR is equal to FNR as long as we are reading the
# first file.
NR==FNR{
    pos[">"$1]=$2 # Store the startpoint in an array pos (indexed by $1)
    len[">"$1]=$4 # Store the length in an array len (indexed by $1)
    next # skip the block below for pos.txt
}
# This runs on every line of strings.txt
$1 in pos {
    # Extract a substring of $2 based on the position and length
    # stored above
    key=$1
    mod=substr($2,pos[key],len[key])
    $2=mod
    print # Print the modified line
}
Call it like this:
awk -f cut.awk pos.txt strings.txt
One important thing to mention: substr() assumes that strings start at index 1, in contrast to most programming languages, where strings start at index 0. If the positions in pos.txt are 0-based, the substr() call must become:
mod=substr($2,pos[key]+1,len[key])
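The indexing difference can be checked quickly from the shell. This is only a small sketch: the sample string "123456" and the variable names one_based, zero_based and start are made up for illustration and are not part of the script above.

```shell
# substr() counts from 1: start 2, length 3 of "123456"
one_based=$(awk 'BEGIN { print substr("123456", 2, 3) }')
echo "$one_based"

# The same slice described by a 0-based start (1) needs the +1 correction
zero_based=$(awk 'BEGIN { start = 1; print substr("123456", start + 1, 3) }')
echo "$zero_based"
```

Both commands should print the same three characters, which shows that a 0-based start only has to be shifted by one before it is passed to substr().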
I recommend testing it with simplified, meaningful versions of:
pos.txt
foo 2 5 3 phnW
bar 4 5 1 phnW
test 1 5 4 phnW
and strings.txt
>foo 123456
>bar 123456
>non 123456
Output:
>foo 234
>bar 4
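To reproduce this test in one go, something like the following could be used. It is a sketch under a few assumptions: the files are created in a scratch directory from mktemp -d, pos.txt is written tab delimited as described in the question, the awk program is a condensed inline form of cut.awk, and the result is redirected to out.txt as the question asked. The variable names workdir and result are only illustrative.

```shell
#!/bin/sh
# Work in a scratch directory so no existing files are overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# pos.txt: name, start, end, length, comment (tab delimited)
printf 'foo\t2\t5\t3\tphnW\nbar\t4\t5\t1\tphnW\ntest\t1\t5\t4\tphnW\n' > pos.txt

# strings.txt: ">name" followed by the string
printf '>foo 123456\n>bar 123456\n>non 123456\n' > strings.txt

# Condensed inline version of cut.awk; awk's default field
# splitting handles both tabs and spaces.
awk '
NR == FNR { pos[">" $1] = $2; len[">" $1] = $4; next }   # first file: remember start and length
$1 in pos { $2 = substr($2, pos[$1], len[$1]); print }   # second file: cut and print matches
' pos.txt strings.txt > out.txt

result=$(cat out.txt)
printf '%s\n' "$result"
```

Names that appear in only one of the two files (here "test" and ">non") produce no output line, so out.txt contains only the entries that could be matched.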
Upvotes: 2