Reputation: 559
I have a file that looks like the following. For each line, I want to print columns one through five, followed by two values taken from the eighth column: the string between "EFF=" and the following "(", and the sixth element obtained by splitting the eighth column on the pipe character "|".
chr1 10150 . C T 6.72 . DP=6;VDB=0.0074;AF1=0.2932;CLR=6;AC1=1;DP4=3,1,1,1;MQ=30;FQ=7.98;PV4=1,0.33,1,0.22;EFF=DOWNSTREAM(MODIFIER||4212|||WASH7P||NON_CODING|NR_024540.1||1),INTERGENIC(MODIFIER||||||||||1),UPSTREAM(MODIFIER||1724|||DDX11L1||NON_CODING|NR_046018.2||1) GT:PL:DP:SP:GQ 0/0:0,6,26:2:0:9 0/1:38,0,48:4:0:36
chr1 10291 . C T 3.55 . DP=1;AF1=1;AC1=4;DP4=0,0,1,0;MQ=52;FQ=-27.4;EFF=DOWNSTREAM(MODIFIER||4071|||WASH7P||NON_CODING|NR_024540.1||1),INTERGENIC(MODIFIER||||||||||1),UPSTREAM(MODIFIER||1583|||DDX11L1||NON_CODING|NR_046018.2||1) GT:PL:DP:SP:GQ 0/1:0,0,0:0:0:3 0/1:31,3,0:1:0:5
chr1 10297 . C T 3.55 . DP=1;AF1=1;AC1=4;DP4=0,0,1,0;MQ=52;FQ=-27.4;EFF=DOWNSTREAM(MODIFIER||4065|||WASH7P||NON_CODING|NR_024540.1||1),INTERGENIC(MODIFIER||||||||||1),UPSTREAM(MODIFIER||1577|||DDX11L1||NON_CODING|NR_046018.2||1) GT:PL:DP:SP:GQ 0/1:0,0,0:0:0:3 0/1:31,3,0:1:0:5
chr1 10327 . T C 3.02 . DP=3;VDB=0.0160;AF1=1;AC1=4;DP4=0,0,1,0;MQ=56;FQ=-27.4;EFF=DOWNSTREAM(MODIFIER||4035|||WASH7P||NON_CODING|NR_024540.1||1),INTERGENIC(MODIFIER||||||||||1),UPSTREAM(MODIFIER||1547|||DDX11L1||NON_CODING|NR_046018.2||1) GT:PL:DP:SP:GQ 0/1:30,3,0:1:0:5 0/0:0,0,0:0:0:3
Desired output:
chr1 10150 . C T WASH7P DOWNSTREAM
chr1 10291 . C T WASH7P DOWNSTREAM
chr1 10297 . C T WASH7P DOWNSTREAM
chr1 10327 . T C WASH7P DOWNSTREAM
I can print the columns and the sixth pipe-delimited ("|") element of the eighth column using the following, but not the string between "EFF=" and the next "(".
awk '{split($8,a,"|"); print $1 "\t" $2 "\t" $3 "\t" $4 "\t" $5 "\t" a[6] "\t" a[8]}'
Upvotes: 0
Views: 77
Reputation: 203502
$ cat tst.awk
{
split($8,a,/[|(]|EFF=/)
print $1, $2, $3, $4, $5, a[8], a[2]
}
$ awk -f tst.awk file
chr1 10150 . C T WASH7P DOWNSTREAM
chr1 10291 . C T WASH7P DOWNSTREAM
chr1 10297 . C T WASH7P DOWNSTREAM
chr1 10327 . T C WASH7P DOWNSTREAM
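The indices 8 and 2 in the script above come from how split() cuts the field when "|", "(" and "EFF=" are all treated as separators. As a rough illustration, the sketch below dumps the first eight pieces for a shortened copy of the eighth field (the shortening and the variable names field and pieces are assumptions made here for readability, not part of the original answer):

```shell
# Shortened eighth field from the first sample line (trimmed here for readability).
field='DP=6;EFF=DOWNSTREAM(MODIFIER||4212|||WASH7P||NON_CODING|NR_024540.1||1)'

# Split on "|", "(" or the literal "EFF=" and list the first eight pieces.
pieces=$(printf '%s\n' "$field" | awk '{
    split($0, a, /[|(]|EFF=/)
    for (i = 1; i <= 8; i++)
        printf "a[%d]=%s\n", i, a[i]
}')

printf '%s\n' "$pieces"
```

Everything before "EFF=" lands in a[1], the effect name in a[2], and the empty strings between consecutive pipes still occupy an index each, which is why the gene name ends up in a[8].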
Upvotes: 0
Reputation: 36262
You can use match() with a regular expression that matches from EFF= up to the first opening parenthesis. It stores the matched text, EFF=DOWNSTREAM, in the eff array (as eff[0]), so you can then use substr() to extract the string between the equals sign and the opening parenthesis, like:
awk '
{split($8,a,"|");
match($8, "EFF=[^(]*", eff);
print $1 "\t" $2 "\t" $3 "\t" $4 "\t" $5 "\t" a[6] "\t" substr(eff[0], 5)}
' infile
It yields:
chr1 10150 . C T WASH7P DOWNSTREAM
chr1 10291 . C T WASH7P DOWNSTREAM
chr1 10297 . C T WASH7P DOWNSTREAM
chr1 10327 . T C WASH7P DOWNSTREAM
UPDATE: You are using an old (or at least non-GNU) version of awk, where the match() function only accepts two parameters, so you have to work with the RSTART and RLENGTH variables instead. Try this version:
awk '
{split($8,a,"|");
pos = match($8, "EFF=[^(]*");
print $1 "\t" $2 "\t" $3 "\t" $4 "\t" $5 "\t" a[6] "\t" substr($8, RSTART + 4, RLENGTH - 4)}
' infile
The result is the same as the previous one.
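To make the RSTART and RLENGTH arithmetic concrete, here is a small sketch on a shortened copy of the eighth field (the shortening, the variable names field and result, and the fallback message are assumptions added for illustration). It also checks the return value of match(), since on a line without "EFF=" RSTART is 0 and RLENGTH is -1, and substr() would quietly return an empty string:

```shell
# Shortened eighth field from the first sample line (trimmed here for readability).
field='DP=6;EFF=DOWNSTREAM(MODIFIER||4212|||WASH7P||NON_CODING|NR_024540.1||1)'

result=$(printf '%s\n' "$field" | awk '{
    # match() returns the start position of the match, or 0 if there is none.
    if (match($0, /EFF=[^(]*/))
        # Skip the 4 characters of "EFF=" at the start of the match.
        print RSTART, RLENGTH, substr($0, RSTART + 4, RLENGTH - 4)
    else
        print "no EFF= annotation"
}')

printf '%s\n' "$result"
```

This prints 6 14 DOWNSTREAM: the match starts at character 6 and spans the 14 characters of "EFF=DOWNSTREAM", so skipping 4 characters and taking the remaining 10 leaves only the effect name.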
Upvotes: 1