user3324491

Reputation: 559

print between two pattern matches on same line

I have a file that looks like the following. For each line I want to print the first five columns, followed by two values taken from the eighth column: the sixth element obtained by splitting that column on the pipes "|", and the string between "EFF=" and the following "(".

chr1    10150   .   C   T   6.72    .   DP=6;VDB=0.0074;AF1=0.2932;CLR=6;AC1=1;DP4=3,1,1,1;MQ=30;FQ=7.98;PV4=1,0.33,1,0.22;EFF=DOWNSTREAM(MODIFIER||4212|||WASH7P||NON_CODING|NR_024540.1||1),INTERGENIC(MODIFIER||||||||||1),UPSTREAM(MODIFIER||1724|||DDX11L1||NON_CODING|NR_046018.2||1) GT:PL:DP:SP:GQ  0/0:0,6,26:2:0:9    0/1:38,0,48:4:0:36
chr1    10291   .   C   T   3.55    .   DP=1;AF1=1;AC1=4;DP4=0,0,1,0;MQ=52;FQ=-27.4;EFF=DOWNSTREAM(MODIFIER||4071|||WASH7P||NON_CODING|NR_024540.1||1),INTERGENIC(MODIFIER||||||||||1),UPSTREAM(MODIFIER||1583|||DDX11L1||NON_CODING|NR_046018.2||1)    GT:PL:DP:SP:GQ  0/1:0,0,0:0:0:3 0/1:31,3,0:1:0:5
chr1    10297   .   C   T   3.55    .   DP=1;AF1=1;AC1=4;DP4=0,0,1,0;MQ=52;FQ=-27.4;EFF=DOWNSTREAM(MODIFIER||4065|||WASH7P||NON_CODING|NR_024540.1||1),INTERGENIC(MODIFIER||||||||||1),UPSTREAM(MODIFIER||1577|||DDX11L1||NON_CODING|NR_046018.2||1)    GT:PL:DP:SP:GQ  0/1:0,0,0:0:0:3 0/1:31,3,0:1:0:5
chr1    10327   .   T   C   3.02    .   DP=3;VDB=0.0160;AF1=1;AC1=4;DP4=0,0,1,0;MQ=56;FQ=-27.4;EFF=DOWNSTREAM(MODIFIER||4035|||WASH7P||NON_CODING|NR_024540.1||1),INTERGENIC(MODIFIER||||||||||1),UPSTREAM(MODIFIER||1547|||DDX11L1||NON_CODING|NR_046018.2||1) GT:PL:DP:SP:GQ  0/1:30,3,0:1:0:5    0/0:0,0,0:0:0:3

Desired output:

chr1    10150   .   C   T WASH7P DOWNSTREAM
chr1    10291   .   C   T WASH7P DOWNSTREAM
chr1    10297   .   C   T WASH7P DOWNSTREAM
chr1    10327   .   T   C WASH7P DOWNSTREAM

I can print the columns and the sixth pipe-delimited element of the eighth column using the following, but not the string between "EFF=" and the next "(".

awk '{split($8,a,"|"); print $1 "\t" $2 "\t" $3 "\t" $4 "\t" $5 "\t" a[6] "\t" a[8]}'

Upvotes: 0

Views: 77

Answers (2)

Ed Morton

Reputation: 203502

$ cat tst.awk
{
    split($8,a,/[|(]|EFF=/)
    print $1, $2, $3, $4, $5, a[8], a[2]
}

$ awk -f tst.awk file
chr1 10150 . C T WASH7P DOWNSTREAM
chr1 10291 . C T WASH7P DOWNSTREAM
chr1 10297 . C T WASH7P DOWNSTREAM
chr1 10327 . T C WASH7P DOWNSTREAM
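
To see why a[2] and a[8] are the right indices, it may help to dump the pieces that split() produces for one record. This is only a sketch: the record below is a shortened copy of the first sample line (the trimmed INFO field is an assumption made for brevity), and show_pieces is a hypothetical helper name.

```shell
# Shortened copy of the first sample record; only field 8 matters here.
line=$(printf 'chr1\t10150\t.\tC\tT\t6.72\t.\t%s' \
    'DP=6;EFF=DOWNSTREAM(MODIFIER||4212|||WASH7P||NON_CODING|NR_024540.1||1)')

# Print every non-empty piece of field 8 together with its array index.
show_pieces() {
    printf '%s\n' "$line" | awk '{
        n = split($8, a, /[|(]|EFF=/)
        for (i = 1; i <= n; i++)
            if (a[i] != "") print i, a[i]
    }'
}

show_pieces
```

Everything before "EFF=" lands in a[1], the effect name in a[2], and each pipe then advances the index by one, which is how the gene name ends up in a[8].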

Upvotes: 0

Birei

Reputation: 36262

You can use match(), which takes a regular expression, to match from EFF= up to the opening parenthesis. With GNU awk's three-argument form, the matched text (EFF=DOWNSTREAM) is stored in eff[0], so substr() can then drop the leading EFF= and leave only the string between the equal sign and the opening parenthesis, like:

awk '
    {split($8,a,"|"); 
    match($8, "EFF=[^(]*", eff); 
    print $1 "\t" $2 "\t" $3 "\t" $4 "\t" $5 "\t" a[6] "\t" substr(eff[0], 5)}
' infile

It yields:

chr1    10150   .       C       T       WASH7P  DOWNSTREAM
chr1    10291   .       C       T       WASH7P  DOWNSTREAM
chr1    10297   .       C       T       WASH7P  DOWNSTREAM
chr1    10327   .       T       C       WASH7P  DOWNSTREAM

UPDATE: You seem to be using an old (or at least non-GNU) version of awk, where match() accepts only two parameters, so you have to work with the RSTART and RLENGTH variables instead. Try this version:

awk '
    {split($8,a,"|"); 
    pos = match($8, "EFF=[^(]*"); 
    print $1 "\t" $2 "\t" $3 "\t" $4 "\t" $5 "\t" a[6] "\t" substr($8, RSTART + 4, RLENGTH - 4)}
' infile

The result is the same as the previous one.
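
The RSTART/RLENGTH version can be tidied a little by setting OFS and by checking the return value of match(). A minimal sketch, assuming tab-separated output is wanted and that a line without an EFF= annotation should simply get an empty last column (that fallback is my assumption, the question does not say); extract_eff is a hypothetical wrapper name:

```shell
# Print columns 1-5, the sixth pipe-delimited element of column 8,
# and the text between "EFF=" and the next "(".
extract_eff() {
    awk 'BEGIN { OFS = "\t" }
    {
        split($8, a, "|")
        eff = ""                          # fallback when there is no EFF= (assumption)
        if (match($8, /EFF=[^(]*/))       # two-argument match() sets RSTART and RLENGTH
            eff = substr($8, RSTART + 4, RLENGTH - 4)   # skip the 4 characters of "EFF="
        print $1, $2, $3, $4, $5, a[6], eff
    }' "$@"
}

# Demo on a shortened copy of the first sample record.
printf 'chr1\t10150\t.\tC\tT\t6.72\t.\tDP=6;EFF=DOWNSTREAM(MODIFIER||4212|||WASH7P||NON_CODING|NR_024540.1||1)\tGT\t0/0\n' | extract_eff
```

With a real file it would be called as extract_eff infile.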

Upvotes: 1
