Geroge

Reputation: 561

separate 8th field

I cannot work out how to split up the fields of my file:

chr2    215672546   rs6435862   G   T   54.00   LowDP;sb DP=10;TI=NM_000465;GI=BARD1;FC=Silent   ...   ...

I would like to print the first seven fields and, from the 8th field, print just DP=10 and GI=BARD1. The DP and GI info is always in the 8th field. The fields continue (...), so the 8th field is not the last one.

I know how to extract the 8th field:

awk '{print $8}' PLZ-10_S2.vcf  | awk -F ";" '/DP/ {OFS="\t"} {print $1}' 

and of course I know how to extract the first seven fields, but how do I put the two together? All fields are separated by tabs.

Upvotes: 0

Views: 201

Answers (2)

Ed Morton

Reputation: 203522

If DP= and GI= are always in the same position within $8:

$ awk 'BEGIN{FS=OFS="\t"} {split($8,a,/;/); $8=a[1]";"a[3]} 1' file
chr2    215672546       rs6435862       G       T       54.00   LowDP;sb       DP=10;GI=BARD1   ...     ...

If not:

$ awk 'BEGIN{FS=OFS="\t"} {split($8,a,/;/); $8=""; for (i=1;i in a;i++) $8 = $8 (a[i] ~ /^(DP|GI)=/ ? ($8?";":"") a[i] : "")} 1' file
chr2    215672546       rs6435862       G       T       54.00   LowDP;sb       DP=10;GI=BARD1   ...     ...
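To check that the second form really does not depend on where DP= and GI= sit inside $8, here is a minimal sketch run against a made-up line in which GI comes before DP. The function name keep_dp_gi and the reordered sample line are assumptions for illustration only; they are not from the question's file.

```shell
# keep_dp_gi: read tab-separated lines on stdin and reduce field 8
# to its DP= and GI= entries, wherever they appear within that field.
keep_dp_gi() {
    awk 'BEGIN { FS = OFS = "\t" }
    {
        n = split($8, parts, /;/)       # n = number of ";"-separated entries
        kept = ""
        for (i = 1; i <= n; i++)
            if (parts[i] ~ /^(DP|GI)=/)
                kept = kept (kept == "" ? "" : ";") parts[i]
        $8 = kept                       # assigning to a field rebuilds the line using OFS
        print
    }'
}

# Hypothetical line: same columns as in the question, but GI precedes DP in field 8.
printf 'chr2\t215672546\trs6435862\tG\tT\t54.00\tLowDP;sb\tTI=NM_000465;GI=BARD1;FC=Silent;DP=10\t...\t...\n' | keep_dp_gi
```

The kept entries come out in the order they had in the input, so for this line field 8 becomes GI=BARD1;DP=10. No pipe between two awk calls is needed, because changing $8 in place leaves the other fields untouched.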

Upvotes: 2

Birei

Reputation: 36262

One way is to split() the eighth field on semicolons and traverse the results, checking which of them begin with DP or GI:

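awk '
    BEGIN { FS = OFS = "\t" }

    { 
        # split() returns the number of pieces, which is more portable
        # than calling length() on an array
        n = split( $8, arr8, /;/ )
        $8 = "" 
        for ( i = 1; i <= n; i++ ) {
            # the trailing "=" avoids keys that merely start with DP or GI
            if ( arr8[i] ~ /^(DP|GI)=/ ) { 
                $8 = $8 arr8[i] ";" 
            }
        }
        # drop the trailing semicolon
        $8 = substr( $8, 1, length($8) - 1 )
        print $0
    }
' infile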

It yields:

chr2    215672546   rs6435862   G   T   54.00   LowDP;sb    DP=10;GI=BARD1  ... ...
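A note on the pattern: if the eighth field can hold keys that merely start with DP or GI, a pattern without a trailing = keeps those as well. The sketch below compares the two patterns. The helper name filter_info and the DP4= entry are assumptions made up for this illustration; they do not come from the question's data.

```shell
# filter_info PATTERN: from each ";"-separated line on stdin,
# keep only the entries that match PATTERN.
filter_info() {
    awk -v pat="$1" '{
        n = split($0, parts, /;/)
        out = ""
        for (i = 1; i <= n; i++)
            if (parts[i] ~ pat)
                out = out (out == "" ? "" : ";") parts[i]
        print out
    }'
}

info='DP=10;DP4=1,2,3,4;GI=BARD1'   # hypothetical field-8 content

printf '%s\n' "$info" | filter_info '^(DP|GI)'    # loose: DP4=... is kept too
printf '%s\n' "$info" | filter_info '^(DP|GI)='   # strict: only DP= and GI=
```

With the loose pattern the whole string comes back unchanged, while the strict one gives DP=10;GI=BARD1.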

Upvotes: 1
