justaguy

Reputation: 3022

Awk to split a tab-delimited input file using multiple delimiters in the same field

I am trying to use awk to split the file, skipping the header, into either an 8-column or a 6-column output. I am not sure I did the split correctly, though, because I need to split $2 first by the : and then by the -. The desired output of each awk command is below; one or the other is used depending on the situation. Thank you :).

file (tab-delimited)

Gene    Position    Strand
SMARCB1 22:24133967-24133967    +
RB1 13:49037865-49037865    -
SMARCB1 22:24176357-24176357    +

awk

awk -F'\t' -v OFS="\t" 'NR>1{split($2,a,":"); print a[1],a[2],a[3],"chr"$2,"0",$3,"GENE_ID="$1}'

Desired 8-column output (tab-delimited)

chr22   24133967    24133967    chr22:24133967-24133967 0   +   .   GENE_ID=SMARCB1
chr13   49037865    49037865    chr13:49037865-49037865 0   -   .   GENE_ID=RB1
chr22   24176357    24176357    chr22:24176357-24176357 0   +   .   GENE_ID=SMARCB1

awk

awk -F'\t' -v OFS="\t" 'NR>1{split($2,a,":"); print a[1],a[2],a[3],"chr"$2,".",$1,}'

Desired 6-column output (tab-delimited)

chr22   24133967    24133967    chr22:24133967-24133967 .   SMARCB1
chr13   49037865    49037865    chr13:49037865-49037865 .   RB1
chr22   24176357    24176357    chr22:24176357-24176357 .   SMARCB1

Upvotes: 0

Views: 437

Answers (1)

RomanPerekhrest

Reputation: 92854

Extended approach:

For 6-column output:

awk -v c=6 'BEGIN{ FS=OFS="\t" }NR>1{ split($2,a,":|-"); k="chr"; 
             printf("%s\t%d\t%d\t%s\t",k a[1],a[2],a[3],k $2); 
             if (c==6) print ".",$1; else print "0",$3,".","GENE_ID="$1 }' file

The output:

chr22   24133967    24133967    chr22:24133967-24133967 .   SMARCB1
chr13   49037865    49037865    chr13:49037865-49037865 .   RB1
chr22   24176357    24176357    chr22:24176357-24176357 .   SMARCB1

For 8-column output (pass the desired column count through the -v c=<number> variable):

awk -v c=8 'BEGIN{ FS=OFS="\t" }NR>1{ split($2,a,":|-"); k="chr"; 
             printf("%s\t%d\t%d\t%s\t",k a[1],a[2],a[3],k $2); 
             if (c==6) print ".",$1; else print "0",$3,".","GENE_ID="$1 }' file

The output:

chr22   24133967    24133967    chr22:24133967-24133967 0   +   .   GENE_ID=SMARCB1
chr13   49037865    49037865    chr13:49037865-49037865 0   -   .   GENE_ID=RB1
chr22   24176357    24176357    chr22:24176357-24176357 0   +   .   GENE_ID=SMARCB1
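
The key step in both commands is the third argument to split(), which awk treats as a regular expression when it is longer than one character, so ":|-" breaks $2 on either delimiter in a single call. A minimal sketch of that step in isolation follows. The helper name split_position is hypothetical and exists only for illustration, and the bracket expression /[:-]/ is assumed here as an equivalent way of writing ":|-":

```shell
# Hypothetical helper (not part of the answer above): split one
# "chrom:start-end" string on either ':' or '-'.
split_position() {
    printf '%s\n' "$1" |
    awk '{
        # /[:-]/ is a bracket expression matching ":" or "-";
        # split() returns the number of pieces it produced.
        n = split($0, a, /[:-]/)
        print n, a[1], a[2], a[3]
    }'
}

split_position '22:24133967-24133967'
# prints: 3 22 24133967 24133967
```

Checking that the returned count is 3 before printing would be one way to skip malformed Position values, should the input contain any.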

Upvotes: 2
