Reputation: 311
I have a large file with lines like so
chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
I want to extract ENSG00000223972.5, DDX11L1, chr1, 11869 and 14409. I have succeeded in the first two by:
awk 'BEGIN {FS="\""}; {print $2"\t"$6}' file.txt
I'm struggling to now extract the chr1, 11869 and 14409 as this will need a different feild seperator? How is this done on the same ;line??
Upvotes: 0
Views: 74
Reputation: 203512
$ awk -F'[ "]+' -v OFS='\t' '{print $1, $4, $5, $10, $16}' file
chr1 11869 14409 ENSG00000223972.5 DDX11L1
Upvotes: 1
Reputation: 4043
Try to use following command to extract what you want,
awk 'BEGIN {FS="\"";OFS="\t"}; {split($1,a,/[\ ]*/); print a[1],a[4],a[5],$2,$6}' file.txt
Brief explanation,
split($1,a,/[\ ]*/
: split $1
into the array a
, and the separators would be regex /[\ ]*/
a
as you desired.Upvotes: 1