Reputation: 9782
I am using cut to extract columns in a tab delim file:
cut -f 14 glra3res.vcf
where the result of this is:
STRAND=-1;SYMBOL=GLRA3;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:4326;BIOTYPE=protein_coding;CANONICAL=YES;CCDS=CCDS54942.1;ENSP=ENSP00000411593;SWISSPROT=P23415;UNIPARC=UPI0000DA6BF2;SIFT=deleterious(0.02);PolyPhen=benign(0.167);EXON=9/9;DOMAINS=Superfamily_domains:SSF90112;HGVSc=ENST00000455880.2:c.1363C>A;HGVSp=ENSP00000411593.2:p.His455Asn;AA_MAF=T:0;EA_MAF=T:0.000116
STRAND=-1;SYMBOL=GLRA3;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:4326;BIOTYPE=protein_coding;CCDS=CCDS4320.1;ENSP=ENSP00000274576;SWISSPROT=P23415;TREMBL=Q14C71;UNIPARC=UPI000013DA17;SIFT=deleterious(0.02);PolyPhen=benign(0.315);EXON=9/9;DOMAINS=Superfamily_domains:SSF90112;HGVSc=ENST00000274576.6:c.1339C>A;HGVSp=ENSP00000274576.4:p.His447Asn;AA_MAF=T:0;EA_MAF=T:0.000116
I want to extract the string between SYMBOL= and ;, which would result in GLRA3.
I am trying to pipe this into a grep command:
cut -f 14 glra1res.vcf | grep 'SYMBOL='
which of course picks out SYMBOL=, and I can also pick out only ;. I am having difficulty combining the two to get the string between them. Simply doing
cut -f 14 glra1res.vcf | grep 'SYMBOL=' | grep ';'
ignores the SYMBOL=, and I thought that if I could pick both out then that would be a start...
Upvotes: 0
Views: 175
Reputation: 6378
With perl, if you split on both ; and = you can build a hash of hashes for each, errm, "gene" (?) or line in the file. This example uses the "topic" variables $_ and %_ and the "autosplit" array @F (made with -a and -F; see perlrun for details on switches) to print out the value of the "SYMBOL" key from the default hash (%_):
perl -F"/;|=/" -anE '$_{$.}={@F} ;}{ say $_{$_}{SYMBOL} for keys %_' data.txt
That way you can pick which value you want to print out by changing the key - e.g.:
perl -F"/;|=/" -anE '$_{$.}={@F} ;}{ say $_{$_}{CCDS} for keys %_' data.txt
An array of hashes is possible too of course:
perl -F"/;|=/" -anE 'push @genes, {@F} ;}{ say ${$_}{CCDS} for @genes' data.txt
I find if I start using data structures right away (even in a one-liner) it makes it easier to start imagining a longer script or application. One of the nicest tools for doing that is Data::Printer
which lets you "see" into hashes and arrays: perl -MDDP -F"/;|=/" -lane '$_{$.}={@F};}{ p %_' data.txt
Upvotes: 0
Reputation: 902
You can also try this as a Perl one-liner:
InputFile:
STRAND=-1;SYMBOL=GLRA3;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:4326;BIOTYPE=protein_coding;CANONICAL=YES;CCDS=CCDS54942.1;ENSP=ENSP00000411593;SWISSPROT=P23415;UNIPARC=UPI0000DA6BF2;SIFT=deleterious(0.02);PolyPhen=benign(0.167);EXON=9/9;DOMAINS=Superfamily_domains:SSF90112;HGVSc=ENST00000455880.2:c.1363C>A;HGVSp=ENSP00000411593.2:p.His455Asn;AA_MAF=T:0;EA_MAF=T:0.000116
STRAND=-1;SYMBOL=GLRA3;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:4326;BIOTYPE=protein_coding;CCDS=CCDS4320.1;ENSP=ENSP00000274576;SWISSPROT=P23415;TREMBL=Q14C71;UNIPARC=UPI000013DA17;SIFT=deleterious(0.02);PolyPhen=benign(0.315);EXON=9/9;DOMAINS=Superfamily_domains:SSF90112;HGVSc=ENST00000274576.6:c.1339C>A;HGVSp=ENSP00000274576.4:p.His447Asn;AA_MAF=T:0;EA_MAF=T:0.000116
Code: (Windows prompt)
perl -lne "if($_ =~ /SYMBOL=(.*?[^;]);/i) { print $1;}" InputFile
Shell prompt:
perl -lne 'if($_ =~ /SYMBOL=(.*?[^;]);/i) { print $1;}' InputFile
Output:
GLRA3
GLRA3
Upvotes: 0
Reputation: 290165
This can be done with grep and look-behind:
... | grep -Po '(?<=SYMBOL=)[^;]*'
GLRA3
GLRA3
It gets [^;]* when it occurs after SYMBOL=. And [^;]* means "any run of characters up to, but not including, the next ;".
Note you were not that far from the solution. If you use -o, grep prints only the matched text, that is, SYMBOL= and everything after it up to the next ;:
... | grep -o 'SYMBOL=[^;]*'
SYMBOL=GLRA3
SYMBOL=GLRA3
Then you can add the -P option to use \K, which drops everything matched so far from the reported match, so only what follows is printed:
... | grep -Po 'SYMBOL=\K[^;]*'
GLRA3
GLRA3
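The -P option is a GNU grep extension and is missing from some greps (the one shipped with macOS, for example). As a rough sketch of the same idea without -P, the -o match can be trimmed with cut. The helper name extract_field and the sample lines are made up for illustration, and like the patterns above it would also match a key that merely ends in the requested name.

```shell
# Hypothetical helper: print the value of KEY from "K=V;K=V" lines on stdin.
# Uses only grep -o (no -P), then strips the "KEY=" prefix with cut.
extract_field() {
    key=$1
    # -f2- keeps the whole value even if the value itself contains "="
    grep -o "${key}=[^;]*" | cut -d= -f2-
}

# Made-up sample input in the same shape as the cut -f 14 output
printf '%s\n' \
    'STRAND=-1;SYMBOL=GLRA3;SYMBOL_SOURCE=HGNC' \
    'STRAND=-1;SYMBOL=GLRA3;CCDS=CCDS4320.1' |
    extract_field SYMBOL
```

With the real file this would be used as cut -f 14 glra3res.vcf | extract_field SYMBOL.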
Upvotes: 4
Reputation: 204259
You don't need a bunch of different commands and pipes, just one simple awk command. Look, imagine you have this tab-separated file that you currently run cut on:
$ cat file
abc STRAND=-1;SYMBOL=GLRA3;SYMBOL_SOURCE=HGNC def
gh STRAND=-1;SYMBOL=GLRA3;SYMBOL_SOURCE=HGNC ij
$ cut -f2 file
STRAND=-1;SYMBOL=GLRA3;SYMBOL_SOURCE=HGNC
STRAND=-1;SYMBOL=GLRA3;SYMBOL_SOURCE=HGNC
Now just run this awk script on it instead:
$ awk -F'\t' '{split($2,a,/[;=]/); print a[4]}' file
GLRA3
GLRA3
Change $2 to $14 for your real file.
If "SYMBOL" isn't always in the same location just create an array mapping names to values and print whatever value you like by its name:
$ awk -F'\t' '{split($2,a,/[;=]/); for (i=1;i in a;i+=2) n2v[a[i]]=a[i+1]; print n2v["SYMBOL"]}' file
GLRA3
GLRA3
$ awk -F'\t' '{split($2,a,/[;=]/); for (i=1;i in a;i+=2) n2v[a[i]]=a[i+1]; print n2v["STRAND"]}' file
-1
-1
$ awk -F'\t' '{split($2,a,/[;=]/); for (i=1;i in a;i+=2) n2v[a[i]]=a[i+1]; print n2v["SYMBOL_SOURCE"]}' file
HGNC
HGNC
$ awk -F'\t' '{
    split($2,a,/[;=]/)
    for (i=1;i in a;i+=2) {
        n2v[a[i]]=a[i+1]
    }
    for (name in n2v) {
        print name, "->", n2v[name]
    }
}' file
SYMBOL -> GLRA3
STRAND -> -1
SYMBOL_SOURCE -> HGNC
SYMBOL -> GLRA3
STRAND -> -1
SYMBOL_SOURCE -> HGNC
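One thing to watch with the name-to-value array: it is never emptied, so if a key is missing from a later line, the value from an earlier line would still be printed. A possible variation, offered only as a sketch, clears the array for each line and prints a placeholder when the key is absent. The function name get_value, the "NA" placeholder and the sample input are assumptions for illustration.

```shell
# Hypothetical helper: print the value of field NAME from tab-separated
# column COL of stdin, or "NA" when that line has no such key.
get_value() {
    awk -F'\t' -v col="$1" -v name="$2" '{
        split("", n2v)                  # portable way to empty the array for each line
        n = split($col, a, /[;=]/)      # a[1]=key, a[2]=value, a[3]=key, ...
        for (i = 1; i < n; i += 2) n2v[a[i]] = a[i+1]
        print ((name in n2v) ? n2v[name] : "NA")
    }'
}

# Made-up sample: the second line has no SYMBOL key
printf 'abc\tSTRAND=-1;SYMBOL=GLRA3\tdef\ngh\tSTRAND=1;BIOTYPE=protein_coding\tij\n' |
    get_value 2 SYMBOL
```

For the real file the call would look like get_value 14 SYMBOL < glra3res.vcf.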
Upvotes: 1
Reputation: 16566
If you don't mind using sed:
bash-3.2$ cut -f 14 myfile | sed 's/.*SYMBOL=\([^;]*\);.*/\1/g'
GLRA3
GLRA3
And using only cut with the -d option (this relies on SYMBOL always being the second ;-separated field):
bash-3.2$ cut -f 14 myfile | cut -d';' -f 2|cut -d'=' -f 2
GLRA3
GLRA3
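Both commands above lean on the layout of the sample lines: the sed version needs a ; after the value and echoes lines without SYMBOL= unchanged, and the cut version needs SYMBOL in second position. A sketch of a more forgiving sed call follows. The function name symbol_only and the sample lines are made up for illustration.

```shell
# Hypothetical helper: print only the SYMBOL value, for lines that have one.
symbol_only() {
    # 1st s: prepend ";" so every key, including the first, is preceded by ";".
    # 2nd s: keep the text after ";SYMBOL=" up to the next ";" or end of line.
    # -n plus the p flag prints only lines where the 2nd substitution matched.
    sed -n 's/^/;/; s/.*;SYMBOL=\([^;]*\).*/\1/p'
}

# Made-up sample: SYMBOL first, SYMBOL absent, SYMBOL last
printf '%s\n' \
    'SYMBOL=GLRA3;STRAND=-1' \
    'STRAND=-1;SYMBOL_SOURCE=HGNC' \
    'STRAND=-1;CCDS=CCDS4320.1;SYMBOL=GLRA3' |
    symbol_only
```

Anchoring on ";SYMBOL=" also keeps a key such as SYMBOL_SOURCE, or any key that merely ends in SYMBOL, from being picked up.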
Upvotes: 4