brucezepplin

Reputation: 9782

extract string between two strings of a cut result

I am using cut to extract a column from a tab-delimited file:

cut -f 14 glra3res.vcf

where the result of this is:

STRAND=-1;SYMBOL=GLRA3;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:4326;BIOTYPE=protein_coding;CANONICAL=YES;CCDS=CCDS54942.1;ENSP=ENSP00000411593;SWISSPROT=P23415;UNIPARC=UPI0000DA6BF2;SIFT=deleterious(0.02);PolyPhen=benign(0.167);EXON=9/9;DOMAINS=Superfamily_domains:SSF90112;HGVSc=ENST00000455880.2:c.1363C>A;HGVSp=ENSP00000411593.2:p.His455Asn;AA_MAF=T:0;EA_MAF=T:0.000116
STRAND=-1;SYMBOL=GLRA3;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:4326;BIOTYPE=protein_coding;CCDS=CCDS4320.1;ENSP=ENSP00000274576;SWISSPROT=P23415;TREMBL=Q14C71;UNIPARC=UPI000013DA17;SIFT=deleterious(0.02);PolyPhen=benign(0.315);EXON=9/9;DOMAINS=Superfamily_domains:SSF90112;HGVSc=ENST00000274576.6:c.1339C>A;HGVSp=ENSP00000274576.4:p.His447Asn;AA_MAF=T:0;EA_MAF=T:0.000116

I want to extract the string between SYMBOL= and ;, which would result in GLRA3.

I am trying to pipe this into a grep command:

cut -f 14 glra1res.vcf | grep 'SYMBOL='

This of course matches the lines containing SYMBOL=, and I can also match only ;, but I am having difficulty combining the two to get the string between them. Simply doing

cut -f 14 glra1res.vcf | grep 'SYMBOL=' | grep ';'

ignores the SYMBOL=, and I thought that if I could pick both out, that would be a start.

Upvotes: 0

Views: 175

Answers (5)

G. Cito

Reputation: 6378

With Perl, if you split on both ; and = you can build a hash of hashes keyed by line number, with one inner hash per line (or "gene") in the file. This example uses the "topic" variables $_ and %_ and the "autosplit" array @F (enabled with -a and -F; see perlrun for details on the switches) to print the value of the "SYMBOL" key from the default hash (%_):

perl -F"/;|=/" -anE '$_{$.}={@F} ;}{ say $_{$_}{SYMBOL} for keys %_' data.txt

That way you can pick which value you want to print out by changing the key - e.g.:

perl -F"/;|=/" -anE '$_{$.}={@F} ;}{ say $_{$_}{CCDS} for keys %_' data.txt

An array of hashes is possible too of course:

perl -F"/;|=/" -anE 'push @genes, {@F} ;}{ say ${$_}{CCDS} for @genes' data.txt

I find that if I start using data structures right away (even in a one-liner), it is easier to imagine growing the code into a longer script or application. One of the nicest tools for this is Data::Printer, which lets you "see" into hashes and arrays:

perl -MDDP -F"/;|=/" -lane '$_{$.}={@F};}{ p %_' data.txt

Upvotes: 0

Praveen

Reputation: 902

You can also do this with a Perl one-liner:

InputFile:

STRAND=-1;SYMBOL=GLRA3;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:4326;BIOTYPE=protein_coding;CANONICAL=YES;CCDS=CCDS54942.1;ENSP=ENSP00000411593;SWISSPROT=P23415;UNIPARC=UPI0000DA6BF2;SIFT=deleterious(0.02);PolyPhen=benign(0.167);EXON=9/9;DOMAINS=Superfamily_domains:SSF90112;HGVSc=ENST00000455880.2:c.1363C>A;HGVSp=ENSP00000411593.2:p.His455Asn;AA_MAF=T:0;EA_MAF=T:0.000116
STRAND=-1;SYMBOL=GLRA3;SYMBOL_SOURCE=HGNC;HGNC_ID=HGNC:4326;BIOTYPE=protein_coding;CCDS=CCDS4320.1;ENSP=ENSP00000274576;SWISSPROT=P23415;TREMBL=Q14C71;UNIPARC=UPI000013DA17;SIFT=deleterious(0.02);PolyPhen=benign(0.315);EXON=9/9;DOMAINS=Superfamily_domains:SSF90112;HGVSc=ENST00000274576.6:c.1339C>A;HGVSp=ENSP00000274576.4:p.His447Asn;AA_MAF=T:0;EA_MAF=T:0.000116

Code: (Windows prompt)

perl -lne "if($_ =~ /SYMBOL=(.*?[^;]);/i) { print $1;}" InputFile

Shell prompt:

perl -lne 'if($_ =~ /SYMBOL=(.*?[^;]);/i) { print $1;}' InputFile

Output:

GLRA3
GLRA3

Upvotes: 0

fedorqui

Reputation: 290165

This can be done with grep and look-behind:

... | grep -Po '(?<=SYMBOL=)[^;]*'
GLRA3
GLRA3

It matches [^;]* only when it is preceded by SYMBOL=. Here [^;]* means "any run of characters up to, but not including, the next ;".


Note that you were not far from the solution. With -o, grep prints only the matched part of each line, here SYMBOL= plus everything up to the next ;:

... | grep -o 'SYMBOL=[^;]*'
SYMBOL=GLRA3
SYMBOL=GLRA3

Then you can add the -P option to use \K, which discards the text matched so far and prints only what follows:

... | grep -Po 'SYMBOL=\K[^;]*'
GLRA3
GLRA3
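
If your grep has -o but not -P (PCRE support is not available in every grep implementation), the SYMBOL= prefix can be trimmed with cut instead. This is only a minimal sketch: the function name extract_symbol and the sample line are made up for illustration.

```shell
# Hypothetical helper: reads annotation strings on stdin and prints
# the value of SYMBOL for every match found.
extract_symbol() {
    # -o prints only the matched part, e.g. "SYMBOL=GLRA3";
    # cut then keeps everything after the first "=".
    grep -o 'SYMBOL=[^;]*' | cut -d= -f2-
}

# Made-up sample line in the same KEY=VALUE;KEY=VALUE layout
printf '%s\n' 'STRAND=-1;SYMBOL=GLRA3;SYMBOL_SOURCE=HGNC' | extract_symbol
```

Because the pattern is not anchored, a key that merely ends in SYMBOL= would match as well; SYMBOL_SOURCE= does not, since an underscore follows SYMBOL there.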

Upvotes: 4

Ed Morton

Reputation: 204259

You don't need a bunch of different commands and pipes, just one simple awk command. Look, imagine you have this tab-separated file that you currently run cut on:

$ cat file
abc     STRAND=-1;SYMBOL=GLRA3;SYMBOL_SOURCE=HGNC       def
gh      STRAND=-1;SYMBOL=GLRA3;SYMBOL_SOURCE=HGNC       ij

$ cut -f2 file
STRAND=-1;SYMBOL=GLRA3;SYMBOL_SOURCE=HGNC
STRAND=-1;SYMBOL=GLRA3;SYMBOL_SOURCE=HGNC

Now just run this awk script on it instead:

$ awk -F'\t' '{split($2,a,/[;=]/); print a[4]}' file
GLRA3
GLRA3

Change $2 to $14 for your real file.

If "SYMBOL" isn't always in the same location just create an array mapping names to values and print whatever value you like by its name:

$ awk -F'\t' '{split($2,a,/[;=]/); for (i=1;i in a;i+=2) n2v[a[i]]=a[i+1]; print n2v["SYMBOL"]}' file
GLRA3
GLRA3

$ awk -F'\t' '{split($2,a,/[;=]/); for (i=1;i in a;i+=2) n2v[a[i]]=a[i+1]; print n2v["STRAND"]}' file
-1
-1

$ awk -F'\t' '{split($2,a,/[;=]/); for (i=1;i in a;i+=2) n2v[a[i]]=a[i+1]; print n2v["SYMBOL_SOURCE"]}' file
HGNC
HGNC

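$ awk -F'\t' '{
    split("", n2v)              # clear names/values left over from the previous line
    split($2,a,/[;=]/)
    for (i=1;i in a;i+=2) {
        n2v[a[i]]=a[i+1]        # a[i] is a name, a[i+1] is its value
    }
    for (name in n2v) {
        print name, "->", n2v[name]
    }
}' file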
SYMBOL -> GLRA3
STRAND -> -1
SYMBOL_SOURCE -> HGNC
SYMBOL -> GLRA3
STRAND -> -1
SYMBOL_SOURCE -> HGNC
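
The name-to-value lookup above can also be wrapped in a small shell function. The following is a sketch under stated assumptions, not a drop-in tool: the name get_value and its argument order are invented here. It splits the column on ; only and then on the first = of each pair, so a value that itself contains = survives, and nothing carries over from one input line to the next.

```shell
# Hypothetical helper: get_value KEY COLUMN FILE
# Prints the value of KEY from the ;-separated KEY=VALUE list found in
# tab-separated column COLUMN of FILE (an empty line if KEY is absent).
get_value() {
    awk -F'\t' -v key="$1" -v col="$2" '{
        val = ""
        n = split($col, pair, ";")
        for (i = 1; i <= n; i++) {
            eq = index(pair[i], "=")          # position of the first "="
            if (eq > 0 && substr(pair[i], 1, eq - 1) == key) {
                val = substr(pair[i], eq + 1)
                break
            }
        }
        print val
    }' "$3"
}

# Made-up two-column demo file (tab-separated)
tmp=$(mktemp)
printf 'abc\tSTRAND=-1;SYMBOL=GLRA3;SYMBOL_SOURCE=HGNC\n' > "$tmp"
get_value SYMBOL 2 "$tmp"
rm -f "$tmp"
```

For the real file the call would presumably look like get_value SYMBOL 14 glra3res.vcf.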

Upvotes: 1

fredtantini

Reputation: 16566

If you don't mind using sed:

bash-3.2$ cut -f 14 myfile | sed 's/.*SYMBOL=\([^;]*\);.*/\1/g'
GLRA3
GLRA3

And using only cut with the -d option:

bash-3.2$ cut -f 14 myfile | cut -d';' -f 2|cut -d'=' -f 2
GLRA3
GLRA3
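
Both commands above make assumptions about the line: the sed one expects a ; after the value and prints non-matching lines unchanged, and the cut one expects SYMBOL to be the second ;-separated field. A slightly more defensive sed sketch (the function name symbol_only is invented for illustration) drops the trailing ; requirement and prints only the lines where the substitution happened:

```shell
# Hypothetical helper: reads annotation strings on stdin and prints the
# SYMBOL value for lines that have one, skipping all other lines.
symbol_only() {
    # -n suppresses the default output; the p flag prints only lines
    # where the substitution succeeded. [^;]* stops at the next ";" or
    # at the end of the line, whichever comes first.
    sed -n 's/.*SYMBOL=\([^;]*\).*/\1/p'
}

# Made-up input: the second line has no SYMBOL and is skipped
printf '%s\n' 'STRAND=-1;SYMBOL=GLRA3' 'STRAND=-1;BIOTYPE=protein_coding' | symbol_only
```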

Upvotes: 4
