M.sh
M.sh

Reputation: 167

shell script (with loop) to grep a list of strings one by one

I have a big data text file (more than 100,000 rows) in this format:

0.000197239;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=CLCNKA;GeneDetail.refGene=.;ExonicFunc
0.00118343;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=CLCNKA;GeneDetail.refGene=.;ExonicFunc.refGene=nonsynonymous_SNV;
0.00276134;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=CLCNKA;GeneDetail.refGene=.;
0.0607495;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=CLCNKA;GeneDetail.refGene=.;ExonicFunc.refGene=nonsynonymous_SNV;
0.00670611;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=XDH;GeneDetail.refGene=.;ExonicFunc.refGene=nonsynonymous_SNV;
0.000197239;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=XDH;GeneDetail.refGene=.;ExonicFunc.refGene=nonsynonymous_SNV;
0.000394477;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=GRK4;GeneDetail.refGene=.;ExonicFunc.refGene=nonsynonymous_SNV;
0.0108481;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=GRK4;GeneDetail.refGene=.;ExonicFunc.refGene=nonsynonymous_SNV;
0.000394477;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=GRK4;GeneDetail.refGene=.;ExonicFunc.refGene=nonsynonymous_SNV;
0.0108481;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=GRK4;GeneDetail.refGene=.;ExonicFunc.refGene=nonsynonymous_SNV;

Now, each row contains a gene name, such as in initial 4 rows there is CLCNKA gene. I am using grep command to count the frequency of each gene name in this data file, as:

grep -w "CLCNKA" my_data_file | wc -l

There are about 300 genes in a separate file which are to be searched in above data file. Can some expert please write a simple shell script with a loop to take gene name from a list one by one, and store its frequency in a separate file. So, the output file would be like this:

CLCNKA    4
XDH    2
GRK4    4

Upvotes: 0

Views: 1999

Answers (7)

Ed Morton
Ed Morton

Reputation: 204731

You've confused us. I and some others think all you want is a count of each gene in the file since that's what your input/output and some of your descriptive text states (count the frequency of each gene name in this data file) which would just be this:

$ awk -F'[=;]' '{cnt[$11]++} END{for (gene in cnt) print gene, cnt[gene]}' file
GRK4 4
CLCNKA 4
XDH 2

while everyone else thinks you want a count of specific genes that exist in a different file since that's what your Subject line, proposed algorithm and the rest of your text states.

If everyone else is right then you'd need this tweak to read the "genes" file first and only count the genes in "file" that were listed in "genes":

awk -F'[=;]' 'NR==FNR{genes[$0]; next} $11 in genes{cnt[$11]++} END{for (gene in cnt) print gene, cnt[gene]}' genes file
GRK4 4
CLCNKA 4
XDH 2

Your example doesn't help since it would produce the same output with either interpretation of your requirements so edit your question to clarify what it is you want. In particular if there are genes that you do NOT want counted then include lines containing those in the sample input.

Upvotes: 2

karakfa
karakfa

Reputation: 67567

if you only search for a list of genes, an inefficient but straightforward way

read g; do echo -n $g " "; grep -c $g file; done < genes

assuming your genes are listed one at a time in the genes file.

If your file structure is fixed, a more efficient version will be

awk 'NR==FNR{genes[$1];next} 
            {sub(/Gene.refGene=/,"",$6)} 
 $6 in genes{count[$6]++} 
         END{for(g in count) print g,count[g]}' genes FS=';' file

Upvotes: 0

Sundeep
Sundeep

Reputation: 23707

To preserve order provided input file is sorted as given in sample:

$ perl -lne '
($g) = /Gene\.refGene=([^;]+)/;
if($g ne $p && $. > 1)
{
    print "$p\t$c";
    $c = 0;
}
$c++; $p = $g;
END { print "$p\t$c" }' ip.txt
CLCNKA  4
XDH     2
GRK4    4


If not, use hash variable to increment gene name used as key and an array to store key order

$ perl -lne '
($k) = /Gene\.refGene=([^;]+)/;
push(@o, $k) if !$h{$k}++;
END { print "$_\t$h{$_}" foreach (@o) }' ip.txt
CLCNKA  4
XDH     2
GRK4    4

Upvotes: 0

Laser
Laser

Reputation: 6960

Here is one-liner:

sed "s/.*Gene.refGene=//;s/\;.*//" test | sort | uniq -c | awk '{print $2,$1}'

sed - will remove everything from line except gene name
sort will do sorting by name
uniq -c - will count number of gene repeats
awk with swap uniq output (by default it : count pattern)

Upvotes: 0

J&#252;rgen H&#246;tzel
J&#252;rgen H&#246;tzel

Reputation: 19757

A simpler solution relying on the uniq command:

#!/bin/bash

cut -d ';' -f 6|cut -d = -f 2|sort|uniq -c|while read -a kv;do
    echo  ${kv[1]} ${kv[0]}
done

Upvotes: 1

J&#252;rgen H&#246;tzel
J&#252;rgen H&#246;tzel

Reputation: 19757

This can also be done in pure bash, by using the associative array feature to count the frequencies:

#!/bin/bash

# declare assoc array
declare -A freq

# split stdin input csv
for gene in $(cut -d ';' -f 6|cut -d = -f 2);do
    let freq[$gene]++
done

# loop over array keys
for key in ${!freq[@]}; do
    echo ${key} ${freq[$key]}
done

Upvotes: 1

sjsam
sjsam

Reputation: 21975

awk is your friend

awk '{sub(/^.*Gene\.refGene=/,"");sub(/;.*$/,"");
     genelist[$0]++}END{for(i in genelist){print i,genelist[i]}}' file

Output

GRK4 4
CLCNKA 4
XDH 2

Sidenote: This may not give you the gene name frequency in the order in which they appear in the file. I guess that is not a requirement afterall.

Upvotes: 2

Related Questions