Reputation: 167
I have a big data text file (more than 100,000 rows) in this format:
0.000197239;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=CLCNKA;GeneDetail.refGene=.;ExonicFunc
0.00118343;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=CLCNKA;GeneDetail.refGene=.;ExonicFunc.refGene=nonsynonymous_SNV;
0.00276134;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=CLCNKA;GeneDetail.refGene=.;
0.0607495;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=CLCNKA;GeneDetail.refGene=.;ExonicFunc.refGene=nonsynonymous_SNV;
0.00670611;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=XDH;GeneDetail.refGene=.;ExonicFunc.refGene=nonsynonymous_SNV;
0.000197239;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=XDH;GeneDetail.refGene=.;ExonicFunc.refGene=nonsynonymous_SNV;
0.000394477;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=GRK4;GeneDetail.refGene=.;ExonicFunc.refGene=nonsynonymous_SNV;
0.0108481;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=GRK4;GeneDetail.refGene=.;ExonicFunc.refGene=nonsynonymous_SNV;
0.000394477;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=GRK4;GeneDetail.refGene=.;ExonicFunc.refGene=nonsynonymous_SNV;
0.0108481;AN=192;NS=2535;ANNOVAR_DATE=2015-12-14;Func.refGene=exonic;Gene.refGene=GRK4;GeneDetail.refGene=.;ExonicFunc.refGene=nonsynonymous_SNV;
Each row contains a gene name; for example, the first four rows contain the CLCNKA gene. I am using the grep command to count the frequency of each gene name in this data file:
grep -w "CLCNKA" my_data_file | wc -l
There are about 300 genes in a separate file which are to be searched for in the data file above. Could someone please write a simple shell script with a loop that takes each gene name from the list, one by one, and stores its frequency in a separate file? The output file would look like this:
CLCNKA 4
XDH 2
GRK4 4
Upvotes: 0
Views: 1999
Reputation: 204731
You've confused us. I and some others think all you want is a count of each gene in the file, since that's what your input/output and some of your descriptive text state ("count the frequency of each gene name in this data file"), which would just be this:
$ awk -F'[=;]' '{cnt[$11]++} END{for (gene in cnt) print gene, cnt[gene]}' file
GRK4 4
CLCNKA 4
XDH 2
while everyone else thinks you want a count of only those genes that exist in a different file, since that's what your subject line, proposed algorithm, and the rest of your text state.
If everyone else is right, then you'd need this tweak to read the "genes" file first and only count the genes in "file" that were listed in "genes":
awk -F'[=;]' 'NR==FNR{genes[$0]; next} $11 in genes{cnt[$11]++} END{for (gene in cnt) print gene, cnt[gene]}' genes file
GRK4 4
CLCNKA 4
XDH 2
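In both commands, -F'[=;]' splits each line on both = and ; delimiters, which is why the gene name lands in field 11 (field 1 is the leading number, field 10 is Gene.refGene, field 11 is its value). A quick way to sanity-check that field number against your own data:

```shell
# Print the 11th field of the first line; it should be a gene name
# such as CLCNKA if the file matches the format shown in the question.
awk -F'[=;]' 'NR==1{print $11; exit}' file
```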
Your example doesn't help since it would produce the same output under either interpretation of your requirements, so edit your question to clarify what it is you want. In particular, if there are genes that you do NOT want counted, then include lines containing those in the sample input.
Upvotes: 2
Reputation: 67567
If you only need to search for a list of genes, an inefficient but straightforward way:
while read -r g; do echo -n "$g "; grep -c "$g" file; done < genes
assuming your genes are listed one per line in the genes file.
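One caveat with the plain grep -c loop: it counts substring matches too, so a gene name that is a prefix of another (say GRK4 versus a hypothetical GRK44) would be over-counted. A sketch using -w for whole-word matching, under the same one-gene-per-line assumption (the gene_counts output filename is just a placeholder):

```shell
# -w restricts matches to whole words, so GRK4 will not match
# inside a longer identifier; quoting guards against odd characters.
while read -r g; do
  printf '%s %s\n' "$g" "$(grep -wc "$g" file)"
done < genes > gene_counts
```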
If your file structure is fixed, a more efficient version would be:
awk 'NR==FNR{genes[$1];next}
{sub(/Gene.refGene=/,"",$6)}
$6 in genes{count[$6]++}
END{for(g in count) print g,count[g]}' genes FS=';' file
Upvotes: 0
Reputation: 23707
To preserve order, provided the input file is grouped by gene as in the sample:
$ perl -lne '
($g) = /Gene\.refGene=([^;]+)/;
if($g ne $p && $. > 1)
{
print "$p\t$c";
$c = 0;
}
$c++; $p = $g;
END { print "$p\t$c" }' ip.txt
CLCNKA 4
XDH 2
GRK4 4
If not, use a hash to count occurrences keyed by gene name and an array to record the order of first appearance:
$ perl -lne '
($k) = /Gene\.refGene=([^;]+)/;
push(@o, $k) if !$h{$k}++;
END { print "$_\t$h{$_}" foreach (@o) }' ip.txt
CLCNKA 4
XDH 2
GRK4 4
Upvotes: 0
Reputation: 6960
Here is a one-liner:
sed 's/.*Gene\.refGene=//;s/;.*//' test | sort | uniq -c | awk '{print $2,$1}'
sed - removes everything from each line except the gene name
sort - groups identical names together
uniq -c - counts the number of repeats of each gene
awk - swaps the columns of the uniq output (by default it prints: count gene)
Upvotes: 0
Reputation: 19757
A simpler solution relying on the uniq command:
#!/bin/bash
cut -d ';' -f 6 | cut -d = -f 2 | sort | uniq -c | while read -a kv; do
    echo "${kv[1]} ${kv[0]}"
done
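The script reads the data on standard input, so (assuming it is saved under a hypothetical name such as count_genes.sh) it would be invoked like this:

```shell
# Feed the data file on stdin; redirect the counts wherever needed.
bash count_genes.sh < my_data_file > gene_counts
```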
Upvotes: 1
Reputation: 19757
This can also be done in pure bash, by using the associative array feature to count the frequencies:
#!/bin/bash
# declare assoc array
declare -A freq
# split the csv input arriving on stdin
for gene in $(cut -d ';' -f 6 | cut -d = -f 2); do
    let freq[$gene]++
done
# loop over array keys
for key in "${!freq[@]}"; do
    echo "${key} ${freq[$key]}"
done
Upvotes: 1
Reputation: 21975
awk is your friend:
awk '{sub(/^.*Gene\.refGene=/,"");sub(/;.*$/,"");
genelist[$0]++}END{for(i in genelist){print i,genelist[i]}}' file
Output
GRK4 4
CLCNKA 4
XDH 2
Sidenote: this may not print the gene names in the order in which they appear in the file. I guess that is not a requirement after all.
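If the first-seen order does matter, one possible tweak (a sketch, not part of the original answer) records each gene the first time it appears and prints in that order at the end:

```shell
awk '{sub(/^.*Gene\.refGene=/,"");sub(/;.*$/,"")
      if(!($0 in cnt)) order[++n]=$0   # remember first-seen order
      cnt[$0]++}
     END{for(i=1;i<=n;i++) print order[i], cnt[order[i]]}' file
```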
Upvotes: 2