Reputation: 351
I have one file like this:
head allGenes.txt
ENSG00000128274
ENSG00000094914
ENSG00000081760
ENSG00000158122
ENSG00000103591
...
and I have multiple files named like *.v7.egenes.txt in the current directory. For example, one file looks like this:
head Stomach.v7.egenes.txt
ENSG00000238009 RP11-34P13.7 1 89295 129223 - 2073 1.03557 343.245
ENSG00000237683 AL627309.1 1 134901 139379 - 2123 1.02105 359.907
ENSG00000235146 RP5-857K21.2 1 523009 530148 + 4098 1.03503 592.973
ENSG00000231709 RP5-857K21.1 1 521369 523833 - 4101 1.07053 559.642
ENSG00000223659 RP5-857K21.5 1 562757 564390 - 4236 1.05527 595.015
ENSG00000237973 hsa-mir-6723 1 566454 567996 + 4247 1.05299 592.876
I would like to get the lines from all *.v7.egenes.txt files that match any entry in allGenes.txt.
I tried using:
grep -w -f allGenes.txt *.v7.egenes.txt > output.txt
but this takes forever to complete. Is there any way to do this in awk or with some other tool?
Upvotes: 1
Views: 74
Reputation: 35106
Without knowing the size of the files, but assuming the host has enough memory to hold allGenes.txt in memory, one awk solution comes to mind:
awk 'NR==FNR { gene[$1] ; next } ( $1 in gene )' allGenes.txt *.v7.egenes.txt > output.txt
Where:
NR==FNR - this test only matches the first file to be processed (allGenes.txt)
gene[$1] - store each gene as an index in an associative array
next - stop processing and go to the next line in the file
$1 in gene - applies to all lines in all other files; if the first field is found to be an index in our associative array then we print the current line
I wouldn't expect this to run any/much faster than the grep solution the OP is currently using (especially with shelter's suggestion to use -F instead of -w), but it should be relatively quick to test and see ....
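To try this on a small scale first, here is a minimal sketch that runs the awk one-liner against made-up sample files in a scratch directory. The gene names, coordinates, the Liver file, and the demo_awk.txt / demo_grep.txt output names are assumptions for illustration only. Unlike grep with several input files, the plain awk one-liner does not say which file a line came from, so this variant prefixes each hit with FILENAME. For comparison it also runs grep with both -F and -w (combining the two is my assumption, not something stated above), which treats the gene IDs as fixed strings rather than regular expressions.

```shell
# Scratch directory with tiny, made-up sample data (not the real files).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' ENSG00000128274 ENSG00000094914 > allGenes.txt
printf '%s\n' \
  'ENSG00000128274 GENE_A 1 100 200 + 10 1.0 5.0' \
  'ENSG00000238009 GENE_B 1 300 400 - 20 1.1 6.0' > Stomach.v7.egenes.txt
printf '%s\n' \
  'ENSG00000094914 GENE_C 2 500 600 + 30 1.2 7.0' > Liver.v7.egenes.txt

# First file: remember every gene ID as an array index.
# Remaining files: print lines whose first field is a remembered ID,
# prefixed with the name of the file the line came from.
awk 'NR==FNR { gene[$1]; next } ($1 in gene) { print FILENAME ":" $0 }' \
  allGenes.txt *.v7.egenes.txt > demo_awk.txt

# Fixed-string, whole-word grep for comparison; LC_ALL=C avoids locale overhead.
# Note that grep matches the ID anywhere on the line, not only in field 1.
LC_ALL=C grep -F -w -f allGenes.txt *.v7.egenes.txt > demo_grep.txt

cat demo_awk.txt
```

On this sample data both commands select the same two lines. On the real files, wrapping each command in time would show which one is actually faster.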
Upvotes: 2
Reputation: 33740
GNU Parallel has a whole section dedicated to grepping n lines for m regular expressions: https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Grepping-n-lines-for-m-regular-expressions
Upvotes: 1
Reputation: 575
You could try a while read loop:
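#!/bin/bash
# Read allGenes.txt one gene ID per line; IFS= and -r keep each line intact.
while IFS= read -r line; do
  # -n prints line numbers, -w matches whole words only; one grep run per gene.
  grep -nw -e "$line" Stomach.v7.egenes.txt >> output.txt
done < allGenes.txt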
Here we tell the while loop to read every line of allGenes.txt and, for each line, check whether there are matching lines in the egenes file. Would that do the trick?
EDIT:
New version:
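#!/bin/bash
# Loop over every gene ID (assumes one whitespace-free ID per line).
for name in $(cat allGenes.txt); do
  # Search all egenes files; quoting "$name" prevents word splitting and globbing.
  grep -nw -e "$name" *.v7.egenes.txt >> output.txt
done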
Upvotes: 0