anamaria

Reputation: 351

How to parse a column from one file in multiple other columns and concatenate the output?

I have one file like this:

head allGenes.txt
ENSG00000128274
ENSG00000094914
ENSG00000081760
ENSG00000158122
ENSG00000103591
...

and I have multiple files named like *.v7.egenes.txt in the current directory. For example, one file looks like this:

head Stomach.v7.egenes.txt
ENSG00000238009 RP11-34P13.7  1  89295 129223  - 2073 1.03557 343.245
ENSG00000237683   AL627309.1  1 134901 139379  - 2123 1.02105 359.907
ENSG00000235146 RP5-857K21.2  1 523009 530148  + 4098 1.03503 592.973
ENSG00000231709 RP5-857K21.1  1 521369 523833  - 4101 1.07053 559.642
ENSG00000223659 RP5-857K21.5  1 562757 564390  - 4236 1.05527 595.015
ENSG00000237973 hsa-mir-6723  1 566454 567996  + 4247 1.05299 592.876 

I would like to get lines from all *.v7.egenes.txt files that match any entry in allGenes.txt

I tried using:

grep -w -f allGenes.txt *.v7.egenes.txt > output.txt

but this takes forever to complete. Is there any way to do this in awk or something similar?

Upvotes: 1

Views: 74

Answers (3)

markp-fuso

Reputation: 35106

Without knowing the size of the files, but assuming the host has enough memory to hold allGenes.txt, one awk solution comes to mind:

awk 'NR==FNR { gene[$1] ; next } ( $1 in gene )' allGenes.txt *.v7.egenes.txt > output.txt

Where:

  • NR==FNR - this test only matches the first file to be processed (allGenes.txt)
  • gene[$1] - store each gene as an index in an associative array
  • next - stop processing the current line and go to the next line in the file
  • $1 in gene - applies to all lines in all other files; if the first field is found to be an index in our associative array then we print the current line
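To see the steps above in action, here is a minimal sketch on made-up sample data. The directory name demo_awk, the Liver.v7.egenes.txt file and the shortened file contents are assumptions for illustration only, not the OP's real data.

```shell
#!/bin/sh
# Hypothetical scratch directory and sample data, for illustration only.
mkdir -p demo_awk

# Two genes we want to look up.
printf '%s\n' ENSG00000237683 ENSG00000223659 > demo_awk/allGenes.txt

# Two small egenes files; only some lines have a first field in allGenes.txt.
printf '%s\n' \
  'ENSG00000238009 RP11-34P13.7 1 89295 129223 - 2073 1.03557 343.245' \
  'ENSG00000237683 AL627309.1 1 134901 139379 - 2123 1.02105 359.907' \
  > demo_awk/Stomach.v7.egenes.txt
printf '%s\n' \
  'ENSG00000223659 RP5-857K21.5 1 562757 564390 - 4236 1.05527 595.015' \
  > demo_awk/Liver.v7.egenes.txt

# The first file fills the lookup array; every later file is filtered on field 1.
awk 'NR==FNR { gene[$1]; next } ($1 in gene)' \
  demo_awk/allGenes.txt demo_awk/*.v7.egenes.txt > demo_awk/output.txt

cat demo_awk/output.txt
```

With this sample data only the AL627309.1 and RP5-857K21.5 lines should end up in output.txt. Unlike grep run on several files, awk does not prefix each line with the file name. Also note that the NR==FNR test relies on allGenes.txt being non-empty; with an empty first file the test would also be true for the next file.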

I wouldn't expect this to run much faster, if at all, than the grep solution the OP is currently using (especially with shelter's suggestion to use -F instead of -w), but it should be relatively quick to test and see.
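For comparison, the grep variant can be tried on the same kind of toy data. This is only a sketch: the demo_grep directory and the sample lines are made up, and combining -F (treat the patterns as fixed strings rather than regular expressions) with -w and -h is my assumption rather than something stated in the question. -w and -h are available in GNU and BSD grep but are not required by POSIX.

```shell
#!/bin/sh
# Hypothetical scratch directory and sample data, for illustration only.
mkdir -p demo_grep

printf '%s\n' ENSG00000237683 ENSG00000223659 > demo_grep/allGenes.txt
printf '%s\n' \
  'ENSG00000238009 RP11-34P13.7 1 89295 129223 - 2073 1.03557 343.245' \
  'ENSG00000237683 AL627309.1 1 134901 139379 - 2123 1.02105 359.907' \
  > demo_grep/Stomach.v7.egenes.txt
printf '%s\n' \
  'ENSG00000223659 RP5-857K21.5 1 562757 564390 - 4236 1.05527 595.015' \
  > demo_grep/Liver.v7.egenes.txt

# -F: fixed-string patterns, -w: whole-word matches only,
# -h: do not prefix output lines with the file name, -f: read patterns from a file.
grep -h -F -w -f demo_grep/allGenes.txt demo_grep/*.v7.egenes.txt > demo_grep/output.txt

cat demo_grep/output.txt
```

Note that grep matches the gene ID anywhere on the line, while the awk version only compares the first field, so the two can differ if an ID ever shows up in another column. Wrapping each command in time on the real files is the simplest way to see which one is faster.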

Upvotes: 2

Ole Tange

Reputation: 33740

GNU Parallel has a whole section dedicated to grepping n lines for m regular expressions: https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Grepping-n-lines-for-m-regular-expressions

Upvotes: 1

Dexirian

Reputation: 575

You could try a while read loop:

#!/bin/bash

while read -r line; do
  grep -rnw Stomach.v7.egenes.txt -e "$line" >> output.txt
done < allGenes.txt

Here we tell the while loop to read every line from allGenes.txt and, for each line, check whether there are matching lines in the egenes file. Would that do the trick?

EDIT :

New version :

#!/bin/bash

for name in $(cat allGenes.txt); do
  grep -nw -e "$name" *.v7.egenes.txt >> output.txt
done

Upvotes: 0
