Komal Rathi

Reputation: 4274

Fast grep on huge csv files

I have a file (queryids.txt) with a list of 847 keywords to search for. I have to grep for those keywords in about 12 huge CSV files (the biggest has 2,184,820,000 lines). Eventually we will load the data into a database of some sort, but for now we just want certain keywords grepped out.

My command is:

LC_ALL=C fgrep -f queryids.txt subject.csv

I am thinking of writing a bash script like this:

#!/bin/bash

for f in *.csv
do
    ( echo "Processing $f"
    filename=$(basename "$f")
    filename="${filename%.*}"
    LC_ALL=C fgrep -f queryids.txt $f > $filename"_goi.csv" ) &
done

and I will run it using: nohup bash myscript.sh &

The queryids.txt looks like this:

ENST00000401850
ENST00000249005
ENST00000381278
ENST00000483026
ENST00000465765
ENST00000269080
ENST00000586539
ENST00000588458
ENST00000586292
ENST00000591459

The subject file looks like this:

target_id,length,eff_length,est_counts,tpm,id
ENST00000619216.1,68,2.65769E1,0.5,0.300188,00065a62-5e18-4223-a884-12fca053a109
ENST00000473358.1,712,5.39477E2,8.26564,0.244474,00065a62-5e18-4223-a884-12fca053a109
ENST00000469289.1,535,3.62675E2,4.82917,0.212463,00065a62-5e18-4223-a884-12fca053a109
ENST00000607096.1,138,1.92013E1,0,0,00065a62-5e18-4223-a884-12fca053a109
ENST00000417324.1,1187,1.01447E3,0,0,00065a62-5e18-4223-a884-12fca053a109

I am concerned this will take a long time. Is there a faster way to do this?

Thanks!

Upvotes: 1

Views: 1660

Answers (2)

anubhava

Reputation: 786329

A few things I can suggest to improve the performance:

  1. There is no need to spawn a sub-shell using ( ... ) &; you can use braces { ... } & if needed (see the sketch after the script below).
  2. Use grep -F (fixed-string, non-regex search) to make grep run faster.
  3. Avoid the basename command and use bash string manipulation instead.

Try this script:

#!/bin/bash

for f in *.csv; do
    echo "Processing $f"
    filename="${f##*/}"
    LC_ALL=C grep -Ff queryids.txt "$f" > "${filename%.*}_goi.csv"
done
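
If you still want each file handled as a background job, as mentioned in point 1 above, a variant along these lines should work. This is only a sketch: the trailing wait is my addition so the script doesn't exit before the background greps finish, and whether running 12 greps in parallel actually helps depends on your disk throughput.

#!/bin/bash

for f in *.csv; do
    {
        echo "Processing $f"
        filename="${f##*/}"
        LC_ALL=C grep -Ff queryids.txt "$f" > "${filename%.*}_goi.csv"
    } &          # one background job per file, no sub-shell needed
done
wait             # block until all background greps have finished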

I suggest you run this on a smaller dataset to compare the performance gain.

Upvotes: 2

Ed Morton

Reputation: 204721

You could try this instead:

awk '
BEGIN {
    while ( (getline line < "queryids.txt") > 0  ) {
        re = ( re=="" ? "" : re "|") line
    }
}
FNR==1 { close(out); out=FILENAME; sub(/\.[^.]+$/,"_goi&",out) }
$0 ~ re { print > out }
' *.csv

This uses a regexp rather than a string comparison; whether that matters and, if so, what we can do about it depends on the values in queryids.txt. In fact there may be a vastly faster and more robust way to do this depending on what your files contain, so if you edit your question to include some examples of your file contents we can be of more help.
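
For example (my own contrived illustration, not part of the original answer; the first transcript ID below is made up, and in your real data the last column is a UUID), $0 ~ re tests the whole record, so a query ID that happened to appear in any column would match, whereas an exact comparison against a single field would not:

# Hypothetical record: the query ID ENST00000401850 (from queryids.txt) sits in the
# last column rather than in target_id, yet the whole-line regex still matches it.
echo 'ENST00000999999.1,100,1.0E2,0,0,ENST00000401850' |
awk -v re='ENST00000401850' '$0 ~ re { print "matched:", $0 }'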

I see you have now posted some sample input and indeed we can do this much faster and more robustly by using a hash lookup:

awk '
BEGIN {
    FS="."
    while ( (getline line < "queryids.txt") > 0  ) {
        ids[line]
    }
}
FNR==1 { close(out); out=FILENAME; sub(/\.[^.]+$/,"_goi&",out) }
$1 in ids { print > out }
' *.csv
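
As a minimal check of what the FS="." setting buys you (just a sketch, using the first data line from your sample): with FS set to ".", $1 of each data line is everything before the first dot, i.e. the bare transcript ID without its version suffix, which is exactly the form the IDs take in queryids.txt, so the $1 in ids lookup matches them exactly.

# $1 is the versionless transcript ID, ready for an exact hash lookup.
echo 'ENST00000619216.1,68,2.65769E1,0.5,0.300188,00065a62-5e18-4223-a884-12fca053a109' |
awk -F. '{ print $1 }'
# prints: ENST00000619216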

Upvotes: 0
