Komal Rathi

Reputation: 4274

Fast grep on huge csv files

I have a file (queryids.txt) with a list of 847 keywords to search for. I have to grep for those keywords in about 12 huge CSV files (the biggest has 2,184,820,000 lines). Eventually we will load the data into a database of some sort, but for now we just want certain keywords grepped out.

My command is:

LC_ALL=C fgrep -f queryids.txt subject.csv

I am thinking of writing a bash script like this:

#!/bin/bash

for f in *.csv
do
    ( echo "Processing $f"
    filename=$(basename "$f")
    filename="${filename%.*}"
    LC_ALL=C fgrep -f queryids.txt $f > $filename"_goi.csv" ) &
done

and I will run it using: nohup bash myscript.sh &

The queryids.txt looks like this:

ENST00000401850
ENST00000249005
ENST00000381278
ENST00000483026
ENST00000465765
ENST00000269080
ENST00000586539
ENST00000588458
ENST00000586292
ENST00000591459

The subject file looks like this:

target_id,length,eff_length,est_counts,tpm,id
ENST00000619216.1,68,2.65769E1,0.5,0.300188,00065a62-5e18-4223-a884-12fca053a109
ENST00000473358.1,712,5.39477E2,8.26564,0.244474,00065a62-5e18-4223-a884-12fca053a109
ENST00000469289.1,535,3.62675E2,4.82917,0.212463,00065a62-5e18-4223-a884-12fca053a109
ENST00000607096.1,138,1.92013E1,0,0,00065a62-5e18-4223-a884-12fca053a109
ENST00000417324.1,1187,1.01447E3,0,0,00065a62-5e18-4223-a884-12fca053a109

I am concerned this will take a long time. Is there a faster way to do this?

Thanks!

Upvotes: 1

Views: 1660

Answers (2)

anubhava

Reputation: 786329

A few things I can suggest to improve the performance:

  1. There is no need to spawn a sub-shell using ( ... ) &; you can use braces { ... } & if needed (see the sketch after the script below).
  2. Use grep -F (fixed-string, non-regex search) to make grep run faster.
  3. Avoid the basename command and use bash string manipulation instead.

Try this script:

#!/bin/bash

for f in *.csv; do
    echo "Processing $f"
    filename="${f##*/}"
    LC_ALL=C grep -Ff queryids.txt "$f" > "${filename%.*}_goi.csv"
done
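
If you still want each file handled as a background job, as mentioned in point 1 above, a variant along these lines should work. This is only a sketch: the trailing wait is my addition so the script doesn't exit before the background greps finish, and whether running 12 greps in parallel actually helps depends on your disk throughput.

#!/bin/bash

for f in *.csv; do
    {
        echo "Processing $f"
        filename="${f##*/}"
        LC_ALL=C grep -Ff queryids.txt "$f" > "${filename%.*}_goi.csv"
    } &          # one background job per file, no sub-shell needed
done
wait             # block until all background greps have finished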

I suggest you run this on a smaller dataset to compare the performance gain.

Upvotes: 2

Ed Morton

Reputation: 204721

You could try this instead:

awk '
BEGIN {
    while ( (getline line < "queryids.txt") > 0  ) {
        re = ( re=="" ? "" : re "|") line
    }
}
FNR==1 { close(out); out=FILENAME; sub(/\.[^.]+$/,"_goi&",out) }
$0 ~ re { print > out }
' *.csv

This uses a regexp rather than a string comparison; whether that matters and, if so, what we can do about it depends on the values in queryids.txt. In fact there may be a vastly faster and more robust way to do this depending on what your files contain, so if you edit your question to include some examples of your file contents we can be of more help.
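
For example (my own contrived illustration, not part of the original answer; the first transcript ID below is made up, and in your real data the last column is a UUID), $0 ~ re tests the whole record, so a query ID that happened to appear in any column would match, whereas an exact comparison against a single field would not:

# Hypothetical record: the query ID ENST00000401850 (from queryids.txt) sits in the
# last column rather than in target_id, yet the whole-line regex still matches it.
echo 'ENST00000999999.1,100,1.0E2,0,0,ENST00000401850' |
awk -v re='ENST00000401850' '$0 ~ re { print "matched:", $0 }'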

I see you have now posted some sample input and indeed we can do this much faster and more robustly by using a hash lookup:

awk '
BEGIN {
    FS="."
    while ( (getline line < "queryids.txt") > 0  ) {
        ids[line]
    }
}
FNR==1 { close(out); out=FILENAME; sub(/\.[^.]+$/,"_goi&",out) }
$1 in ids { print > out }
' *.csv
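
As a minimal check of what the FS="." setting buys you (just a sketch, using the first data line from your sample): with FS set to ".", $1 of each data line is everything before the first dot, i.e. the bare transcript ID without its version suffix, which is exactly the form the IDs take in queryids.txt, so the $1 in ids lookup matches them exactly.

# $1 is the versionless transcript ID, ready for an exact hash lookup.
echo 'ENST00000619216.1,68,2.65769E1,0.5,0.300188,00065a62-5e18-4223-a884-12fca053a109' |
awk -F. '{ print $1 }'
# prints: ENST00000619216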

Upvotes: 0
