infaak

Reputation: 31

How to speed up grep/awk command?

I am processing a text file (>300 GB) and splitting it into smaller text files (~1 GB each). I want to speed up my grep/awk commands.

I need to grep the lines that have a value in column b. Here are my two approaches:

# method 1:
awk -F',' '$2 ~ /a/ { print }' input

# method 2:
grep -e ".a" < input

Both approaches take about one minute per file. How can I speed this up?


Sample of input file:

a,b,c,d
1,4a337485,2,54
4,2a4645647,4,56
6,5a3489556,3,22
9,,3,12
10,0,34,45
24,4a83944,3,22
45,,435,34

Expected output file:

a,b,c,d
1,4a337485,2,54
4,2a4645647,4,56
6,5a3489556,3,22
24,4a83944,3,22
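
For the splitting step itself, one possible approach (not shown above) is GNU split, which can cut the file into ~1 GB pieces on line boundaries:

# split into ~1 GB chunks without breaking lines (GNU split);
# the "chunk_" prefix is just an example name, and only the
# first chunk keeps the a,b,c,d header line
split --line-bytes=1G --numeric-suffixes input chunk_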

Upvotes: 1

Views: 3695

Answers (3)

Ed Morton

Reputation: 203532

I suspect your real problem is that you're calling awk repeatedly (probably in a loop), once per set of values of $2 and generating an output file each time, e.g.:

awk -F, '$2==""' input > novals
awk -F, '$2!=""' input > yesvals
etc.

Don't do that as it's very inefficient since it's reading the whole file on every iteration. Do this instead:

awk -F, '{out=($2=="" ? "novals" : "yesvals")} {print > out}' input

That will create all of your output files with one call to awk. Once you get past about 15 output files, it would require GNU awk to manage the open file descriptors internally; otherwise you would need to add a close(out) when $2 changes and use >> instead of >:

awk -F, '$2!=prev{close(out); out=($2=="" ? "novals" : "yesvals"); prev=$2} {print >> out}' input

and that would be more efficient if you sorted your input file first (requires GNU sort for -s, the stable sort, if you care about preserving the input order of lines within each unique $2 value):

sort -t, -k2,2 -s
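
Put together, the pipeline could look like this (a sketch; the exact wiring of the two commands is not spelled out above):

# stable-sort on column 2, then split in one awk pass, closing each
# output file whenever the key changes
# (note: in this sketch the header line a,b,c,d gets sorted along with the data)
sort -t, -s -k2,2 input |
awk -F, '$2!=prev{close(out); out=($2=="" ? "novals" : "yesvals"); prev=$2} {print >> out}'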

Upvotes: 0

How to speed up grep/awk command?

Are you so sure that grep or awk is the culprit of your perceived slowness? Do you know about cut(1) or sed(1)? Have you benchmarked the time it takes to run wc(1) on your data? Most likely the textual I/O is what is taking most of the time.

Please benchmark several times, and use time(1) to measure your program.
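
For instance, a minimal comparison (assuming the file is named input, as in the question):

# baseline: a plain sequential read of the file
time wc -l input

# the two filtering approaches from the question
time awk -F',' '$2 ~ /a/' input > /dev/null
time grep -e ".a" input > /dev/null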

I have a high-end Debian desktop (with an AMD 2970WX, 64 GB of RAM, a 1 TB SSD system disk, and multi-terabyte 7200 RPM SATA data disks). Just running wc on a 25 GB file (some *.tar.xz archive) sitting on a hard disk takes more than 10 minutes (measured with time). wc does only very simple textual processing while reading that file sequentially, so it should run faster than grep or awk on the same data (but, to my surprise, it does not!):

wc /big/basile/backup.tar.xz  640.14s user 4.58s system 99% cpu 10:49.92 total

and (using grep on the same file to count occurrences of a)

grep -c a /big/basile/backup.tar.xz  38.30s user 7.60s system 33% cpu 2:17.06 total

general answer to your question:

Just write an equivalent program cleverly (using efficient data structures such as red-black trees with O(log n) lookups, hash tables, etc.) in C, C++, OCaml, or most other good languages and implementations. Or buy more RAM to increase your page cache. Or buy an SSD to hold your data. And repeat your benchmarks more than once (because of the page cache).

suggestion for your problem : use a relational database

It is likely that a plain 300 GB text file is not the best approach. Huge textual files are usually the wrong choice, especially once you need to process the same data several times. You would be better off pre-processing it somehow.

If you repeat the same grep search or awk execution on the same data file more than once, consider instead using sqlite (see also this answer) or even some other real relational database (e.g. PostgreSQL or some other good RDBMS) to store and then process your original data.

So a possible approach (if you have enough disk space) might be to write some program (in C, Python, OCaml, etc.) that is fed your original data and fills some sqlite database. Be sure to have clever database indexes, and take the time to design a good enough database schema, keeping database normalization in mind.
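
A minimal sketch of that idea with the sqlite3 command-line shell (the data.db and t names are only placeholders for illustration):

# import the CSV once; in csv mode the header row a,b,c,d becomes the column names
sqlite3 data.db <<'EOF'
.mode csv
.import input t
CREATE INDEX idx_t_b ON t(b);
EOF

# repeated searches then become SQL queries, e.g. rows whose column b contains "a"
sqlite3 -csv data.db "SELECT * FROM t WHERE b LIKE '%a%';"

Note that the leading-wildcard LIKE above still scans the table; the index on b mainly helps exact and prefix lookups, but the import and indexing cost is paid only once.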

Upvotes: 4

James Brown

Reputation: 37404

Use mawk, avoid regex and do:

$ mawk -F, '$2!=""' file
a,b,c,d
1,4a337485,2,54
4,2a4645647,4,56
6,5a3489556,3,22
10,0,34,45
24,4a83944,3,22

Let us know how long that took.

I did some tests with 10M records of your data; based on the results, use mawk and the regex:

GNU awk and regex:

$ time gawk -F, '$2~/a/' file > /dev/null

real    0m7.494s
user    0m7.440s
sys     0m0.052s

GNU awk and no regex:

$ time gawk -F, '$2!=""' file >/dev/null

real    0m9.330s
user    0m9.276s
sys     0m0.052s

mawk and no regex:

$ time mawk -F, '$2!=""' file >/dev/null

real    0m4.961s
user    0m4.904s
sys     0m0.060s

mawk and regex:

$ time mawk -F, '$2~/a/' file > /dev/null

real    0m3.672s
user    0m3.600s
sys     0m0.068s
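
For reference, one possible way to generate a similar 10M-record test file (this is only a guess at the setup; how the test data was actually produced is not shown above):

# build ~10M CSV records shaped like the sample, some with an empty column b
awk 'BEGIN{
  print "a,b,c,d"
  for (i = 1; i <= 10000000; i++) {
    if (i % 3 == 0)
      print i ",," i % 7 "," i % 100        # empty column b
    else
      print i ",4a" i "," i % 7 "," i % 100 # column b contains "a"
  }
}' > file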

Upvotes: 2
