user2548142

Reputation: 53

Grepping progressively through large file

I have several large data files (~100MB-1GB of text) and a sorted list of tens of thousands of timestamps that index data points of interest. The timestamp file looks like:

12345
15467
67256
182387
199364
...

And the data file looks like:

Line of text
12345 0.234 0.123 2.321
More text
Some unimportant data
14509 0.987 0.543 3.600
More text
15467 0.678 0.345 4.431

The data in the second file is all in order of timestamp. I want to grep through the second file using the timestamps of the first, printing the timestamp and the fourth data item to an output file. I've been using this:

grep -wf time.stamps data.file | awk '{print $1 "\t" $4 }'  >> output.file

This is taking on the order of a day to complete for each data file. The problem is that this command searches through the entire data file for every line in time.stamps, but I only need the search to pick up from the last data point. Is there any way to speed up this process?

Upvotes: 5

Views: 195

Answers (3)

user1666959

Reputation: 1855

grep has a little-used option, -f filename, which reads the patterns from filename and does the matching. It is likely to beat the awk solution, and your timestamps would not have to be sorted.
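Since the timestamps are literal strings rather than regular expressions, one further tweak worth trying (not mentioned in this answer, so verify it on your data) is adding -F for fixed-string matching to the command from the question:

grep -wFf time.stamps data.file | awk '{print $1 "\t" $4}' >> output.file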

Upvotes: 0

iruvar

Reputation: 23364

JS웃's awk solution is probably the way to go. If join is available and the first field of the irrelevant "data" is not numeric, you could exploit the fact that the files are in the same order and avoid a sorting step. This example uses bash process substitution on Linux:

join -o2.1,2.4 -1 1 -2 1 key.txt <(awk '$1 ~ /^[[:digit:]]+$/' data.txt)
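Assuming key.txt holds the timestamps and data.txt the data file from the question, the matched lines would come out roughly as:

12345 2.321
15467 4.431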

Upvotes: 1

jaypal singh

Reputation: 77085

You can do this entirely in awk:

awk 'NR==FNR{a[$1]++;next}($1 in a){print $1,$4}' timestampfile datafile
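To keep the tab-separated output and the file names from the question, a minimal variation of the same command (nothing else changed) would be:

awk 'NR==FNR{a[$1]++;next}($1 in a){print $1 "\t" $4}' time.stamps data.file >> output.file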

Upvotes: 4
