Reputation: 53
I have several large data files (~100MB-1GB of text) and a sorted list of tens of thousands of timestamps that index data points of interest. The timestamp file looks like:
12345
15467
67256
182387
199364
...
And the data file looks like:
Line of text
12345 0.234 0.123 2.321
More text
Some unimportant data
14509 0.987 0.543 3.600
More text
15467 0.678 0.345 4.431
The data in the second file is all in order of timestamp. I want to grep through the second file using the timestamps from the first, printing the timestamp and the fourth data item to an output file. I've been using this:
grep -wf time.stamps data.file | awk '{print $1 "\t" $4 }' >> output.file
This is taking on the order of a day to complete for each data file. The problem is that this command searches through the entire data file for every line in time.stamps, but I only need the search to pick up from the last data point. Is there any way to speed up this process?
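Conceptually, what I'm after is a single merge-style pass that walks both files in step. A rough, untested sketch of the idea (it assumes every timestamp in time.stamps actually appears in data.file):
awk 'BEGIN { i = 0 }
     NR==FNR { want[n++] = $1; next }   # first file: load the sorted timestamps
     i < n && $1 == want[i] {           # data lines arrive in the same order,
         print $1 "\t" $4               # so each line is compared only against
         i++                            # the current position in the list
     }' time.stamps data.file > output.file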
Upvotes: 5
Views: 195
Reputation: 1855
'grep' has a little-used option, -f filename, which reads the patterns from filename and does the matching. It is likely to beat the awk solution, and your timestamps would not have to be sorted.
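For example, adding -F to the command already shown in the question so the timestamps are matched as fixed strings rather than regular expressions (assuming they contain no characters that need regex treatment), which is often considerably faster with a large pattern file:
grep -Fwf time.stamps data.file | awk '{print $1 "\t" $4 }' >> output.file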
Upvotes: 0
Reputation: 23364
JS웃's awk solution is probably the way to go. If join is available and the first field of the irrelevant "data" is not numeric, you could exploit the fact that the files are in the same order and avoid a sorting step. This example uses bash process substitution on Linux:
join -o2.1,2.4 -1 1 -2 1 key.txt <(awk '$1 ~ /^[[:digit:]]+$/' data.txt)
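With the sample data file from the question, the awk filter inside the process substitution passes only the lines whose first field is purely numeric on to join:
12345 0.234 0.123 2.321
14509 0.987 0.543 3.600
15467 0.678 0.345 4.431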
Upvotes: 1
Reputation: 77085
You can do this entirely in awk…
awk 'NR==FNR{a[$1]++;next}($1 in a){print $1,$4}' timestampfile datafile
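For readability, here is the same one-liner written out with comments (functionally equivalent):
awk '
    NR==FNR {            # first file: the timestamp list
        a[$1]++          # remember each timestamp as an array key
        next             # skip the print rule while reading this file
    }
    ($1 in a) {          # second file: first field is a wanted timestamp
        print $1, $4     # print the timestamp and the fourth field
    }
' timestampfile datafile
Note that, like the one-liner, this prints the two fields separated by a space (awk's default output separator) rather than the tab used in the question's command.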
Upvotes: 4