Reputation: 107
I have several thousand log (.txt) files (neither their names nor their order matters, nor does the order of entries in the final output files). Each line consists of a Unix timestamp and a value, such as:
infile1.txt:
1361775157 a
1361775315 b
1379007707 c
1379014884 d
infile2.txt:
1360483293 e
1361384920 f
1372948120 g
1373201928 h
My goal is to split them into arbitrarily defined time intervals (e.g. in this case with 1360000000, 1370000000 and 1380000000 as the bounds), so that I get one output file per interval:
1360000000-1370000000.txt:
1361775157 a
1361775315 b
1360483293 e
1361384920 f
1370000000-1380000000.txt:
1379007707 c
1379014884 d
1372948120 g
1373201928 h
My current approach is to run a script once per time period (start and end as first and second argument). It loops over all input files, filters the entries that fall within that period, and appends them to a file:
#!/bin/bash
# $1 = start of the period (inclusive), $2 = end of the period (exclusive)
for i in *.txt; do
    awk -v t1="$1" -v t2="$2" '$1 >= t1 && $1 < t2' "$i" >> "elsewhere/$1-$2.txt"
done
However, this means that for each time period all files are read, which seems inefficient to me. Is there a way to read each file only once, and append each line to a file corresponding to its time period?
Upvotes: 1
Views: 514
Reputation: 203403
I'd use an approach like this:
$ cat tst.awk
{
    bucket = int($1/inc)
    print $0 " > " ( (inc*bucket) "-" (inc*(bucket+1)-1) ".txt" )
}
$ awk -v inc='10000000' -f tst.awk file1 file2
1361775157 a > 1360000000-1369999999.txt
1361775315 b > 1360000000-1369999999.txt
1379007707 c > 1370000000-1379999999.txt
1379014884 d > 1370000000-1379999999.txt
1360483293 e > 1360000000-1369999999.txt
1361384920 f > 1360000000-1369999999.txt
1372948120 g > 1370000000-1379999999.txt
1373201928 h > 1370000000-1379999999.txt
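As a quick sanity check of the bucket arithmetic, here is a one-off sketch that computes the output file name for the first sample timestamp. The variable names ts and name are hypothetical and only there for illustration:

```shell
#!/bin/sh
# Hypothetical one-off check: which file would timestamp 1361775157 land in
# when inc is 10000000?
name=$(awk -v inc=10000000 -v ts=1361775157 'BEGIN {
    bucket = int(ts / inc)    # 1361775157 / 10000000 = 136.17..., truncated to 136
    print (inc * bucket) "-" (inc * (bucket + 1) - 1) ".txt"
}')
echo "$name"
```

It should print 1360000000-1369999999.txt, the same name shown for that line in the output above.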
If you're using GNU awk (which handles closing/reopening files for you when needed) then just change $0 " > " to > in tst.awk when done testing, otherwise make it:
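{
    bucket = int($1/inc)
    if ( bucket != prev ) {
        # bucket changed: close the previous output file before switching
        close(out)
        out = (inc*bucket) "-" (inc*(bucket+1)-1) ".txt"
        prev = bucket
    }
    print >> out    # append, since a bucket's file may be reopened later
}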
to work in any awk.
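The buckets above all have the same width. If the intervals really are arbitrary, as in the question, one way to still read each file only once is to pass the bounds in and look up the interval for each line. This is only a sketch: the bounds variable and the scratch directory are my own assumptions, and it assumes the bounds are sorted in ascending order and few enough that awk can keep all output files open at once. Lines that fall outside every interval are silently skipped.

```shell
#!/bin/sh
# Demo inputs copied from the question, written to a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' '1361775157 a' '1361775315 b' '1379007707 c' '1379014884 d' > infile1.txt
printf '%s\n' '1360483293 e' '1361384920 f' '1372948120 g' '1373201928 h' > infile2.txt

# "bounds" is a hypothetical variable: sorted interval edges, space-separated.
awk -v bounds='1360000000 1370000000 1380000000' '
BEGIN { n = split(bounds, b, " ") }
{
    # Find the interval [b[i], b[i+1]) that contains the timestamp in $1.
    for (i = 1; i < n; i++) {
        if ($1 + 0 >= b[i] + 0 && $1 + 0 < b[i + 1] + 0) {
            print >> (b[i] "-" b[i + 1] ".txt")
            break
        }
    }
}' infile1.txt infile2.txt
```

With the sample data this should leave 1360000000-1370000000.txt holding the a, b, e and f lines and 1370000000-1380000000.txt holding the c, d, g and h lines. The input files are named explicitly here so that the generated .txt files are not picked up as input on a second run; writing the output to a separate directory, as the question's script does, would work just as well.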
Upvotes: 5