split file into time period files based on time unix stamp

Question

I have several thousand log (.txt) files. Their names and order do not matter, and neither does the order of entries in the final output files. Each line consists of a Unix timestamp and a value, such as:

infile1.txt:
1361775157 a
1361775315 b
1379007707 c
1379014884 d

infile2.txt:
1360483293 e
1361384920 f
1372948120 g
1373201928 h

My goal is to split them into arbitrarily defined time intervals (e.g. in this case with 1360000000, 1370000000 and 1380000000 as the bounds), so that I get one output file per interval:

1360000000-1370000000.txt:
1361775157 a
1361775315 b
1360483293 e
1361384920 f

1370000000-1380000000.txt:
1379007707 c
1379014884 d
1372948120 g
1373201928 h

My current approach is to run a script once per time period, with the start and end of the period as its first and second arguments. It filters the matching entries out of every input file and appends them to that period's output file:

#!/bin/bash
# Usage: script START END   (keeps lines with START <= timestamp < END)

for i in *.txt; do
    awk -v t1="$1" -v t2="$2" '$1 >= t1 && $1 < t2' "$i" >> "elsewhere/$1-$2.txt"
done

However, this means that for each time period all files are read, which seems inefficient to me. Is there a way to read each file only once, and append each line to a file corresponding to its time period?

Ed Morton · Accepted Answer

I'd use an approach like this:

$ cat tst.awk
{
    bucket = int($1/inc)
    print $0 " > " ( (inc*bucket) "-" (inc*(bucket+1)-1) ".txt" )
}

$ awk -v inc='10000000' -f tst.awk file1 file2
1361775157 a > 1360000000-1369999999.txt
1361775315 b > 1360000000-1369999999.txt
1379007707 c > 1370000000-1379999999.txt
1379014884 d > 1370000000-1379999999.txt
1360483293 e > 1360000000-1369999999.txt
1361384920 f > 1360000000-1369999999.txt
1372948120 g > 1370000000-1379999999.txt
1373201928 h > 1370000000-1379999999.txt
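
The file names in that output come from integer division of the timestamp by inc. As a small worked example (a sketch only; the variable name ts is just for illustration), this computes the bucket and its bounds for the first timestamp in the question:

```shell
# Worked example of the bucket arithmetic for one timestamp from the question.
result=$(awk -v inc=10000000 'BEGIN {
    ts = 1361775157
    bucket = int(ts / inc)     # drop everything below the 10,000,000 place
    # bucket number, first timestamp in the bucket, last timestamp in the bucket
    print bucket, inc * bucket, inc * (bucket + 1) - 1
}')
echo "$result"    # prints: 136 1360000000 1369999999
```

Note that the upper bound in the file name is inclusive (1369999999), unlike the exclusive upper bound (1370000000) used in the file names in the question.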

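If you're using GNU awk (which handles closing and reopening output files for you when too many would otherwise be open at once), then once you're done testing just change print $0 " > " ( ... ) to print > ( ... ), so that each line is written to the file instead of being printed alongside its name. Otherwise make it: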

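{
    bucket = int($1/inc)
    # Switch output files only when the bucket changes, so that at most
    # one output file is open at any time.
    if ( bucket != prev ) {
        close(out)
        out = (inc*bucket) "-" (inc*(bucket+1)-1) ".txt"
        prev = bucket
    }
    print >> out    # append, because a bucket can be reopened later
}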

to work in any awk.
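
The scripts above assume fixed-width buckets, whereas the question asks for arbitrarily defined bounds and file names of the form start-end.txt with an exclusive upper bound. A minimal sketch of the same single-pass idea for that case follows. It is not part of the answer: the variable names bounds and outdir, and the directories demo_in and demo_out, are made up for the illustration, and the bounds are assumed to be given in ascending order.

```shell
#!/bin/sh
# Sketch only: demo_in, demo_out, bounds and outdir are illustrative names.
# Recreate the sample input from the question in a scratch directory.
rm -rf demo_out
mkdir -p demo_in demo_out
printf '%s\n' '1361775157 a' '1361775315 b' '1379007707 c' '1379014884 d' > demo_in/infile1.txt
printf '%s\n' '1360483293 e' '1361384920 f' '1372948120 g' '1373201928 h' > demo_in/infile2.txt

awk -v bounds='1360000000 1370000000 1380000000' -v outdir='demo_out' '
BEGIN { n = split(bounds, b, " ") }    # b[1..n]: interval bounds, ascending
{
    ts = $1 + 0
    # Find the interval [b[i], b[i+1]) that contains this timestamp.
    # Lines that fall outside every interval are silently skipped.
    for (i = 1; i < n; i++) {
        if (ts >= b[i] + 0 && ts < b[i+1] + 0) {
            out = outdir "/" b[i] "-" b[i+1] ".txt"
            if (out != prev) {         # keep only one output file open
                if (prev != "") close(prev)
                prev = out
            }
            print >> out               # append, since a file may be reopened
            break
        }
    }
}' demo_in/*.txt
```

Each input file is read exactly once. The linear scan over the bounds is fine for a handful of intervals; with many intervals a binary search over b would be the natural replacement. Because the script appends, the output directory should be empty before a fresh run.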
