Lisael

Reputation: 179

Efficiently format dates from a log file with POSIX tools

Context: in a POSIX-only project, I need to reformat a log file, changing ISO dates to UNIX timestamps. The log file looks like:

2022-01-11T18:22:46+0100    call_ring   +3366
2022-01-11T19:36:54+0100    call_ring   +33611
2022-01-12T07:49:15+0100    call_ring   +33616
2022-01-12T08:57:20+0100    call_ring   +33621
2022-01-12T09:42:56+0100    call_ring   +33648
2022-01-12T12:20:48+0100    call_ring   +3364
2022-01-12T12:28:01+0100    call_ring   +3364
2022-01-12T13:16:31+0100    call_ring   +33628

For now I use

awk -F'\t' '{cmd="date \"+%s\t"$3"\" -d "$1;system(cmd)}' logs.tsv

But on large files it's unbearably slow (more than 50s for ~20k lines). I believe the system() function forking 20k processes is the cause.

Is there a way to be faster in a POSIX shell script? I'd also like to avoid Python or Perl for this.

(Note that date -d is not POSIX; we plan to write an in-house binary for this, so date -d will be accepted in answers.)

Upvotes: 0

Views: 645

Answers (3)

dave_thompson_085

Reputation: 38781

<fart age=old mode=grumpy> kids today! </>

$ cat <<'EOF' >calctime.awk
# this simplified formula only works 2000-03 to 2100-02 which I assume 
# should cover any logfile timestamps of interest today or the near future;
# it can fairly easily be extended as far back or forward as the Gregorian 
# calendar was/remains in use
# oops! offset was backward (the only thing I couldn't easily test at first)
{ split($1,a,/[-+T:]/);                  # a[1..7] = Y, M, D, h, m, s, zone offset
  t = a[2]<=2;                           # 1 for Jan/Feb, which count in the previous "March year"
  t = int((a[1]-2000-t)*365.25) + int((a[2]-3+t*12)*30.6+.5) + a[3]-1;   # days since 2000-03-01
  t = t*86400 + a[4]*3600 + a[5]*60 + a[6] + 951868800;   # + time of day + epoch of 2000-03-01T00:00:00Z
  t += (substr($1,20,1)=="+"?-1:+1)*(substr(a[7],1,2)*3600+substr(a[7],3,2)*60);   # apply the zone offset
  print t,$3; }
EOF
$ awk -f calctime.awk -v OFS='\t' infile >outfile # or pipe to something 

Time less than 0.1 second.

Added: You should be able to see how this works if you take the 12 months of our (Western) calendar starting with March, as the Romans did: the month lengths then follow a repeating pattern except for the last (February): 31-30-31-30-31 31-30-31-30-31 31-(28/29). As a result, the number of months from the March (numbered zero) 'under' a given month to that month (numbering Jan and Feb as months 10 and 11 of the prior year), times 30.6 and rounded to the nearest integer (implemented in awk by adding .5 and truncating), gives the number of days for that partial year. Add 365.25 times the number of years between 2000-03 and the March used, plus the day of month counting from zero, and you have the total days from 2000-03-01 to the desired date.

Multiplying the number of days by 86400 seconds per day and adding the desired time of day gives seconds since 2000-03-01Tmidnight. Add the number of seconds from 1970-01-01Tmidnight (the Unix/POSIX epoch) to 2000-03-01Tmidnight, adjust by the timezone offset, since Unix/POSIX time is 'UTC' (really tweaked UTC without leap seconds, essentially mimicking the prior GMT, but in both cases without timezone or daylight/summer adjustment), and done.
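To make the arithmetic concrete, here is the first sample line traced by hand (the trace and the one-liner below are mine, not part of the original answer):

# 2022-01-11T18:22:46+0100  ->  Y=2022 M=01 D=11 h=18 m=22 s=46 offset=0100
# t    = 1                                      (January, so it belongs to the prior March-year)
# days = int(21*365.25) + int(10*30.6+.5) + 10  = 7670 + 306 + 10 = 7986
# t    = 7986*86400 + 18*3600 + 22*60 + 46 + 951868800 = 1641925366
# t   -= 3600  (the +0100 offset)               = 1641921766
$ printf '2022-01-11T18:22:46+0100\tcall_ring\t+3366\n' | awk -f calctime.awk -v OFS='\t'
1641921766      +3366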

I should note this is based on FORTRAN code I saw widely used around 1970, except that it had the data fields (year, month, day, etc.) already in numeric form -- FORTRAN didn't have a 'character' type you could parse until 1977, but it had formatted READ and WRITE since its creation -- and it was routinely written without explanation or claim of authorship, apparently on the grounds that it was 'just obvious'. Of course it used different epochs, commonly 1900, because almost no one had heard of Unix then, but that's just a constant. And FORTRAN didn't (and its modern form Fortran still doesn't) have the BCPL-C-awk-perl style mapping of booleans to integer 0/1, so you used truncating integer division like (1-(M+9)/12) or ((14-M)/12); you could (also) do that in awk, but it needs explicit truncation with int().
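For completeness, the truncating-division form of the January/February test would look like this in awk (my sketch, not part of the original script):

# equivalent to: t = a[2]<=2; int() supplies the truncation that
# FORTRAN's integer division performed implicitly
t = int((14 - a[2]) / 12);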

Upvotes: 3

Renaud Pacalet

Reputation: 28985

Just in case you can relax your POSIX-only requirement for a speed-up factor of 150+... With the GNU awk built-in mktime and gensub functions (tested with a 20k-line input):

$ time awk '{utc = gensub(/.*([+-][0-9]+)/,"\\1",1,$1);
             gsub(/-|T|:|\+[0-9]+/," ",$1);
             $1 = mktime($1, utc); print}' logs.tsv > tmp.tsv

real    0m0.290s
user    0m0.280s
sys     0m0.005s

$ head tmp.tsv
1641925366 call_ring +3366
1641929814 call_ring +33611
...

gensub extracts the UTC offset (e.g. +0100) from the first field; gsub then reformats the first field as the space-separated string required by mktime (e.g. 2022 01 11 18 22 46).
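To illustrate (my own trace, assuming a GNU awk recent enough to accept mktime's second, UTC-flag argument), the first sample line goes through the two substitutions like this:

$ printf '2022-01-11T18:22:46+0100\tcall_ring\t+3366\n' |
  gawk '{utc = gensub(/.*([+-][0-9]+)/,"\\1",1,$1);    # utc = "+0100"
         gsub(/-|T|:|\+[0-9]+/," ",$1);                # $1  = "2022 01 11 18 22 46"
         $1 = mktime($1, utc); print}'
1641925366 call_ring +3366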

This version attempts to determine whether daylight saving time is in effect for the specified time. Replace mktime($1, utc) by:

  • mktime($1 " 1", utc) if you want awk to assume daylight saving time,
  • mktime($1 " 0", utc) if you want awk to assume standard time.

Upvotes: 0

tshiono

Reputation: 22012

This may not directly answer your question, but I benchmarked several solutions on a 20,000-line input generated in the same format as the posted log file.
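(The generator itself isn't shown in the answer; something along these lines, with sample.tsv holding the 8 posted lines, would reproduce such an input.)

#!/bin/sh
# hypothetical test-data generator: repeat the 8 sample lines 2500 times
# to obtain a 20,000-line logs.tsv
i=0
while [ "$i" -lt 2500 ]; do
    cat sample.tsv
    i=$(($i+1))
done > logs.tsv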

OP's awk solution

#!/bin/sh

awk -F'\t' '{cmd="date \"+%s\t"$3"\" -d "$1;system(cmd)}' logs.tsv > /dev/null

took 59.5 seconds to complete.

POSIX sh

#!/bin/sh

IFS=$(printf "\t")

while read -r a b c; do
    echo $(date +%s -d "$a")"$IFS$c"
done < logs.tsv > /dev/null

took 35.6 seconds.

date command only

#!/bin/sh

i=0
while [ "$i" -lt 20000 ]; do
    date +%s -d "2022-01-11T18:22:46+0100"
    i=$(($i+1))
done > /dev/null

took 33.4 seconds.

It seems that most of the execution time is consumed by the date command. If you plan to write your own substitute for date -d, it will be significantly faster.

Upvotes: 0
