Rey Leonard Amorato
Rey Leonard Amorato

Reputation: 123

AWK: How do I compute the total hours from a CSV file

I have this file which reads

001,Time-in,2017-06-25 08:04:42,08:00:00,
001,Time-out,2017-06-25 17:04:42,17:00:00,
001,Time-in,2017-06-25 18:04:42,18:00:00,
001,Time-out,2017-06-25 22:04:42,22:00:00,
...

where field 1 is the ID number; 2 is the action performed; 3 is the exact timestamp; and 4 is the rounded off time.

I would like to calculate the total hours per ID based on field 4. I know I can use the formula

((Out2+Out1)-(In2+In1))
or
((Out1-In1)+(Out2-In2))

to get the total hours, but I'm quite stuck as to how I should begin.

I would like to get this output:

001,13
002,12
..,..
..,..

Where field 1 is the ID and 2 will be the total hours computed. Also, please note that the real file would be jumbled and not sorted like the example above. If any of the required entries are missing, i.e one time-out missing etc., it should just print that it skipped that particular ID.

Any thoughts regarding this would be extremely helpful. Thanks.

Upvotes: 0

Views: 90

Answers (2)

Ed Morton
Ed Morton

Reputation: 203665

$ cat tst.awk
BEGIN { FS="[-, :]" }
$3=="in"  { tin[$1]  += $10 }
$3=="out" { tout[$1] += $10 }
END {
    for (key in tin) {
        print key, tout[key] - tin[key]
    }
}

$ awk -f tst.awk file
001 13

Upvotes: 1

(No error handling or error recovery below.)

I'd probably write a function to return epoch time, given an ISO timestamp. Epoch time makes the arithmetic easy. But it uses the full timestamp, not your rounded values.

function epoch_time(ts) {

    gsub("-", " ", ts)
    gsub(":", " ", ts)

    return mktime(ts)
}

Assuming we can rely on the format on the input file--a BIG assumption--you can use pretty simple code to select and process each line of the input file.

$2 == "Time-in" {
    timein = epoch_time($3)
}

$2 == "Time-out" {
    timeout = epoch_time($3)
    # Add the result to any existing value for this id number.
    # Express in hours.
    output[$1] += (((timeout - timein) / 60) / 60)
}

END {
    for (key in output) print key, output[key] 
}

So the full code would look like this,

# timestamp.awk
#
$2 == "Time-in" {
    timein = epoch_time($3)
}

$2 == "Time-out" {
    timeout = epoch_time($3)
    # Add the result to any existing value for this id number.
    # Express in hours.
    output[$1] += (((timeout - timein) / 60) / 60)
}

END {
    for (key in output) print key, output[key] 
}

function epoch_time(ts) {

    gsub("-", " ", ts)
    gsub(":", " ", ts)

    return mktime(ts)
}

. . . and I'd call it like this.

$ awk -F, -f timestamp.awk datafilename

For this data, I get the output farther below.

001,Time-in,2017-06-25 08:04:42,08:00:00,
001,Time-out,2017-06-25 17:04:42,17:00:00,
001,Time-in,2017-06-25 18:04:42,18:00:00,
001,Time-out,2017-06-25 22:04:42,22:00:00,

002,Time-in,2017-06-25 09:04:42,08:00:00,
002,Time-out,2017-06-25 17:04:42,17:00:00,
002,Time-in,2017-06-25 19:04:42,18:00:00,
002,Time-out,2017-06-25 22:04:42,22:00:00,
$ awk -F, -f timestamp.awk datafilename
002 11
001 13

Upvotes: 0

Related Questions