joshua1609
joshua1609

Reputation: 91

Awk - sort data inside program - custom datetime stamp

I am using gawk to process a large text file. I now have a need to sort a column in the file. Is it possible to use gawk to sort on a column that contains a datetime stamp?

The datetime stamp shown in column 2 is "m/d/yy, H:M AM/PM" in double quotes.

1,”2/1/22, 9:25 AM",
2,"2/8/22, 11:03 AM",
3,"2/7/22, 8:57 AM",
4,"2/18/22, 5:50 PM",
5,"2/24/22, 9:11 AM",
6,"2/7/22, 1:59 PM",
7,"2/9/22, 7:34 AM",
8,"2/16/22, 7:14 AM",
9,"1/31/22, 5:53 PM",

Can I sort that data in gawk, so I can insert a third column and have my table as

9,"1/31/22, 5:53 PM",1
1,”2/1/22, 9:25 AM",2
3,"2/7/22, 8:57 AM",3
6,"2/7/22, 1:59 PM",4
2,"2/8/22, 11:03 AM",5
7,"2/9/22, 7:34 AM",6
8,"2/16/22, 7:14 AM",7
4,"2/18/22, 5:50 PM",8
5,"2/24/22, 9:11 AM",9

Is this possible in gawk?

Upvotes: 0

Views: 48

Answers (1)

tshiono
tshiono

Reputation: 22032

Assuming the size of the file does not exceed the available memory, would you please try:

gawk '
    # custom function to compare the timestamps
    function date_sort(i1, v1, i2, v2,     a1, a2, s1, s2) {
        split(v1, a1, /[,"/: ]+/)
        split(v2, a2, /[,"/: ]+/)
        # generate strings of "yy-mm-dd (AM|PM) hh:mm"
        s1 = sprintf("%02d-%02d-%02d %s %02d:%02d",
                a1[4], a1[2], a1[3], a1[7], a1[5], a1[6])
        s2 = sprintf("%02d-%02d-%02d %s %02d:%02d",
                a2[4], a2[2], a2[3], a2[7], a2[5], a2[6])
        # return the comparison result
        if (s1 < s2) {
            return -1
        } else if (s1 > s2) {
            return 1
        } else {
            return 0
        }
    }

    # read the file and store lines in an array "data"
    {data[NR] = $0}
    END {
        # sort "data" using the function "date_sort" and store the result in "sorted"
        n = asort(data, sorted, "date_sort")
        for (i = 1; i <= n; i++) {
            print sorted[i] i
        }
    }
' input_file

Output:

9,"1/31/22, 5:53 PM",1
1,"2/1/22, 9:25 AM",2
3,"2/7/22, 8:57 AM",3
6,"2/7/22, 1:59 PM",4
2,"2/8/22, 11:03 AM",5
7,"2/9/22, 7:34 AM",6
8,"2/16/22, 7:14 AM",7
4,"2/18/22, 5:50 PM",8
5,"2/24/22, 9:11 AM",9

The 3rd argument "date_sort" to the asort function is a custom function name to compare the list to sort. The function should be designed to return a negative value, a positive value, or zero depending on the comparison of passed two elements. Four arguments are implicitly passed to the function during the sort operation: index of the element of the 1st list, value of the element of the 1st list, index of the element of the 2nd list, value of the element of the 2nd list, in order. I have named them i1, v1, i2, v2, by convention.

The remaining variables a1, a2, s1, s2 are the local variables. It is a feature of awk that the variable list which is longer than the number of arguments passed by the caller is used to declare local variable names to limit the scope of the variables within the function.

Upvotes: 1

Related Questions