Reputation: 91
I am using gawk to process a large text file. I now have a need to sort a column in the file. Is it possible to use gawk to sort on a column that contains a datetime stamp?
The datetime stamp shown in column 2 is "m/d/yy, H:M AM/PM
" in double quotes.
1,”2/1/22, 9:25 AM",
2,"2/8/22, 11:03 AM",
3,"2/7/22, 8:57 AM",
4,"2/18/22, 5:50 PM",
5,"2/24/22, 9:11 AM",
6,"2/7/22, 1:59 PM",
7,"2/9/22, 7:34 AM",
8,"2/16/22, 7:14 AM",
9,"1/31/22, 5:53 PM",
Can I sort that data in gawk, so I can insert a third column and have my table as
9,"1/31/22, 5:53 PM",1
1,”2/1/22, 9:25 AM",2
3,"2/7/22, 8:57 AM",3
6,"2/7/22, 1:59 PM",4
2,"2/8/22, 11:03 AM",5
7,"2/9/22, 7:34 AM",6
8,"2/16/22, 7:14 AM",7
4,"2/18/22, 5:50 PM",8
5,"2/24/22, 9:11 AM",9
Is this possible in gawk?
Upvotes: 0
Views: 48
Reputation: 22032
Assuming the size of the file does not exceed the available memory, would you please try:
gawk '
# custom function to compare the timestamps
function date_sort(i1, v1, i2, v2, a1, a2, s1, s2) {
split(v1, a1, /[,"/: ]+/)
split(v2, a2, /[,"/: ]+/)
# generate strings of "yy-mm-dd (AM|PM) hh:mm"
s1 = sprintf("%02d-%02d-%02d %s %02d:%02d",
a1[4], a1[2], a1[3], a1[7], a1[5], a1[6])
s2 = sprintf("%02d-%02d-%02d %s %02d:%02d",
a2[4], a2[2], a2[3], a2[7], a2[5], a2[6])
# return the comparison result
if (s1 < s2) {
return -1
} else if (s1 > s2) {
return 1
} else {
return 0
}
}
# read the file and store lines in an array "data"
{data[NR] = $0}
END {
# sort "data" using the function "date_sort" and store the result in "sorted"
n = asort(data, sorted, "date_sort")
for (i = 1; i <= n; i++) {
print sorted[i] i
}
}
' input_file
Output:
9,"1/31/22, 5:53 PM",1
1,"2/1/22, 9:25 AM",2
3,"2/7/22, 8:57 AM",3
6,"2/7/22, 1:59 PM",4
2,"2/8/22, 11:03 AM",5
7,"2/9/22, 7:34 AM",6
8,"2/16/22, 7:14 AM",7
4,"2/18/22, 5:50 PM",8
5,"2/24/22, 9:11 AM",9
The 3rd argument "date_sort"
to the asort
function is a custom function name
to compare the list to sort. The function should be designed to return a negative value, a positive value, or zero depending on the
comparison of passed two elements. Four arguments are implicitly passed to the function during the sort operation:
index of the element of the 1st list, value of the element of the 1st list,
index of the element of the 2nd list, value of the element of the 2nd list, in order.
I have named them i1
, v1
, i2
, v2
, by convention.
The remaining variables a1
, a2
, s1
, s2
are the local variables.
It is a feature of awk
that the variable list which is longer than the number
of arguments passed by the caller is used to declare local variable names to limit
the scope of the variables within the function.
Upvotes: 1