Reputation: 1
I have a number of large, space-delimited WhatsApp chat logs whose timestamps I need to convert from 12-hour format to 24-hour format.
About 95% of the content of these files follows the same line format:
MM/DD/YY,|HH:MM|XM|-|participant:|chatText
However, there are a few instances spread throughout these chat logs that do not maintain the standard line format as shown above.
Here is a sample of the logs:
5/30/22, 9:50 AM - person2: Good morning
5/30/22, 11:35 AM - person1: Hi, how are you?
5/30/22, 11:47 AM - person2: I am well
Transfer number: 3778324
Completed:
5/30/22, 12:55 PM - person1: https://mylink.com
5/30/22, 12:59 PM - person2: <Media omitted>
5/30/22, 9:46 PM - person1: thanks
Here are the requirements:
This is a sample of what it should look like after changes:
5/30/22, 09:50 - person2: Good morning
5/30/22, 11:35 - person1: Hi, how are you?
5/30/22, 11:47 - person2: I am well
Transfer number: 3778324
Completed:
5/30/22, 12:55 - person1: https://mylink.com
5/30/22, 12:59 - person2: <Media omitted>
5/30/22, 21:46 - person1: thanks
This is what I've been able to come up with so far, but I can't figure out how to get beyond the HH position, nor do I have any clue how I will avoid making changes to the non-standard formatted lines:
echo "5/30/22, 9:46 PM - person1: thanks"\ |awk -F' ' 'BEGIN{OFS=" "}{("date --date=\""$2 $3"\" +%H:$M") |getline $2;print }'
Any help would be greatly appreciated!
Upvotes: 0
Views: 297
Reputation: 203129
Using any POSIX awk:
$ cat tst.awk
$1 ~ "^([0-9]{1,2}/){2}[0-9]{2},$" {
    split($2,t,":")
    if ( ($3 == "PM") && (t[1] < 12) ) {
        t[1] += 12
    }
    else if ( ($3 == "AM") && (t[1] == 12) ) {
        t[1] = 0
    }
    time = sprintf(" %02d:%02d ", t[1], t[2])
    sub(/ [0-9]{1,2}:[0-9]{2} [AP]M /,time)
}
{ print }
$ awk -f tst.awk file
5/30/22, 09:50 - person2: Good morning
5/30/22, 11:35 - person1: Hi, how are you?
5/30/22, 11:47 - person2: I am well
Transfer number: 3778324
Completed:
5/30/22, 12:55 - person1: https://mylink.com
5/30/22, 12:59 - person2: <Media omitted>
5/30/22, 21:46 - person1: thanks
The above uses sub() to change $0 instead of directly changing $2 and $3, so that it will not change any white space on the lines that start with a timestamp (tabs and/or chains of blanks would be converted to single blanks if it changed $2 or $3 directly), e.g. changing $0 with the above script:
$ cat file1
5/30/22, 9:50 AM - person2: Good     morning
$ awk -f tst.awk file1
5/30/22, 09:50 - person2: Good     morning
vs. if it changed $2 directly (note the change in white space between Good and morning):
$ cat tst.awk
$1 ~ "^([0-9]{1,2}/){2}[0-9]{2},$" {
    split($2,t,":")
    if ( ($3 == "PM") && (t[1] < 12) ) {
        t[1] += 12
    }
    else if ( ($3 == "AM") && (t[1] == 12) ) {
        t[1] = 0
    }
    $2 = sprintf("%02d:%02d", t[1], t[2])
    sub(/ [AP]M /," ")
}
{ print }
$ awk -f tst.awk file1
5/30/22, 09:50 - person2: Good morning
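The sample log has no timestamps in the 12 o'clock hour before noon, so the two hour adjustments are worth checking on their own: 12 AM should become 00, 12 PM should stay 12, and 1 PM should become 13. The following is only a sketch of such a check. The wrapper name to24h and the three input lines are made up for illustration, and the regexps are written without {n,m} intervals on the assumption that some awk builds do not support them; otherwise the logic is that of the first tst.awk.

```shell
# to24h: hypothetical wrapper around the sub()-based script from this answer.
# Intervals such as {1,2} are spelled out as [0-9][0-9]? (a portability assumption).
to24h() {
    awk '
    $1 ~ /^[0-9][0-9]?\/[0-9][0-9]?\/[0-9][0-9],$/ {
        split($2, t, ":")
        if ( ($3 == "PM") && (t[1] < 12) ) {
            t[1] += 12      # 1 PM .. 11 PM -> 13 .. 23
        }
        else if ( ($3 == "AM") && (t[1] == 12) ) {
            t[1] = 0        # 12 AM is midnight -> 00
        }
        time = sprintf(" %02d:%02d ", t[1], t[2])
        sub(/ [0-9][0-9]?:[0-9][0-9] [AP]M /, time)
    }
    { print }
    '
}

# Made-up input lines covering the edge cases.
printf '%s\n' \
    '5/31/22, 12:05 AM - person1: midnight' \
    '5/31/22, 12:05 PM - person1: noon' \
    '5/31/22, 1:05 PM - person1: afternoon' | to24h
```

This should print the three lines with the times 00:05, 12:05 and 13:05 and no AM/PM marker.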
Upvotes: 2
Reputation: 28920
Just for fun, and because you tagged sed, here is a solution with GNU sed and date. But don't use this on large files: it would be far slower than the other, excellent awk solutions, because for each line to modify it executes one date command through the shell.
$ sed -E 'h;s!^(\S+),(\s+\S+\s+[AP]M\>).*!date -d "\1 \2" +"%D, %R"!e;T;G;s!\n(\s*\S+){3}!!' file.log
05/30/22, 09:50 - person2: Good morning
05/30/22, 11:35 - person1: Hi, how are you?
05/30/22, 11:47 - person2: I am well
Transfer number: 3778324
Completed:
05/30/22, 12:55 - person1: https://mylink.com
05/30/22, 12:59 - person2: <Media omitted>
05/30/22, 21:46 - person1: thanks
Explanations: the e flag of the substitute command executes the content of the pattern space with the shell and replaces the pattern space with the output.
We first copy the input line to the hold space (h) so that we can later extract the trailing part.
If the line in the pattern space is DATE, HOUR [AP]M <SOMETHING>, we replace it with date -d "DATE HOUR [AP]M" +"%D, %R", execute that with the shell, and replace the pattern space with the output, thanks to the e flag. Note that %D prints the date as MM/DD/YY, so a single-digit month gains a leading zero (5/30/22 becomes 05/30/22).
If the line was a "non-standard formatted line", there has been no substitution, so we print it and move on to the next line (T).
Otherwise we append a newline and the hold space to the pattern space (G), which becomes:
DATE, NEWHOUR-newline-DATE, OLDHOUR [AP]M <SOMETHING>
We delete newline-DATE, OLDHOUR [AP]M and print.
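To see the steps separately, it can help to run the first substitution alone and then the full command on a single line. This is only a sketch and assumes GNU sed (for the e flag, \S and \s) and GNU date (for -d); the function names show_date_step and full_convert are made up for illustration.

```shell
# Requires GNU sed and GNU date; both function names are hypothetical.

# Step 1 only: build the date command and let the e flag run it.
# The rest of the line is discarded at this point.
show_date_step() {
    sed -E 's!^(\S+),(\s+\S+\s+[AP]M\>).*!date -d "\1 \2" +"%D, %R"!e'
}

# Full command from the answer: h saves the line, T skips untouched lines,
# G appends the saved line, and the last s removes the old date, hour and AM/PM.
full_convert() {
    sed -E 'h;s!^(\S+),(\s+\S+\s+[AP]M\>).*!date -d "\1 \2" +"%D, %R"!e;T;G;s!\n(\s*\S+){3}!!'
}

line='5/30/22, 9:46 PM - person1: thanks'
printf '%s\n' "$line" | show_date_step
printf '%s\n' "$line" | full_convert
```

The first call should print only 05/30/22, 21:46, and the second the complete converted line, 05/30/22, 21:46 - person1: thanks.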
Upvotes: 0
Reputation: 36340
I would use GNU AWK for this task in the following way. Let the content of file.txt be
5/30/22, 9:50 AM - person2: Good morning
5/30/22, 11:35 AM - person1: Hi, how are you?
5/30/22, 11:47 AM - person2: I am well
Transfer number: 3778324
Completed:
5/30/22, 12:55 PM - person1: https://mylink.com
5/30/22, 12:59 PM - person2: <Media omitted>
5/30/22, 9:46 PM - person1: thanks
then
awk '$3~/^[AP]M$/{split($2,arr,":");if($3=="PM"&&arr[1]<12){arr[1]+=12};$2=sprintf("%02d:%02d",arr[1],arr[2])}{print}' file.txt
gives output
5/30/22, 09:50 AM - person2: Good morning
5/30/22, 11:35 AM - person1: Hi, how are you?
5/30/22, 11:47 AM - person2: I am well
Transfer number: 3778324
Completed:
5/30/22, 12:55 PM - person1: https://mylink.com
5/30/22, 12:59 PM - person2: <Media omitted>
5/30/22, 21:46 PM - person1: thanks
Explanation: for lines where the 3rd field is AM or PM, split the 2nd field at the : character and put the result into the array arr. If the 3rd field is PM and the 1st element of the array (i.e. the hour) is less than 12, increase it by 12. Then set the 2nd field to HH:MM, where HH is the hour and MM the minute, each zero-padded to a width of 2. Every line is printed whether or not it was changed. Note that, as the output shows, this leaves the AM/PM marker in place, and 12 AM is not converted to 00. If you want to know more about split or sprintf, read String Functions (The GNU Awk User's Guide). Observe that I do not set FS or OFS, as the defaults are fine for the presented task.
(tested in GNU Awk 5.1.0)
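A sketch of how the one-liner could be extended so that the AM/PM marker is removed and 12 AM maps to 00, which is what the requested output needs. The function name to24h_fields is made up for illustration, and this variant has not been run against the real logs. Because it assigns to $2, awk rebuilds the line with single blanks between fields, so any runs of spaces or tabs on timestamp lines are collapsed.

```shell
# to24h_fields: hypothetical name; same field-based approach as the one-liner,
# with two additions (the 12 AM case and the removal of the AM/PM marker).
to24h_fields() {
    awk '
    $3 ~ /^[AP]M$/ {
        split($2, arr, ":")
        if ($3 == "PM" && arr[1] < 12) {
            arr[1] += 12            # 1 PM .. 11 PM -> 13 .. 23
        } else if ($3 == "AM" && arr[1] == 12) {
            arr[1] = 0              # 12 AM -> 00
        }
        $2 = sprintf("%02d:%02d", arr[1], arr[2])
        sub(/ [AP]M /, " ")         # first match is field 3, the AM/PM marker
    }
    { print }
    '
}

printf '%s\n' \
    '5/30/22, 9:46 PM - person1: thanks' \
    'Transfer number: 3778324' | to24h_fields
```

The first line should come out as 5/30/22, 21:46 - person1: thanks, and the second line should pass through unchanged.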
Upvotes: 0