Reputation: 13
I have some large log files with dates in the old syslog format from RFC 3164 (MMM dd HH:mm:ss) that I want to convert to the new syslog format from RFC 5424 (YYYY-mm-ddTHH:mm:ss+TMZ). I have created the following bash script:
#!/bin/bash
#Loop over directories
for i in $1
do
echo "Processing directory $i"
if [ -d $i ]
then
cd $i
#Loop over log files inside the directory
for j in *.2021
do
echo "Processing file $j"
#Read line by line and perform transformation on dates and append to new file
cat $j | \
while read CMD; do
tmpdate=$(printf '%s\n' "$CMD" | awk -F" $i" 'BEGIN {ORS=""}; {print $1}')
newdate=$(date +'%Y-%m-%dT%H:%M:%S+02:00' -d "$tmpdate")
printf '%s\n' "$CMD" | sed 's/'"$tmpdate"'/'"$newdate"'/g' >> $j.new
done
mv $j.new $j
done
cd ..
fi
done
But this is taking a looooong time to execute since I have files with several million lines (logs dating back over one year on a mail server for example). So far this has been running for days and still a lot of lines to parse :-)
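So two questions:
Why is this script taking such a long time to execute?
Is there a faster way to do this? Using one of GNU utils (sed, awk etc), bash or python.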
======== EDIT =======
Here are examples of the old format:
Feb  1 21:59:44 calendar os-prober: debug: running /usr/lib/os-probes/50mounted-tests on /dev/sda2
Feb  1 21:59:44 calendar 50mounted-tests: debug: /dev/sda2 type not recognised; skipping
Feb  1 21:59:44 calendar os-prober: debug: os detected by /usr/lib/os-probes/50mounted-tests
Note that there are two spaces between Feb and 1. If the day is 10 or higher, there is only one space, as in:
Feb 10 10:39:53 calendar os-prober: debug: running /usr/lib/os-probes/50mounted-tests on /dev/sda2
In the new format it would look like this:
2021-02-01T21:59:44+02:00 calendar os-prober: debug: running /usr/lib/os-probes/50mounted-tests on /dev/sda2
2021-02-01T21:59:44+02:00 calendar 50mounted-tests: debug: /dev/sda2 type not recognised; skipping
2021-02-01T21:59:44+02:00 calendar os-prober: debug: os detected by /usr/lib/os-probes/50mounted-tests
TIA.
Upvotes: 1
Views: 96
Reputation: 27215
Why is this script taking such a long time to execute?
Bash is a scripting language intended to run other programs, so bash itself isn't very fast as a language. It gets even worse if you repeatedly start other processes, because starting a process is very costly. Every time you execute something like sed, awk, date, or even just $(...) or ... | ... you start a process. In a loop, this adds up.
Compare time for ((i=0; i<1000; ++i)); do true; done vs. time for ((i=0; i<1000; ++i)); do /bin/true; done. The former uses bash's built-in command and therefore does not start other processes; it finishes immediately. The latter uses an external program and therefore starts a new process on every iteration; it takes 4.5 seconds on my system.
Is there a faster way to do this? Using one of GNU utils (sed, awk etc), bash or python.
Yes. If you rewrite your script in Python it will run tremendously faster, assuming you use Python's built-in functions instead of repeatedly calling sp = subprocess.run(["date", ...], stdout=subprocess.PIPE) and newDate = sp.stdout and so on :)
When writing it with subprocess calls like that, you would immediately notice that this cannot be efficient. bash makes it so easy to run other programs that you often forget all the work that is done behind the scenes.
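For comparison, here is a minimal sketch of what a Python version using only built-in functions might look like. It is not part of the script solution below. The names convert_line and convert_file are made up for illustration, and the sketch assumes (as the question's script does) that every entry belongs to 2021 and that the offset is always +02:00.

```python
import re

# Assumptions (not from the log files themselves): all entries are from 2021
# and the UTC offset is always +02:00, as hard-coded in the question's script.
YEAR = 2021
UTC_OFFSET = "+02:00"

# Map "Jan" -> 1, "Feb" -> 2, ... so no external date program is needed.
MONTHS = {name: number for number, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

# Old-style timestamp at the start of a line, e.g. "Feb  1 21:59:44"
# (single-digit days are padded with an extra space) or "Feb 10 10:39:53".
OLD_STAMP = re.compile(r"^([A-Z][a-z]{2}) +(\d{1,2}) (\d{2}:\d{2}:\d{2})")


def convert_line(line: str) -> str:
    """Rewrite a leading old-style timestamp; return other lines unchanged."""
    match = OLD_STAMP.match(line)
    if match is None or match.group(1) not in MONTHS:
        return line
    month, day, clock = match.groups()
    stamp = f"{YEAR:04d}-{MONTHS[month]:02d}-{int(day):02d}T{clock}{UTC_OFFSET}"
    return stamp + line[match.end():]


def convert_file(src_path: str, dst_path: str) -> None:
    """Stream src_path to dst_path, converting one line at a time."""
    with open(src_path, encoding="utf-8", errors="surrogateescape") as src, \
         open(dst_path, "w", encoding="utf-8", errors="surrogateescape") as dst:
        for line in src:
            dst.write(convert_line(line))
```

Each file is read once and written once, and no process is started per line, which is where the original script loses its time.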
But since you tagged your question as bash, let's stick to a script solution.
The transformation of MMM to MM (e.g. Jan to 01) is a bit tricky for sed. We have to use a separate replacement for each month. Luckily, the month is always at the beginning of the line, so we can replace it separately from the rest of the date.
Single-digit days are preceded by two spaces, so an additional replacement turns the second space into a leading zero.
sed -i.bak -E -e's/^Jan/01/;s/^Feb/02/;s/^Mar/03/;...' \
-e's/^(..) {2}/\1 0/' \
-e's/^([0-9]+) ?([0-9]+) ([0-9]+:[0-9]+:[0-9]+)/2021-\1-\2T\3+02:00/' */*.2021
The first expression can be automatically generated:
monthNameToNumber=$(
printf %s\\n Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec |
awk '{printf "s/^%s/%02d/;", $0, NR}'
)
sed -i.bak -E -e"$monthNameToNumber" \
-e's/^(..) {2}/\1 0/' \
-e's/^([0-9]+) ?([0-9]+) ([0-9]+:[0-9]+:[0-9]+)/2021-\1-\2T\3+02:00/' */*.2021
This replaces all dates at the start of your log lines, in all log files one directory below the current one. The logs are modified in place, and a backup of each log is created with the suffix .bak.
Upvotes: 0
Reputation: 189387
You are rewriting the entire file with sed as many times as you have lines in the file. This is a huge but unfortunately fairly common beginner antipattern.
The pipeline to create the sed command is also quite overcomplicated and inefficient.
You don't really need date to convert between date formats when the result will contain exactly the same information in a different order. Try something like
# yyyy defaults to the current year; pass -v yyyy=2021 instead for the *.2021 files
awk -v yyyy="$(date +%Y)" 'BEGIN {
    split("Jan:Feb:Mar:Apr:May:Jun:Jul:Aug:Sep:Oct:Nov:Dec", _m, ":");
    for (i=1; i<=12; ++i) m[_m[i]] = i }
  { printf "%04i-%02i-%02iT%s+02:00 %s\n",
      yyyy, m[$1], $2, $3, substr($0, 17) }' "$j" >"$j.new"
Demo: https://ideone.com/VBDqB8
Upvotes: 2