proxymoxy

Reputation: 13

Editing large files efficiently

I have some large logfiles that use the old syslog date format from RFC 3164 (MMM dd HH:mm:ss), and I want to convert them to the new syslog date format from RFC 5424 (YYYY-mm-ddTHH:mm:ss +TMZ). I have created the following bash script:

#!/bin/bash

#Loop over directories
for i in $1
do
    echo "Processing directory $i"
    if [ -d $i ]
    then
        cd $i
        #Loop over log files inside the directory
        for j in *.2021
        do
            echo "Processing file $j"
            #Read line by line and perform transformation on dates and append to new file
            cat $j | \
                while read CMD; do
                    tmpdate=$(printf '%s\n' "$CMD" | awk -F" $i" 'BEGIN {ORS=""}; {print $1}')
                    newdate=$(date +'%Y-%m-%dT%H:%M:%S+02:00' -d "$tmpdate")

                    printf '%s\n' "$CMD" | sed 's/'"$tmpdate"'/'"$newdate"'/g' >> $j.new
                done
            mv $j.new $j
        done
        cd ..
    fi
done

But this is taking a looooong time to execute since I have files with several million lines (logs dating back over one year on a mail server for example). So far this has been running for days and still a lot of lines to parse :-)

So two questions.

  1. Why is this script taking such a long time to execute?
  2. Is there a faster way to do this? Using one of GNU utils (sed, awk etc), bash or python.

======== EDIT =======

Here are examples of the old format:

Feb  1 21:59:44 calendar os-prober: debug: running /usr/lib/os-probes/50mounted-tests on /dev/sda2
Feb  1 21:59:44 calendar 50mounted-tests: debug: /dev/sda2 type not recognised; skipping
Feb  1 21:59:44 calendar os-prober: debug: os detected by /usr/lib/os-probes/50mounted-tests

Note that there are 2 spaces between Feb and 1, if the date is 10 or higher the space is only 1 as in

Feb 10 10:39:53 calendar os-prober: debug: running /usr/lib/os-probes/50mounted-tests on /dev/sda2

In the new format it would look like this:

2021-02-01T21:59:44+02:00 calendar os-prober: debug: running /usr/lib/os-probes/50mounted-tests on /dev/sda2
2021-02-01T21:59:44+02:00 calendar 50mounted-tests: debug: /dev/sda2 type not recognised; skipping
2021-02-01T21:59:44+02:00 calendar os-prober: debug: os detected by /usr/lib/os-probes/50mounted-tests

TIA.

Upvotes: 1

Views: 96

Answers (2)

Socowi

Reputation: 27215

Why is this script taking such a long time to execute?

Bash is a scripting language and intended to run other programs. Therefore, bash itself as a language isn't very fast. But it gets even worse if you repeatedly start other processes. Starting a process is very costly. Every time you execute something like sed, awk, date, or even just $(...) or ... | ... you start a process. In a loop, this adds up.

Compare time for ((i=0; i<1000; ++i)); do true; done vs. time for ((i=0; i<1000; ++i)); do /bin/true; done. The former uses bash's built-in command and therefore does not start other processes; it finishes immediately. The latter uses an external program and therefore repeatedly starts a process; it takes 4.5 seconds on my system.
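
To make the same point with the log conversion itself, here is a minimal sketch of a per-line loop that relies only on shell built-ins (read, case, printf and parameter expansion), so no process is started per line. The function name convert_builtin_only, the hard-coded year 2021 and the offset +02:00 are assumptions for illustration, and lines that do not look like "MMM dd HH:MM:SS ..." are passed through unchanged. It is not a recommendation: a single sed or awk pass should still be faster.

```shell
# Sketch: convert old syslog dates using shell built-ins only.
# Assumptions: every date belongs to 2021, and the offset is +02:00.
convert_builtin_only() {
    while IFS= read -r line; do
        case $line in
            ???\ ??\ ??:??:??\ *) ;;                # shaped like "MMM dd HH:MM:SS ..."
            *) printf '%s\n' "$line"; continue ;;   # anything else passes through
        esac
        case ${line%% *} in                         # first word = month name
            Jan) mm=01 ;; Feb) mm=02 ;; Mar) mm=03 ;; Apr) mm=04 ;;
            May) mm=05 ;; Jun) mm=06 ;; Jul) mm=07 ;; Aug) mm=08 ;;
            Sep) mm=09 ;; Oct) mm=10 ;; Nov) mm=11 ;; Dec) mm=12 ;;
            *) printf '%s\n' "$line"; continue ;;   # unknown month: pass through
        esac
        remainder=${line#????}                      # drop "MMM "
        day=${remainder%"${remainder#??}"}          # two-character day, " 1" or "10"
        case $day in " "?) day=0${day# } ;; esac    # " 1" becomes "01"
        remainder=${remainder#???}                  # drop "dd "
        clock=${remainder%% *}                      # HH:MM:SS
        rest=${remainder#* }                        # everything after the timestamp
        printf '2021-%s-%sT%s+02:00 %s\n' "$mm" "$day" "$clock" "$rest"
    done
}
```

The function reads standard input and writes standard output, so it could be used as convert_builtin_only < mail.log.2021 > mail.log.2021.new (file names hypothetical).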

Is there a faster way to do this? Using one of GNU utils (sed, awk etc), bash or python.

Yes. If you rewrite your script in Python, it will run tremendously faster, assuming you use Python's built-in functions instead of repeatedly calling sp = subprocess.run(["date", ...], stdout=subprocess.PIPE) and newDate = sp.stdout and so on :)
When writing it that way, you would immediately notice that this cannot be efficient. Bash makes it so easy to run other programs that you often forget all the work that is done behind the scenes.

But since you tagged your question as bash, let's stick to a script solution.

The transformation of MMM to MM (e.g. Jan to 01) is a bit tricky for sed. We have to use a separate replacement for each month. Luckily, the month is always at the beginning, so we can replace it separately from the rest of the date.
To add a leading zero to single digit days we use an additional replacement.

sed -i.bak -E -e's/^Jan/01/;s/^Feb/02/;s/^Mar/03/;...' \
  -e's/^(..)  /\1 0/' \
  -e's/^([0-9]+)  ?([0-9]+) ([0-9]+:[0-9]+:[0-9]+)/2021-\1-\2T\3+02:00/' */*.2021

The first expression can be automatically generated:

monthNameToNumber=$(
   printf %s\\n Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec |
   awk '{printf "s/^%s/%02d/;", $0, NR}'
)
sed -i.bak -E -e"$monthNameToNumber" \
  -e's/^(..)  /\1 0/' \
  -e's/^([0-9]+)  ?([0-9]+) ([0-9]+:[0-9]+:[0-9]+)/2021-\1-\2T\3+02:00/' */*.2021

This replaces all dates at the start of your log lines, in all log files one directory under the current one. The logs will be modified in-place. A backup of each log is created with the suffix .bak.
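
Before modifying anything in place, it may be worth previewing the result on a few lines. The following sketch wraps the same three sed expressions in a filter that reads standard input and leaves the files untouched. The function name preview_dates is an assumption for illustration; the year 2021 and the offset +02:00 are hard-coded exactly as in the commands above.

```shell
# Sketch: preview the sed conversion without touching any file.
# preview_dates is a hypothetical helper name.
preview_dates() {
    # Build "s/^Jan/01/;s/^Feb/02/;...;s/^Dec/12/;" from the month list
    month_map=$(
        printf '%s\n' Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec |
        awk '{printf "s/^%s/%02d/;", $0, NR}'
    )
    # 1) month name -> month number, 2) pad single-digit days,
    # 3) reorder the fields into the new timestamp format
    sed -E -e "$month_map" \
        -e 's/^(..)  /\1 0/' \
        -e 's/^([0-9]+)  ?([0-9]+) ([0-9]+:[0-9]+:[0-9]+)/2021-\1-\2T\3+02:00/'
}
```

For example, head -n 5 somedir/mail.log.2021 | preview_dates (path hypothetical) prints the first five converted lines.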

Upvotes: 0

tripleee

Reputation: 189387

You are starting sed (along with awk and date) once for every line in the file. This is a huge but unfortunately fairly common beginner antipattern.

The pipeline to create the sed command is also quite overcomplicated and inefficient.

You don't really need date to convert between date formats when the result will contain exactly the same information in a different order. Try something like

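awk -v yyyy="$(date +%Y)" 'BEGIN {
    # Map month names to numbers: m["Jan"] = 1, ..., m["Dec"] = 12
    split("Jan:Feb:Mar:Apr:May:Jun:Jul:Aug:Sep:Oct:Nov:Dec", _m, ":");
    for(i=1; i<=12; ++i) m[_m[i]] = i }
# The old timestamp fills columns 1-15; the rest of the line starts at column 17
{ printf "%04i-%02i-%02iT%s+02:00 %s\n",
    yyyy, m[$1], $2, $3, substr($0, 17) }' "$j" >"$j.new"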

Demo: https://ideone.com/VBDqB8
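
To plug this into the loop from the question, one possible sketch is a small helper that converts a single file and replaces it only if awk succeeds. The name convert_file and the explicit year argument are assumptions for illustration; passing the year explicitly avoids relying on the current date when the logs are from an earlier year. Unlike the command above, lines whose first field is not a month name are copied unchanged.

```shell
# Sketch: convert one log file, replacing it only if awk succeeds.
# convert_file is a hypothetical helper; usage: convert_file YEAR FILE
convert_file() {
    year=$1
    file=$2
    awk -v yyyy="$year" 'BEGIN {
        split("Jan:Feb:Mar:Apr:May:Jun:Jul:Aug:Sep:Oct:Nov:Dec", names, ":")
        for (i = 1; i <= 12; ++i) month[names[i]] = i
    }
    # Lines that do not start with a month name are copied unchanged
    !($1 in month) { print; next }
    # The old timestamp fills columns 1-15; the message starts at column 17
    { printf "%04d-%02d-%02dT%s+02:00 %s\n",
        yyyy, month[$1], $2, $3, substr($0, 17) }' "$file" > "$file.new" &&
    mv "$file.new" "$file"
}
```

The inner loop of the original script then reduces to for j in *.2021; do convert_file 2021 "$j"; done, which starts one awk process per file instead of several processes per line.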

Upvotes: 2
