Reputation: 5745
I have to accomplish a relatively simple task. Basically, I have an enormous number of files in the following format:
"2014-01-27","07:20:38","data","data","data"
Basically, I would like to extract the first two fields, convert them into a Unix epoch date, add 6 hours to it (due to a timezone difference), and replace the first two original columns with the resulting milliseconds (Unix epoch since 1970-01-01, converted to milliseconds). I have written a script that works fine; the issue is that it is very, very slow. I need to run this over 150 files with a total line count of more than 5,000,000, and I was wondering if you had any advice on how I could make it faster. Here it is:
#!/bin/bash
function format()
{
    while read line; do
        entire_date=$(echo ${line} | cut -d"," -f1-2);
        trimmed_date=$(echo ${entire_date} | sed 's/"//g;s/,/ /g');
        seconds=$(date -d "${trimmed_date} + 6 hours" +%s);
        millis=$((${seconds} * 1000));
        echo ${line} | sed "s/$entire_date/\"$millis\"/g" >> "output"
    done < $*
}
format $*
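For reference, here is what the conversion should produce for the sample line above (using GNU date, with TZ pinned to UTC so the value is reproducible; the actual run uses the local timezone):

$ TZ=UTC date -d "2014-01-27 07:20:38 + 6 hours" +%s
1390828838
$ echo $((1390828838 * 1000))
1390828838000

The second value is what replaces the first two columns.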
Upvotes: 0
Views: 311
Reputation: 6228
I have tried to avoid external commands (except date) to gain time. Tests show that it is 4 times faster than your code. (Okay, tripleee's Perl solution is 40 times faster than mine!)
#! /bin/bash
function format()
{
    while IFS=, read date0 date1 datas; do
        # strip the surrounding double quotes
        date0="${date0//\"/}"
        date1="${date1//\"/}"
        seconds=$(date -d "$date0 $date1 + 6 hours" +%s)
        # appending "000" converts seconds to milliseconds without arithmetic
        echo "\"${seconds}000\",$datas"
    done
}
output="output.txt"
# Process each file given as an argument
for file ; do
    format < "$file"
done >| "$output"
exit 0
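Usage, assuming the script is saved as convert.sh (the name is just for illustration):

$ chmod +x convert.sh
$ ./convert.sh file1.csv file2.csv

All converted lines land in output.txt; the >| redirection overwrites it even if the shell's noclobber option is set.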
Upvotes: 3
Reputation: 45293
Using the existing mktime function in awk; tested, it is faster than the Perl solution:
awk '{t=$2 " " $4;gsub(/[-:]/," ",t);printf "\"%s\",%s\n",(mktime(t)+6*3600)*1000,substr($0,25)}' FS=\" OFS=\" file
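For readability, here is the same program expanded with comments (a sketch; it should behave identically to the one-liner above, and note that mktime() is a GNU awk extension):

awk -F'"' '{
    # with " as the field separator, $2 is the date and $4 the time
    t = $2 " " $4
    # mktime() expects "YYYY MM DD HH MM SS"
    gsub(/[-:]/, " ", t)
    # epoch seconds plus 6 hours, times 1000 for milliseconds;
    # substr($0, 25) keeps everything from the third column onwards
    printf "\"%s\",%s\n", (mktime(t) + 6*3600) * 1000, substr($0, 25)
}' file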
Here is the test result.
$ wc -l file
1244 file
$ time awk '{t=$2 " " $4;gsub(/[-:]/," ",t);printf "\"%s\",%s\n",(mktime(t)+6*3600)*1000,substr($0,25)}' FS=\" OFS=\" file > /dev/null
real 0m0.172s
user 0m0.140s
sys 0m0.046s
$ time perl -MDate::Parse -pe 'die "$0:$ARGV:$.: Unexpected input $_"
unless s/(?<=^")([^"]+)","([^"]+)(?=")/ (str2time("$1 $2")+6*3600)*1000 /e' file > /dev/null
real 0m0.328s
user 0m0.218s
sys 0m0.124s
Upvotes: 2
Reputation: 189729
You are spawning a significant number of processes for each input line. At a quick glance, probably half of those could easily be factored away, but I would definitely recommend switching to Perl or Python instead.
perl -MDate::Parse -pe 'die "$0:$ARGV:$.: Unexpected input $_"
unless s/(?<=^")([^"]+)","([^"]+)(?=")/ (str2time("$1 $2")+6*3600)*1000 /e'
I'd like to recommend Text::CSV, but I do not have it installed here, and if you have a requirement to not touch the fields after the second at all, it might not be what you need anyway. This is quick and dirty, but probably also much simpler than a "proper" CSV solution.
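For completeness, a rough sketch of what a Text::CSV-based variant might look like (untested here; it re-quotes every field, and the always_quote setting is my assumption about the desired output):

perl -MText::CSV -MDate::Parse -ne '
    BEGIN { $csv = Text::CSV->new({ binary => 1, always_quote => 1 }) }
    chomp;
    # parse the whole line, swap the first two fields for milliseconds, re-emit
    $csv->parse($_) or die "$0:$ARGV:$.: Unexpected input $_";
    my @f = $csv->fields;
    splice @f, 0, 2, (str2time("$f[0] $f[1]") + 6*3600) * 1000;
    $csv->combine(@f) and print $csv->string, "\n";
' file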
The real meat is the str2time function from Date::Parse, which I imagine will be a lot quicker than repeatedly calling date (ISTR it does some memoization internally so it can do nearby dates quickly). The regex replaces the first two fields with the output; note the /e flag, which allows Perl code to be evaluated in the replacement part. The (?<=^") and (?=") zero-width assertions require these matches to be present but do not include them in the substitution operation. (I originally substituted the enclosing double quotes, but with this change, they are retained, as apparently you want to keep them.)
Change the die to a warn if you want the script to continue in spite of errors (maybe redirect standard error to a file then!)
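To run this over all 150 files in one go, Perl's in-place editing should do (a sketch, assuming the files match *.csv; -i.bak keeps a backup of each original):

perl -i.bak -MDate::Parse -pe 'die "$0:$ARGV:$.: Unexpected input $_"
    unless s/(?<=^")([^"]+)","([^"]+)(?=")/ (str2time("$1 $2")+6*3600)*1000 /e' *.csv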
Upvotes: 3