Reputation: 117
I've written a simple parser in BASH to take apart CSV files and dump them to a (temp) SQL-input file. The performance is pretty terrible; on a modern system I'm barely cracking 100 lines per second. I realize the ultimate answer is to rewrite this in a more performance-oriented language, but as a learning opportunity, I'm curious where I can improve my BASH skills. I suspect there are gains to be made by writing to RAM instead of to a file, then flushing all the text to the file at once, but I'm not clear on where/when BASH gets upset about memory usage (the largest files I've parsed have been under 500MB).
The following code block seems to eat most of the cycles and, as I understand it, needs to be processed linearly because it checks timestamps (the data has a timestamp but no date stamp, so I was forced to ask the user for the start day and check whether the timestamp has cycled 24:00 -> 0:00), so parallel processing didn't seem like an option.
while read p; do
linetime=`printf "${p}" | awk '{printf $1}'`
# THE DATA LACKS FULL DATESTAMPS, SO FORCED TO ASK USER FOR START-DAY & CHECK IF THE DATE HAS CYCLED
if [[ "$lastline" > "$linetime" ]]
then
experimentdate=$(eval $datecmd)
fi
lastline=$linetime
printf "$p" | awk -v varout="$projname" -v experiment_day="$experimentdate " -v singlequote="$cleanquote" '{printf "insert into tool (project,project_datetime,reported_time,seconds,intensity) values ("singlequote""varout""singlequote","singlequote""experiment_day $1""singlequote","singlequote""$1""singlequote","$2","$3");\n"}' >> $sql_input_file
Ignore the singlequote nonsense; I needed this to run on both OSX and 'nix, so I had to work around some issues with OSX's awk and single quotes.
Any suggestions for how I can improve performance?
Upvotes: 1
Views: 843
Reputation: 19982
You do not want to start awk for every line you process in a loop. Replace your loop with awk or replace awk with builtin commands.
Both awk calls are only used for printing. Replace these lines with additional parameters to the printf command.
I did not understand the code block for datecmd (not using $linetime but using the output variable experimentdate), but this one should be optimised: can you use regular expressions or some other trick?
So you do not have to tune awk; instead, decide either to use awk for the whole job or to get it out of your while loop.
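As a rough sketch of the "get awk out of the loop" option: read can split each line into fields itself, and the builtin printf can do the formatting, so no external process is started per line. This assumes the fields are whitespace-separated as time, seconds, intensity (which is what the awk calls in the question imply). The values of projname, experimentdate and datecmd below are placeholders, since the question does not show the real ones, and to_sql is just a name made up for this example.

```shell
#!/usr/bin/env bash
# Placeholder values; the real ones are not shown in the question.
projname="demo"
experimentdate="2020-01-01"
datecmd='echo 2020-01-02'   # stand-in for whatever $datecmd really runs

# to_sql (hypothetical name): reads "time seconds intensity" lines on stdin
# and writes one insert statement per line on stdout.
to_sql() {
    local linetime seconds intensity lastline=""
    # read does the field splitting, so awk is not needed for $1, $2, $3
    while read -r linetime seconds intensity; do
        [[ -z "$linetime" ]] && continue          # skip blank lines
        # Same rollover test as in the question: the time went backwards
        if [[ "$lastline" > "$linetime" ]]; then
            experimentdate=$(eval "$datecmd")     # only runs once per rollover
        fi
        lastline=$linetime
        # Data goes in as arguments, never as the format string
        printf "insert into tool (project,project_datetime,reported_time,seconds,intensity) values ('%s','%s %s','%s',%s,%s);\n" \
            "$projname" "$experimentdate" "$linetime" "$linetime" "$seconds" "$intensity"
    done
}
```

You would call it with something like `to_sql < input.txt >> "$sql_input_file"` (input.txt is a made-up name). Redirecting once for the whole loop also means the output file is opened once instead of once per line.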
Upvotes: 2
Reputation: 1184
Your performance would improve if you did all the processing with awk. Awk can read your input file directly, express conditionals, and run external commands.
Awk is not the only option either. Perl and Python would also be well suited to this task.
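A minimal sketch of what the all-awk version might look like, assuming whitespace-separated fields (time, seconds, intensity) as in the question's own awk calls. The shell variable values and the function name to_sql_awk are placeholders made up for the example; the real datecmd is not shown in the question, so a fixed echo stands in for it.

```shell
#!/usr/bin/env bash
# Placeholder values; the real ones are not shown in the question.
projname="demo"
experimentdate="2020-01-01"
datecmd='echo 2020-01-02'   # stand-in for whatever $datecmd really runs

# to_sql_awk (hypothetical name): a single awk process handles every line.
to_sql_awk() {
    awk -v proj="$projname" -v day="$experimentdate" \
        -v datecmd="$datecmd" -v q="'" '
        NF < 3 { next }                      # skip blank or short lines
        # Appending "" forces a string comparison, like [[ a > b ]] in bash
        (last "") > ($1 "") {
            datecmd | getline day            # run the external date command
            close(datecmd)                   # so it runs again next rollover
        }
        {
            last = $1
            printf "insert into tool (project,project_datetime,reported_time,seconds,intensity) values (%s%s%s,%s%s %s%s,%s%s%s,%s,%s);\n",
                q, proj, q, q, day, $1, q, q, $1, q, $2, $3
        }
    '
}
```

Usage would be along the lines of `to_sql_awk < input.txt > "$sql_input_file"` (input.txt is a made-up name). The single quote is passed in with -v, the same way the question does it, so the awk program itself contains no quote characters.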
Upvotes: 2