Reputation: 895
I need to compare a set of values from file 'tmpcsv2' with values in 'uniq_id'; I'm detailing the files below.
tmpcsv2 -> This file gets updated by another script, 'script1', and each run of 'script1' overwrites (does not append) the values in 'tmpcsv2'. The number of values can range from 1 up to 200.
eg:
2042344352
2470697747
2635527510
3667769962
uniq_id -> This is a fixed set of records (about 100K in number)
(Business Name,Job ID,Job Size)
biz,1000036446,225210640
biz,100006309,6710840
biz,1000069211,2084019000
biz,1000118720,34194040
biz,1000150241,212322636
I'm using 'for' loops + 'if' to compare them as shown below. Is there an easier or faster (lower-impact) way of doing this? When I run this it takes a very long time to output results. The print commands are just for testing and will be removed later!
****Part of a bigger script****
amt=0
mjc=0
for jbid in `cat tmpcsv2`                    #Pick ID for match & calculation
do
    printf "Checking ID $jbid\n" >> Acsv3.tmp
    for bsid in `cat uniq_id`                #Matching jobs & size calculation
    do
        ckid=`echo $bsid | cut -d "," -f2`   #ckid is the ID to check
        jbsiz=`echo $bsid | cut -d "," -f3`  #size of the ID
        if [ $jbid == $ckid ]
        then
            printf "Matched at $ckid\n"      #Print on match found
            printf "Valid -> $jbid\n" >> Bcsv3.tmp
            ((mjc++))                        #Increment matched job count
            amt=$((amt+jbsiz))               #Add size of matched jobs
            break
        else
            printf "No Match at $cksid\n"    #No matches
        fi
    done
    printf "Check for ID $jbid done\n" >> Acsv3.tmp
    printf "Matched $mjc jobs with combined size of $amt\n" >> Acsv3.tmp
done
****End of Comparison****
Upvotes: 1
Views: 98
Reputation: 895
I have come up with this; not sure if it can be shortened, but it sure runs faster! Any help will be much appreciated!
************
while read -r line                            #File read start
do
    IFS=','
    val=$line
    amt=0
    mjc=0
    cjc=0
    for lsid in $val
    do
        cksid=`echo $lsid | sed -e 's/*//g' -e 's/"//g'`
        printf "Checking for $cksid\n"
        ((cjc++))                             #Count of jobs to check
        prsnt=`grep -w $cksid uniq_id`
        if [ $? -eq 0 ]
        then
            printf "Valid -> $cksid\n"
            jbsiz=`echo $prsnt | cut -d, -f3` #size is the third field
            (( mjc++, amt += jbsiz ))
        else
            printf "No Data for $cksid\n"
        fi
    done
done < tmpcsv2
***********
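The per-ID grep calls can also be collapsed into a single pass. A minimal sketch, assuming tmpcsv2 holds bare IDs and that an ID never occurs as a whole word in any other field of uniq_id:

```shell
# One pass: treat tmpcsv2 as a list of fixed-string patterns, match whole
# words against uniq_id, then count matches and sum field 3 (the job size).
grep -wFf tmpcsv2 uniq_id | awk -F, '{ mjc++; amt += $3 }
    END { printf "Matched %d jobs with combined size of %d\n", mjc, amt }'
```

Note -w matches the ID anywhere on the line, so a job ID that happens to equal some job's size would be a false positive; with the data shown that should not occur.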
Upvotes: 0
Reputation: 6587
A shell is the wrong tool for crunching this much data, but it's doable. The most basic mistake here is reading lines with 'for'. Performance may be improved significantly by not re-opening the files on each iteration.
function main {
    # Variables used elsewhere should be initialized there, not localized here.
    typeset amt=0 mjc=0 jbid ckid jbsiz
    while IFS= read -r jbid; do
        printf 'Checking ID %s\n' "$jbid" >&3
        while IFS=, read -r _ ckid jbsiz _; do
            case $jbsiz in
                *[^[:digit:]]*|'')
                    # validation is important for subsequent arithmetic.
                    return 1
            esac
            case $ckid in # Assuming "cksid" in the question was a typo. Replace if not.
                "$jbid")
                    printf 'Matched at %s\n' "$ckid"
                    printf 'Valid -> %s\n' "$jbid" >&4
                    (( mjc++, amt += jbsiz ))
                    break
                    ;;
                *)
                    printf 'No match at %s\n' "$ckid"
            esac
        done <uniq_id
        {
            printf 'Check for ID %s done\n' "$jbid"
            printf 'Matched %s jobs with combined size of %s\n' "$mjc" "$amt"
        } >&3
    done <tmpcsv2 3>>Acsv3.tmp 4>>Bcsv3.tmp
}
Finally, an equivalent awk script will significantly outperform this Bash script, as will nearly any other language. You can also get a lot more performance out of Bash by using mapfile instead of a read loop, but this nested read-loop logic is a bit sloppy to emulate using mapfile callbacks.
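A minimal sketch of that awk approach, assuming tmpcsv2 holds one ID per line and uniq_id has the (name,id,size) layout shown in the question:

```shell
# Read the IDs from the first file into an array, then scan the second
# file once, summing the size field of every record whose ID was listed.
awk -F, '
    NR == FNR { want[$1]; next }     # first file: tmpcsv2, one ID per line
    $2 in want { mjc++; amt += $3 }  # second file: uniq_id (name,id,size)
    END { printf "Matched %d jobs with combined size of %d\n", mjc, amt }
' tmpcsv2 uniq_id
```

This reads each file exactly once and does the lookups via a hash, so it stays fast even at 100K records.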
Upvotes: 1