Reputation: 895
I need to compare a set of values from file 'tmpcsv2' with values in 'uniq_id'; I'm detailing the files below.
tmpcsv2 -> This file gets updated by another script, 'script1', and each run of 'script1' overwrites (does not append) the values in 'tmpcsv2'. The number of values can range from 1 up to 200.
eg:
2042344352
2470697747
2635527510
3667769962
uniq_id -> This is a fixed set of records (about 100K in number)
(Business Name,Job ID,Job Size)
biz,1000036446,225210640
biz,100006309,6710840
biz,1000069211,2084019000
biz,1000118720,34194040
biz,1000150241,212322636
I'm using 'for' loops + 'if' to compare them as shown below. Is there an easier or faster (lower-impact) way of doing this? When I run this it takes a very long time to output results. The print commands are just for testing and will be removed later!
****Part of a bigger script****
amt=0
mjc=0
for jbid in `cat tmpcsv2`                    #Pick ID for match & calculation
do
    printf "Checking ID $jbid\n" >> Acsv3.tmp
    for bsid in `cat uniq_id`                #Matching jobs & size calculation
    do
        ckid=`echo $bsid | cut -d "," -f2`   #ckid is the ID to check
        jbsiz=`echo $bsid | cut -d "," -f3`  #size of the ID
        if [ $jbid == $ckid ]
        then
            printf "Matched at $ckid\n"      #Print on match found
            printf "Valid -> $jbid\n" >> Bcsv3.tmp
            ((mjc++))                        #Increment matched job count
            amt=$((amt+jbsiz))               #Add size of matched jobs
            break
        else
            printf "No Match at $cksid\n"    #No matches
        fi
    done
    printf "Check for ID $jbid done\n" >> Acsv3.tmp
    printf "Matched $mjc jobs with combined size of $amt\n" >> Acsv3.tmp
done
****End of Comparison****
Upvotes: 1
Views: 98
Reputation: 895
I have come up with this; not sure if it can be shortened, but it sure runs faster! Any help will be much appreciated!
************
while read -r line                            #File read start
do
    IFS=','
    val=$line
    amt=0
    mjc=0
    cjc=0
    for lsid in $val
    do
        cksid=`echo $lsid | sed -e 's/*//g' -e 's/"//g'`
        printf "Checking for $cksid\n"
        ((cjc++))                             #Count of jobs to check
        prsnt=`grep -w $cksid uniq_id`
        if [ $? -eq 0 ]
        then
            printf "Valid -> $cksid\n"
            jbsiz=`echo $prsnt | cut -d, -f3` #size is the third field
            (( mjc++, amt += jbsiz ))
        else
            printf "No Data for $cksid\n"
        fi
    done
done < tmpcsv2
***********
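The per-ID grep calls can also be collapsed into a single pass. A minimal sketch, assuming tmpcsv2 holds bare IDs and that an ID never occurs as a whole word in any other field of uniq_id:

```shell
# One pass: treat tmpcsv2 as a list of fixed-string patterns, match whole
# words against uniq_id, then count matches and sum field 3 (the job size).
grep -wFf tmpcsv2 uniq_id | awk -F, '{ mjc++; amt += $3 }
    END { printf "Matched %d jobs with combined size of %d\n", mjc, amt }'
```

Note -w matches the ID anywhere on the line, so a job ID that happens to equal some job's size would be a false positive; with the data shown that should not occur.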
Upvotes: 0
Reputation: 6587
A shell is the wrong tool for crunching this much data, but it's doable. The most basic mistake here is reading lines with 'for'. Performance may be improved significantly by not re-opening the files on each iteration.
function main {
    # Variables used elsewhere should be initialized there, not localized here.
    typeset amt=0 mjc=0 jbid ckid jbsiz
    while IFS= read -r jbid; do
        printf 'Checking ID %s\n' "$jbid" >&3
        while IFS=, read -r _ ckid jbsiz _; do
            case $jbsiz in
                *[^[:digit:]]*|'')
                    # validation is important for subsequent arithmetic.
                    return 1
            esac
            case $ckid in # Assuming "cksid" in the question was a typo. Replace if not.
                "$jbid")
                    printf 'Matched at %s\n' "$ckid"
                    printf 'Valid -> %s\n' "$jbid" >&4
                    (( mjc++, amt += jbsiz ))
                    break
                    ;;
                *)
                    printf 'No match at %s\n' "$ckid"
            esac
        done <uniq_id
        {
            printf 'Check for ID %s done\n' "$jbid"
            printf 'Matched %s jobs with combined size of %s\n' "$mjc" "$amt"
        } >&3
    done <tmpcsv2 3>>Acsv3.tmp 4>>Bcsv3.tmp
}
Finally, an equivalent awk script will significantly outperform this Bash script, as will nearly any other language. You can also get a lot more performance out of Bash by using mapfile instead of a read loop, but this nested read-loop logic is a bit sloppy to emulate using mapfile callbacks.
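A minimal sketch of that awk approach, assuming tmpcsv2 holds one ID per line and uniq_id has the (name,id,size) layout shown in the question:

```shell
# Read the IDs from the first file into an array, then scan the second
# file once, summing the size field of every record whose ID was listed.
awk -F, '
    NR == FNR { want[$1]; next }     # first file: tmpcsv2, one ID per line
    $2 in want { mjc++; amt += $3 }  # second file: uniq_id (name,id,size)
    END { printf "Matched %d jobs with combined size of %d\n", mjc, amt }
' tmpcsv2 uniq_id
```

This reads each file exactly once and does the lookups via a hash, so it stays fast even at 100K records.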
Upvotes: 1