pathfinder

Reputation: 13

How to speed up a bash script?

I have a very large tab-separated text file which I am parsing to extract certain data. Since the input file is very large, the script is very slow. How can I speed it up?

I tried using & with wait, which was actually a bit slower, and also nice (measured with time).

Update: a few lines of input.tsv:

Names   Number  Cylinder    torque  HP  cc  others
chevrolet   18  8   307 130 3504    SLR=0.1;MIN=5;MAX=19;PR=0.008;SUM=27;SD=0.5;IQR=9.5;RANG=7.5;MP_R=0.0177;MX_R=9.118
buick   15  8   350 165 3693    SLR=0.7;MIN=7;MAX=17;PR=0.07;SUM=30;SD=2.5;IQR=7.5;RANG=9.5;MP_R=0.0197;MX_R=9.1541
satellite   18  8   318 150 3436    SLR=0.12;MIN=2;MAX=11;PR=0.065;SUM=17;SD=5.5;IQR=11.5;RANG=6.5;MP_R=0.0377;MX_R=9.154
rebel   16  8   304 150 3433    SLR=0.61;MIN=8;MAX=15;PR=0.04148;SUM=24;SD=4.5;IQR=12.5;RANG=9.5;MP_R=0.018;MX_R=9.186
torino  17  8   302 140 3449    SLR=0.2;MIN=4;MAX=14;PR=0.018;SUM=22;SD=1.5;IQR=7.5;RANG=5.5;MP_R=0.0141;MX_R=9.115

Thank you

extract.sh

#!/bin/bash
zcat input.tsv.gz | while IFS=$'\t' read -r a b c d e f g
do
        # NOTE: spawns six awk processes per input line
        m=$(echo "$g" | awk -v key="MAX"  -v RS=';' -F'=' '$1==key{print $2}')
        n=$(echo "$g" | awk -v key="MIN"  -v RS=';' -F'=' '$1==key{print $2}')
        o=$(echo "$g" | awk -v key="SUM"  -v RS=';' -F'=' '$1==key{print $2}')
        p=$(echo "$g" | awk -v key="SD"   -v RS=';' -F'=' '$1==key{print $2}')
        q=$(echo "$g" | awk -v key="IQR"  -v RS=';' -F'=' '$1==key{print $2}')
        r=$(echo "$g" | awk -v key="RANG" -v RS=';' -F'=' '$1==key{print $2}')
        printf '%s\t%s\t%s\t%s\t%s\t%s\tMAX=%s\tMIN=%s\tSUM=%s\tSD=%s\tIQR=%s\tRANG=%s\n' \
                "$a" "$b" "$c" "$d" "$e" "$f" "$m" "$n" "$o" "$p" "$q" "$r"
done

How do I modify the script to run with xargs or parallel to speed up the process, or otherwise instruct it to use more of the computer's resources?

Upvotes: 1

Views: 177

Answers (1)

Fravadona

Reputation: 16960

In each of your records, the semicolon-delimited fields seem to contain the same keywords in the same order, so you should be able to do something like this:

#!/bin/bash
zcat input.tsv.gz |
awk '
    BEGIN { OFS = "\t" }
    NR > 1 {                # skip the header line
        # a[1]=SLR, a[2]=MIN, a[3]=MAX, a[4]=PR, a[5]=SUM, a[6]=SD, a[7]=IQR, a[8]=RANG, ...
        split($7, a, ";")
        print $1,$2,$3,$4,$5,$6,a[3],a[2],a[5],a[6],a[7],a[8]
    }
'

Output:
chevrolet   18  8   307 130 3504    MAX=19  MIN=5   SUM=27  SD=0.5  IQR=9.5 RANG=7.5
buick   15  8   350 165 3693    MAX=17  MIN=7   SUM=30  SD=2.5  IQR=7.5 RANG=9.5
satellite   18  8   318 150 3436    MAX=11  MIN=2   SUM=17  SD=5.5  IQR=11.5    RANG=6.5
rebel   16  8   304 150 3433    MAX=15  MIN=8   SUM=24  SD=4.5  IQR=12.5    RANG=9.5
torino  17  8   302 140 3449    MAX=14  MIN=4   SUM=22  SD=1.5  IQR=7.5 RANG=5.5

UPDATE

As the semicolon-delimited fields can appear in any order, you'll need further processing to pick out the correct ones:

zcat input.tsv.gz |
awk '
    BEGIN { OFS = "\t" }
    NR > 1 {
        delete f
        n = split($7, a, ";")
        # index each key=value field by its key (the text before "=")
        for (i = 1; i <= n; i++) {
            match(a[i], /^[^=]*/)
            f[ substr(a[i], RSTART, RLENGTH) ] = a[i]
        }
        print $1,$2,$3,$4,$5,$6,f["MAX"],f["MIN"],f["SUM"],f["SD"],f["IQR"],f["RANG"]
    }
'
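As for the xargs/parallel part of the question: a single awk pass over the whole stream is usually I/O-bound, so parallelizing rarely helps here. If the per-record work were genuinely CPU-bound, one approach is to split the decompressed stream into chunks and run one awk per chunk with xargs -P (the -P flag is not POSIX, but GNU and BSD xargs support it). A self-contained sketch using the fixed-order extraction from the first script; the demo file demo.tsv.gz, the 1-line chunk size, and the chunk_ prefix are placeholders for illustration, and each chunk writes its own output file so parallel writers never interleave:

```shell
#!/bin/bash
# demo input: two records in the question's format, gzipped to stand in
# for input.tsv.gz (header already stripped)
printf '%s\n' \
  $'chevrolet\t18\t8\t307\t130\t3504\tSLR=0.1;MIN=5;MAX=19;PR=0.008;SUM=27;SD=0.5;IQR=9.5;RANG=7.5' \
  $'buick\t15\t8\t350\t165\t3693\tSLR=0.7;MIN=7;MAX=17;PR=0.07;SUM=30;SD=2.5;IQR=7.5;RANG=9.5' |
gzip > demo.tsv.gz

# 1) split the decompressed stream into chunk files
#    (1 line per chunk for this demo; use e.g. -l 100000 on real data)
zcat demo.tsv.gz | split -l 1 - chunk_

# 2) the same fixed-order extraction as above, stored once
prog='BEGIN { OFS="\t" } { split($7, a, ";"); print $1,$2,$3,$4,$5,$6,a[3],a[2],a[5],a[6],a[7],a[8] }'

# 3) one awk per chunk, two at a time (-P 2); each chunk gets its own
#    .done output file
ls chunk_* | xargs -P 2 -I{} sh -c "awk '$prog' {} > {}.done"

# 4) collect the results (glob order is deterministic) and clean up
out=$(cat chunk_*.done)
printf '%s\n' "$out"
rm -f demo.tsv.gz chunk_*
```

Note that with chunking you trade simplicity for throughput: the chunk files must be cleaned up, and output order is only preserved because the chunks are concatenated in glob order at the end.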

Upvotes: 3
