Pavel D.

Reputation: 79

Optimize an awk script for large files

TLDR Version

How can I optimize the following script to run much faster on large files?

Working shell script (I think? it's still running...):

for RFGAUGE in ./Processed/Ranked_endpoints_*.csv
do
    echo "Processing: $RFGAUGE"
    mkIndex="$(basename $RFGAUGE)"
    echo "$mkIndex"

    ##Duplicate the dates file to be able to edit it in place later on
    cp Dates.csv Timed_${mkIndex%.*}.csv

    ##Remove header
    tail -n +2 $RFGAUGE > noHead_${mkIndex%.*}.csv

    ##The slow part, go line by line and find if the time matches an event and copy to additional columns
    while read LINE
    do
        ##Assign variables from columns and replace , with space
        read RankDUR RankRF RankPEAK RankAVG END START <<< `echo $LINE | cut -d, -f9,10,11,13,14,15 | sed 's/,/ /g'`

        ##Tried using sed and line numbers, failed 
        #STARTLINE=`grep -nr $START Dates.csv | cut -d: -f1`
        #ENDLINE=`grep -nr $END Dates.csv | cut -d: -f1`

        ##Gawk only so can edit file in place
        ##Assigning AWK variables from UNIX variables
        gawk -i inplace -v start="$START" -v end="$END" -v rankdur="$RankDUR" -v rankrf="$RankRF" -v rankpeak="$RankPEAK" -v rankavg="$RankAVG" 'BEGIN{FS=OFS=","}{if($2>=start && $2<=end) print $0,rankdur,rankrf,rankpeak,rankavg; else print $0}' Timed_${mkIndex%.*}.csv

    done < noHead_${mkIndex%.*}.csv

    rm noHead_${mkIndex%.*}.csv

done
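
For reference, the echo | cut | sed pipeline run for every line can be replaced by letting read split on commas directly; a minimal sketch of the inner loop follows, assuming bash and using placeholder variable names that follow the 15-column header shown further down. The dominant cost is still the gawk -i inplace call, which rewrites the whole Timed_ file once per event line:

##Sketch: split each line with read instead of spawning cut and sed
while IFS=, read -r Date D M Y WOY Dur TotRF MaxRF RankDUR RankRF RankPEAK AvgRank RankAVG END START
do
    gawk -i inplace -v start="$START" -v end="$END" \
        -v rankdur="$RankDUR" -v rankrf="$RankRF" -v rankpeak="$RankPEAK" -v rankavg="$RankAVG" \
        'BEGIN{FS=OFS=","}{if($2>=start && $2<=end) print $0,rankdur,rankrf,rankpeak,rankavg; else print $0}' Timed_${mkIndex%.*}.csv
done < noHead_${mkIndex%.*}.csv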

Long version

I am trying to rank the most severe rainfall events based on a few gauges. The problem with the data is that the rainfall events don't start/stop at exactly the same times and are usually a few hours offset relative to each other.

I already wrote a script that takes multiple years of data per gauge, extracts what one could call an "event", and then ranks different parameters of that event. Example of what I currently have:

./Processed/Ranked_endpoints_*.csv

Date,D,M,Y,WOY, Duration (h),Total RF (mm),Max RF (mm),Rank Duration,Rank Total RF,Rank Max RF,AVG Rank,Rank AVG,EndTime EPOCH, StartTime EPOCH
04/12/2010 05:15:00,4,11,2010,48,7.0,22.599999999999994,8.2,71,39,12,40.6667,1,1291439700,1291414500
17/12/2004 08:00:00,17,11,2004,50,6.5,32.6,5.0,89,12,40,47,2,1103270400,1103247000
25/08/2010 18:00:00,25,7,2010,34,6.5,28.6,4.8,83,20,46,49.6667,3,1282759200,1282735800
...

Important columns in the CSV above are:

Columns 9-11: Rank Duration, Rank Total RF, Rank Max RF
Column 13: Rank AVG
Columns 14-15: EndTime EPOCH and StartTime EPOCH

I also created a 15-minute date/time CSV containing dates and time since epoch; this is similar to the format I used to extract the "event" data:

Dates.csv

...
03/12/2010 21:45:00,1291412700 
03/12/2010 22:00:00,1291413600 
03/12/2010 22:15:00,1291414500 
03/12/2010 22:30:00,1291415400 
03/12/2010 22:45:00,1291416300 
03/12/2010 23:00:00,1291417200 
03/12/2010 23:15:00,1291418100 
03/12/2010 23:30:00,1291419000 
03/12/2010 23:45:00,1291419900 
04/12/2010 00:00:00,1291420800 
04/12/2010 00:15:00,1291421700 
04/12/2010 00:30:00,1291422600 
04/12/2010 00:45:00,1291423500 
04/12/2010 01:00:00,1291424400 
04/12/2010 01:15:00,1291425300 
04/12/2010 01:30:00,1291426200 
04/12/2010 01:45:00,1291427100 
04/12/2010 02:00:00,1291428000 
04/12/2010 02:15:00,1291428900 
04/12/2010 02:30:00,1291429800 
04/12/2010 02:45:00,1291430700 
04/12/2010 03:00:00,1291431600 
04/12/2010 03:15:00,1291432500 
04/12/2010 03:30:00,1291433400 
04/12/2010 03:45:00,1291434300 
04/12/2010 04:00:00,1291435200 
04/12/2010 04:15:00,1291436100 
04/12/2010 04:30:00,1291437000 
04/12/2010 04:45:00,1291437900 
04/12/2010 05:00:00,1291438800 
04/12/2010 05:15:00,1291439700 
04/12/2010 05:30:00,1291440600
...

Considering I have about 20 years of 15-minute data per gauge, and possibly a large number of gauges, what would be the best way to transfer columns 9, 10, 11 and 13 to Dates.csv if the time matches one of the "events"? The current script above doesn't join different gauges into one CSV, but that's easy to do with cut/paste.

So the final output would be something like the following, assuming that rain hit gauge 2 an hour after gauge 1 and lasted 1h less:

03/12/2010 22:00:00,1291413600
03/12/2010 22:15:00,1291414500 ,71,39,12,1
03/12/2010 22:30:00,1291415400 ,71,39,12,1
03/12/2010 22:45:00,1291416300 ,71,39,12,1
03/12/2010 23:00:00,1291417200 ,71,39,12,1
03/12/2010 23:15:00,1291418100 ,71,39,12,1,13,25,35,4
03/12/2010 23:30:00,1291419000 ,71,39,12,1,13,25,35,4
...
04/12/2010 05:00:00,1291438800 ,71,39,12,1,13,25,35,4
04/12/2010 05:15:00,1291439700 ,71,39,12,1,13,25,35,4
04/12/2010 05:30:00,1291440600

Upvotes: 4

Views: 740

Answers (1)

Ed Morton

Reputation: 203229

It sounds like what you might want to do is run this (using GNU awk for true multidimensional arrays and sorted_in):

$ cat tst.awk
BEGIN { FS=OFS="," }
FNR == 1 { next }
{
    ranks    = $9 OFS $10 OFS $11 OFS $13
    endEpoch = $14
    begEpoch = $15

    for ( epoch=begEpoch; epoch<=endEpoch; epoch+=(15*60) ) {
        epoch2ranks[epoch][++numRanks[epoch]] = ranks
    }
}
END {
    PROCINFO["sorted_in"] = "@ind_num_asc"
    for ( epoch in epoch2ranks ) {
        printf "%s", epoch
        for ( rankNr in epoch2ranks[epoch] ) {
            ranks = epoch2ranks[epoch][rankNr]
            printf "%s%s", OFS, ranks
        }
        print ""
    }
}

which you'd run as:

$ awk -f tst.awk Ranked_endpoints_*.csv

and then use the UNIX tool join to join its output with Dates.csv.
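
For example (the file names here are just placeholders; the sed strips the trailing spaces visible after the epochs in the Dates.csv sample so the join fields compare equal, and the sort makes sure the input is ordered on the epoch field as join requires):

$ awk -f tst.awk Ranked_endpoints_*.csv > ranks.csv
$ sed 's/ *$//' Dates.csv | sort -t, -k2,2 > dates_sorted.csv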

FWIW given the input you provided in your question:

$ cat file
Date,D,M,Y,WOY, Duration (h),Total RF (mm),Max RF (mm),Rank Duration,Rank Total RF,Rank Max RF,AVG Rank,Rank AVG,EndTime EPOCH, StartTime EPOCH
04/12/2010 05:15:00,4,11,2010,48,7.0,22.599999999999994,8.2,71,39,12,40.6667,1,1291439700,1291414500
17/12/2004 08:00:00,17,11,2004,50,6.5,32.6,5.0,89,12,40,47,2,1103270400,1103247000
25/08/2010 18:00:00,25,7,2010,34,6.5,28.6,4.8,83,20,46,49.6667,3,1282759200,1282735800

It'll produce this output:

$ awk -f tst.awk file
1103247000,89,12,40,2
1103247900,89,12,40,2
1103248800,89,12,40,2
1103249700,89,12,40,2
1103250600,89,12,40,2
1103251500,89,12,40,2
1103252400,89,12,40,2
1103253300,89,12,40,2
1103254200,89,12,40,2
1103255100,89,12,40,2
1103256000,89,12,40,2
1103256900,89,12,40,2
1103257800,89,12,40,2
1103258700,89,12,40,2
1103259600,89,12,40,2
1103260500,89,12,40,2
1103261400,89,12,40,2
1103262300,89,12,40,2
1103263200,89,12,40,2
1103264100,89,12,40,2
1103265000,89,12,40,2
1103265900,89,12,40,2
1103266800,89,12,40,2
1103267700,89,12,40,2
1103268600,89,12,40,2
1103269500,89,12,40,2
1103270400,89,12,40,2
1282735800,83,20,46,3
1282736700,83,20,46,3
1282737600,83,20,46,3
1282738500,83,20,46,3
1282739400,83,20,46,3
1282740300,83,20,46,3
1282741200,83,20,46,3
1282742100,83,20,46,3
1282743000,83,20,46,3
1282743900,83,20,46,3
1282744800,83,20,46,3
1282745700,83,20,46,3
1282746600,83,20,46,3
1282747500,83,20,46,3
1282748400,83,20,46,3
1282749300,83,20,46,3
1282750200,83,20,46,3
1282751100,83,20,46,3
1282752000,83,20,46,3
1282752900,83,20,46,3
1282753800,83,20,46,3
1282754700,83,20,46,3
1282755600,83,20,46,3
1282756500,83,20,46,3
1282757400,83,20,46,3
1282758300,83,20,46,3
1282759200,83,20,46,3
1291414500,71,39,12,1
1291415400,71,39,12,1
1291416300,71,39,12,1
1291417200,71,39,12,1
1291418100,71,39,12,1
1291419000,71,39,12,1
1291419900,71,39,12,1
1291420800,71,39,12,1
1291421700,71,39,12,1
1291422600,71,39,12,1
1291423500,71,39,12,1
1291424400,71,39,12,1
1291425300,71,39,12,1
1291426200,71,39,12,1
1291427100,71,39,12,1
1291428000,71,39,12,1
1291428900,71,39,12,1
1291429800,71,39,12,1
1291430700,71,39,12,1
1291431600,71,39,12,1
1291432500,71,39,12,1
1291433400,71,39,12,1
1291434300,71,39,12,1
1291435200,71,39,12,1
1291436100,71,39,12,1
1291437000,71,39,12,1
1291437900,71,39,12,1
1291438800,71,39,12,1
1291439700,71,39,12,1

but I don't know if that's what you want or not, since the sample output in your question doesn't seem to match up with the sample input. If it is, then you'd just run join using the 2nd field from Dates.csv and the first field of the output above as the fields to match on, with comma as the field separator.
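
A sketch of that join, using the placeholder ranks.csv and dates_sorted.csv files from the example above (-a 1 also keeps the Dates.csv rows that have no matching event; note that join prints the shared epoch field first, so use -o if you want the date column back in front):

$ join -t, -1 2 -2 1 -a 1 dates_sorted.csv ranks.csv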

Upvotes: 3
