Reputation: 79
How can I optimize the following script to be much faster for large files?
./Processed/Ranked_endpoints_*.csv: approximately 1,200 lines
Dates.csv: approximately 1 million lines
Working shell script (I think? It's still running...):
for RFGAUGE in ./Processed/Ranked_endpoints_*.csv
do
    echo "Processing: $RFGAUGE"
    mkIndex="$(basename "$RFGAUGE")"
    echo "$mkIndex"

    ## Duplicate the dates file to be able to edit it in place later on
    cp Dates.csv "Timed_${mkIndex%.*}.csv"

    ## Remove header
    tail -n +2 "$RFGAUGE" > "noHead_${mkIndex%.*}.csv"

    ## The slow part: go line by line, find if the time matches an event, and copy to additional columns
    while read LINE
    do
        ## Assign variables from columns and replace , with space
        read RankDUR RankRF RankPEAK RankAVG END START <<< "$(echo "$LINE" | cut -d, -f9,10,11,13,14,15 | sed 's/,/ /g')"

        ## Tried using sed and line numbers, failed
        #STARTLINE=$(grep -nr "$START" Dates.csv | cut -d: -f1)
        #ENDLINE=$(grep -nr "$END" Dates.csv | cut -d: -f1)

        ## gawk only, so the file can be edited in place
        ## Assigning awk variables from shell variables
        gawk -i inplace -v start="$START" -v end="$END" -v rankdur="$RankDUR" -v rankrf="$RankRF" -v rankpeak="$RankPEAK" -v rankavg="$RankAVG" \
            'BEGIN{FS=OFS=","} {if ($2>=start && $2<=end) print $0,rankdur,rankrf,rankpeak,rankavg; else print $0}' "Timed_${mkIndex%.*}.csv"
    done < "noHead_${mkIndex%.*}.csv"

    rm "noHead_${mkIndex%.*}.csv"
done
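Note on where the time goes: every event line spawns cut, sed and a gawk -i inplace pass that rewrites the whole million-line Timed_ copy, so one gauge alone means roughly 1,200 full rewrites of that file. One small, incremental improvement (independent of any larger rewrite) is to let read split the CSV itself instead of spawning cut and sed per line. A minimal sketch, assuming the 15-column layout shown further down (columns 9-11 are the ranks, 13 the average rank, 14/15 the end/start epochs):
## Sketch only: relies on the 15-column layout of Ranked_endpoints_*.csv
while IFS=, read -r _ _ _ _ _ _ _ _ RankDUR RankRF RankPEAK _ RankAVG END START
do
    echo "event $START..$END ranks: $RankDUR,$RankRF,$RankPEAK,$RankAVG"
done < "noHead_${mkIndex%.*}.csv"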
I am trying to rank the most severe rainfall events based on a few gauges. The problem with the data is that the rainfall events don't start/stop at exactly the same times and are usually a few hours offset relative to each other.
I already wrote a script that went through multiple years of data per gauge, extracted what one could call an "event", and then ranked different parameters of that event. Example of what I currently have:
./Processed/Ranked_endpoints_*.csv
Date,D,M,Y,WOY, Duration (h),Total RF (mm),Max RF (mm),Rank Duration,Rank Total RF,Rank Max RF,AVG Rank,Rank AVG,EndTime EPOCH, StartTime EPOCH
04/12/2010 05:15:00,4,11,2010,48,7.0,22.599999999999994,8.2,71,39,12,40.6667,1,1291439700,1291414500
17/12/2004 08:00:00,17,11,2004,50,6.5,32.6,5.0,89,12,40,47,2,1103270400,1103247000
25/08/2010 18:00:00,25,7,2010,34,6.5,28.6,4.8,83,20,46,49.6667,3,1282759200,1282735800
...
Important columns in the CSV above are: columns 9 to 11 (Rank Duration, Rank Total RF, Rank Max RF), column 13 (Rank AVG), and columns 14 and 15 (end/start times as seconds since the epoch).
I also created a 15-minute date/time CSV containing dates and time since epoch; this is similar to the format I used to extract the "event" data:
Dates.csv
...
03/12/2010 21:45:00,1291412700
03/12/2010 22:00:00,1291413600
03/12/2010 22:15:00,1291414500
03/12/2010 22:30:00,1291415400
03/12/2010 22:45:00,1291416300
03/12/2010 23:00:00,1291417200
03/12/2010 23:15:00,1291418100
03/12/2010 23:30:00,1291419000
03/12/2010 23:45:00,1291419900
04/12/2010 00:00:00,1291420800
04/12/2010 00:15:00,1291421700
04/12/2010 00:30:00,1291422600
04/12/2010 00:45:00,1291423500
04/12/2010 01:00:00,1291424400
04/12/2010 01:15:00,1291425300
04/12/2010 01:30:00,1291426200
04/12/2010 01:45:00,1291427100
04/12/2010 02:00:00,1291428000
04/12/2010 02:15:00,1291428900
04/12/2010 02:30:00,1291429800
04/12/2010 02:45:00,1291430700
04/12/2010 03:00:00,1291431600
04/12/2010 03:15:00,1291432500
04/12/2010 03:30:00,1291433400
04/12/2010 03:45:00,1291434300
04/12/2010 04:00:00,1291435200
04/12/2010 04:15:00,1291436100
04/12/2010 04:30:00,1291437000
04/12/2010 04:45:00,1291437900
04/12/2010 05:00:00,1291438800
04/12/2010 05:15:00,1291439700
04/12/2010 05:30:00,1291440600
...
Considering I have about 20 years of 15-minute data per gauge, and possibly a large number of gauges, what would be the best way to transfer columns 9, 10, 11 and 13 to Dates.csv if the time matches one of the "events"? The current script above doesn't join the different gauges into one CSV, but that's easy to do with cut/paste.
So the final output would be something like the following, assuming that rain hit gauge 2 an hour after gauge 1 and lasted one hour less:
03/12/2010 22:00:00,1291413600
03/12/2010 22:15:00,1291414500 ,71,39,12,1
03/12/2010 22:30:00,1291415400 ,71,39,12,1
03/12/2010 22:45:00,1291416300 ,71,39,12,1
03/12/2010 23:00:00,1291417200 ,71,39,12,1
03/12/2010 23:15:00,1291418100 ,71,39,12,1,13,25,35,4
03/12/2010 23:30:00,1291419000 ,71,39,12,1,13,25,35,4
...
04/12/2010 05:00:00,1291438800 ,71,39,12,1,13,25,35,4
04/12/2010 05:15:00,1291439700 ,71,39,12,1,13,25,35,4
04/12/2010 05:30:00,1291440600
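For the gauge-joining step mentioned above, a rough sketch with paste (file names are hypothetical, and it assumes every Timed_ file was built from the same Dates.csv so the rows line up one-to-one; cut -f3- strips the repeated date/epoch columns from the second gauge, which leaves a trailing comma on rows where that gauge has no event, so the result is close to, but not byte-identical to, the sample above):
paste -d, Timed_Ranked_endpoints_g1.csv <(cut -d, -f3- Timed_Ranked_endpoints_g2.csv) > Timed_combined.csv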
Upvotes: 4
Views: 740
Reputation: 203229
It sounds like what you might want to do is run this (using GNU awk for true multi-dimensional arrays and sorted_in):
$ cat tst.awk
BEGIN { FS=OFS="," }
FNR == 1 { next }    # skip the header line of every input file
{
    ranks    = $9 OFS $10 OFS $11 OFS $13
    endEpoch = $14
    begEpoch = $15
    # register this event's ranks against every 15-minute epoch it spans
    for ( epoch=begEpoch; epoch<=endEpoch; epoch+=(15*60) ) {
        epoch2ranks[epoch][++numRanks[epoch]] = ranks
    }
}
END {
    PROCINFO["sorted_in"] = "@ind_num_asc"
    for ( epoch in epoch2ranks ) {
        printf "%s", epoch
        for ( rankNr in epoch2ranks[epoch] ) {
            ranks = epoch2ranks[epoch][rankNr]
            printf "%s%s", OFS, ranks
        }
        print ""
    }
}
which you'd run as:
$ awk -f tst.awk Ranked_endpoints_*.csv
and then use the UNIX tool join to join its output with Dates.csv.
FWIW given the input you provided in your question:
$ cat file
Date,D,M,Y,WOY, Duration (h),Total RF (mm),Max RF (mm),Rank Duration,Rank Total RF,Rank Max RF,AVG Rank,Rank AVG,EndTime EPOCH, StartTime EPOCH
04/12/2010 05:15:00,4,11,2010,48,7.0,22.599999999999994,8.2,71,39,12,40.6667,1,1291439700,1291414500
17/12/2004 08:00:00,17,11,2004,50,6.5,32.6,5.0,89,12,40,47,2,1103270400,1103247000
25/08/2010 18:00:00,25,7,2010,34,6.5,28.6,4.8,83,20,46,49.6667,3,1282759200,1282735800
It'll produce this output:
$ awk -f tst.awk file
1103247000,89,12,40,2
1103247900,89,12,40,2
1103248800,89,12,40,2
1103249700,89,12,40,2
1103250600,89,12,40,2
1103251500,89,12,40,2
1103252400,89,12,40,2
1103253300,89,12,40,2
1103254200,89,12,40,2
1103255100,89,12,40,2
1103256000,89,12,40,2
1103256900,89,12,40,2
1103257800,89,12,40,2
1103258700,89,12,40,2
1103259600,89,12,40,2
1103260500,89,12,40,2
1103261400,89,12,40,2
1103262300,89,12,40,2
1103263200,89,12,40,2
1103264100,89,12,40,2
1103265000,89,12,40,2
1103265900,89,12,40,2
1103266800,89,12,40,2
1103267700,89,12,40,2
1103268600,89,12,40,2
1103269500,89,12,40,2
1103270400,89,12,40,2
1282735800,83,20,46,3
1282736700,83,20,46,3
1282737600,83,20,46,3
1282738500,83,20,46,3
1282739400,83,20,46,3
1282740300,83,20,46,3
1282741200,83,20,46,3
1282742100,83,20,46,3
1282743000,83,20,46,3
1282743900,83,20,46,3
1282744800,83,20,46,3
1282745700,83,20,46,3
1282746600,83,20,46,3
1282747500,83,20,46,3
1282748400,83,20,46,3
1282749300,83,20,46,3
1282750200,83,20,46,3
1282751100,83,20,46,3
1282752000,83,20,46,3
1282752900,83,20,46,3
1282753800,83,20,46,3
1282754700,83,20,46,3
1282755600,83,20,46,3
1282756500,83,20,46,3
1282757400,83,20,46,3
1282758300,83,20,46,3
1282759200,83,20,46,3
1291414500,71,39,12,1
1291415400,71,39,12,1
1291416300,71,39,12,1
1291417200,71,39,12,1
1291418100,71,39,12,1
1291419000,71,39,12,1
1291419900,71,39,12,1
1291420800,71,39,12,1
1291421700,71,39,12,1
1291422600,71,39,12,1
1291423500,71,39,12,1
1291424400,71,39,12,1
1291425300,71,39,12,1
1291426200,71,39,12,1
1291427100,71,39,12,1
1291428000,71,39,12,1
1291428900,71,39,12,1
1291429800,71,39,12,1
1291430700,71,39,12,1
1291431600,71,39,12,1
1291432500,71,39,12,1
1291433400,71,39,12,1
1291434300,71,39,12,1
1291435200,71,39,12,1
1291436100,71,39,12,1
1291437000,71,39,12,1
1291437900,71,39,12,1
1291438800,71,39,12,1
1291439700,71,39,12,1
but I don't know if that's what you want or not, since the sample output in your question doesn't seem to match up with the sample input. If it is, then you'd just run join
using the 2nd field from Dates.csv and the first field from the above output as the fields to match on, with comma as the field separator.
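A minimal sketch of that join step (file names are hypothetical; it assumes every epoch string has the same number of digits, so join's lexicographic ordering matches the numeric order both files are already in, and -a 1 keeps the Dates.csv rows that have no matching event):
$ awk -f tst.awk Ranked_endpoints_*.csv > ranks_by_epoch.csv
$ join -t, -1 2 -2 1 -a 1 Dates.csv ranks_by_epoch.csv > Timed_all.csv
Each matched output line then looks like 1291414500,03/12/2010 22:15:00,71,39,12,1, i.e. join prints the join field (the epoch) first; a final awk pass can swap the first two fields if the date needs to come first.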
Upvotes: 3