usert4jju7

Reputation: 1813

Improve performance of bash script

I'm working on looping over hundreds of thousands of CSV files to generate more files from them. The requirement is to extract the previous 1 month, 3 months, 6 months, 1 year & 2 years of data from every file & generate new files from them.

I've written the below script, which gets the job done but is super slow. This script will need to be run quite frequently, which makes the slowness a real problem. Is there a better way to achieve the outcome I'm after, or a way to improve the performance of this script, please?

for k in *.csv; do
    sed -n '/'"$(date -d "2 year ago" '+%Y-%m')"'/,$p' "${k}" > "temp_data_store/${k}.2years.csv"
    sed -n '/'"$(date -d "1 year ago" '+%Y-%m')"'/,$p' "${k}" > "temp_data_store/${k}.1year.csv"
    sed -n '/'"$(date -d "6 month ago" '+%Y-%m')"'/,$p' "${k}" > "temp_data_store/${k}.6months.csv"
    sed -n '/'"$(date -d "3 month ago" '+%Y-%m')"'/,$p' "${k}" > "temp_data_store/${k}.3months.csv"
    sed -n '/'"$(date -d "1 month ago" '+%Y-%m')"'/,$p' "${k}" > "temp_data_store/${k}.1month.csv"
done

Upvotes: 1

Views: 480

Answers (3)

markp-fuso

Reputation: 35556

Current performance-related issues:

  • reading each input file 5 times => we want to limit this to a single read per input file
  • calling date 5 times is necessary, but repeating those calls for every input file is not => make the 5 date calls once, prior to the for k in *.csv loop [NOTE: the overhead of the repeated date calls will pale in comparison to the repeated reads of the input files]

Potential operational issue:

sed is not designed for comparing data (eg, looking for a string that is >= a search pattern); consider an input file such as this:

$ cat input.csv
2021-01-25
2021-03-01

If 'today' is 2021-03-14 then for the 1month dataset the current sed solution is:

sed -n '/2021-02/,$p'

But because there are no entries for 2021-02, the sed command returns 0 rows, even though we should see the row for 2021-03-01.

Granted, for this particular question we're matching on the month only, and the application likely generates at least one row every month, so this probably won't bite here, but it's something to be aware of in general.
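By contrast, a string comparison (the approach the awk solution below takes) does return the 2021-03-01 row even though no 2021-02 row exists; a quick demonstration on the same input:

$ awk -F',' -v cutoff="2021-02" '{ split($1,a,"-"); if (a[1] "-" a[2] >= cutoff) print }' input.csv
2021-03-01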

Anyhoo, back to the question at hand ...


Assumptions:

  • input files are comma-delimited (otherwise need to adjust the proposed solution)
  • the date to be tested is of the format YYYY-MM-...
  • the date to be tested is the 1st field of the comma-delimited input file (otherwise need to adjust the proposed solution)
  • output filename prefix is the input filename sans the .csv

Sample input:

$ cat input.csv
2019-09-01,line 1
2019-10-01,line 2
2019-10-03,line 3
2019-12-01,line 4
2020-05-01,line 5
2020-10-01,line 6
2020-10-03,line 7
2020-12-01,line 8
2021-03-01,line 9
2021-04-01,line 10
2021-05-01,line 11
2021-07-01,line 12
2021-09-01,line 13
2021-09-01,line 14
2021-10-11,line 15
2021-10-12,line 16

We only need to do the date calculations once, so we'll do this in bash, prior to OP's for k in *.csv loop:

# date as of writing this answer: 2021-10-12

$ yr2=$(date -d "2 year ago" '+%Y-%m')
$ yr1=$(date -d "1 year ago" '+%Y-%m')
$ mon6=$(date -d "6 month ago" '+%Y-%m')
$ mon3=$(date -d "3 month ago" '+%Y-%m')
$ mon1=$(date -d "1 month ago" '+%Y-%m')

$ typeset -p yr2 yr1 mon6 mon3 mon1
declare -- yr2="2019-10"
declare -- yr1="2020-10"
declare -- mon6="2021-04"
declare -- mon3="2021-07"
declare -- mon1="2021-09"

One awk idea (replaces all of the sed calls in OP's current for k in *.csv loop):

# determine prefix to be used for output files ...

$ k=input.csv

$ prefix="${k%.csv}"        # strip only the trailing .csv
$ echo "${prefix}"
input

awk -v yr2="${yr2}"       \
    -v yr1="${yr1}"       \
    -v mon6="${mon6}"     \
    -v mon3="${mon3}"     \
    -v mon1="${mon1}"     \
    -v prefix="${prefix}" \
    -F ',' '                                # define input field delimiter as comma

{ split($1,arr,"-")                         # date to be compared is in field #1
  testdate=arr[1] "-" arr[2]

  if ( testdate >= yr2  ) print $0 > prefix".2years.csv"
  if ( testdate >= yr1  ) print $0 > prefix".1year.csv"
  if ( testdate >= mon6 ) print $0 > prefix".6months.csv"
  if ( testdate >= mon3 ) print $0 > prefix".3months.csv"
  if ( testdate >= mon1 ) print $0 > prefix".1month.csv"
}
' "${k}"

NOTE: awk can dynamically derive the output filename prefix from the input filename (see awk's FILENAME variable) but it would still need to be told the target directory name (assuming we're writing to a different directory from where the input file resides)
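To illustrate, here's a rough (untested) sketch of that idea; it assumes the same yr2/yr1/mon6/mon3/mon1 variables from above and hardcodes temp_data_store as the target directory; it would also let a single awk invocation process all of the input files, eliminating the bash loop entirely:

awk -v yr2="${yr2}" -v yr1="${yr1}" -v mon6="${mon6}" \
    -v mon3="${mon3}" -v mon1="${mon1}" \
    -v dir="temp_data_store" \
    -F ',' '

FNR == 1 { prefix = FILENAME              # starting a new input file ...
           sub(/\.csv$/, "", prefix)      # ... so re-derive the output prefix
         }

         { split($1,arr,"-")
           testdate=arr[1] "-" arr[2]

           if ( testdate >= yr2  ) print $0 > (dir "/" prefix ".2years.csv")
           if ( testdate >= yr1  ) print $0 > (dir "/" prefix ".1year.csv")
           if ( testdate >= mon6 ) print $0 > (dir "/" prefix ".6months.csv")
           if ( testdate >= mon3 ) print $0 > (dir "/" prefix ".3months.csv")
           if ( testdate >= mon1 ) print $0 > (dir "/" prefix ".1month.csv")
         }
' *.csv

With hundreds of thousands of input files you would also want to close() the five output files whenever FILENAME changes, otherwise awk keeps accumulating open file descriptors.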

Run against the sample input, the main awk solution generates the following files:

for f in "${prefix}".*.csv
do
    echo "############# ${f}"
    cat "${f}"
    echo ""
done

############# input.2years.csv
2019-10-01,line 2
2019-10-03,line 3
2019-12-01,line 4
2020-05-01,line 5
2020-10-01,line 6
2020-10-03,line 7
2020-12-01,line 8
2021-03-01,line 9
2021-04-01,line 10
2021-05-01,line 11
2021-07-01,line 12
2021-09-01,line 13
2021-09-01,line 14
2021-10-11,line 15
2021-10-12,line 16

############# input.1year.csv
2020-10-01,line 6
2020-10-03,line 7
2020-12-01,line 8
2021-03-01,line 9
2021-04-01,line 10
2021-05-01,line 11
2021-07-01,line 12
2021-09-01,line 13
2021-09-01,line 14
2021-10-11,line 15
2021-10-12,line 16

############# input.6months.csv
2021-04-01,line 10
2021-05-01,line 11
2021-07-01,line 12
2021-09-01,line 13
2021-09-01,line 14
2021-10-11,line 15
2021-10-12,line 16

############# input.3months.csv
2021-07-01,line 12
2021-09-01,line 13
2021-09-01,line 14
2021-10-11,line 15
2021-10-12,line 16

############# input.1month.csv
2021-09-01,line 13
2021-09-01,line 14
2021-10-11,line 15
2021-10-12,line 16

Additional performance improvements:

  • [especially for largish files] read from one filesystem, write to a 2nd/different filesystem; even better would be a separate filesystem for each of the 5 different output files - this would require a minor tweak to the awk solution
  • process X number of input files in parallel, eg, the awk code could be placed in a bash function and then called via <function_name> <input_file> &; this can be done via bash loop controls, parallel, xargs, etc (see the sketch after this list)
  • if running parallel operations, you will need to limit the number of concurrent jobs based primarily on disk subsystem throughput, ie, how many concurrent reads/writes the disk subsystem can handle before slowing down due to read/write contention
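One possible shape for the xargs variant (a sketch only, untested at scale; the function name do_file and the -P4 job count are arbitrary choices to be tuned):

#!/usr/bin/env bash

# compute the cutoff months once and export them for the child shells
yr2=$(date -d '2 year ago' '+%Y-%m');   yr1=$(date -d '1 year ago' '+%Y-%m')
mon6=$(date -d '6 month ago' '+%Y-%m'); mon3=$(date -d '3 month ago' '+%Y-%m')
mon1=$(date -d '1 month ago' '+%Y-%m')
export yr2 yr1 mon6 mon3 mon1

do_file() {                                  # wrapper around the awk solution
    local prefix="temp_data_store/${1%.csv}"
    awk -v yr2="$yr2" -v yr1="$yr1" -v mon6="$mon6" \
        -v mon3="$mon3" -v mon1="$mon1" -v prefix="$prefix" -F ',' '
    { split($1,arr,"-"); testdate=arr[1] "-" arr[2]
      if ( testdate >= yr2  ) print $0 > (prefix ".2years.csv")
      if ( testdate >= yr1  ) print $0 > (prefix ".1year.csv")
      if ( testdate >= mon6 ) print $0 > (prefix ".6months.csv")
      if ( testdate >= mon3 ) print $0 > (prefix ".3months.csv")
      if ( testdate >= mon1 ) print $0 > (prefix ".1month.csv")
    }' "$1"
}
export -f do_file

# run up to 4 files at a time; tune -P to what the disk subsystem can sustain
printf '%s\0' *.csv | xargs -0 -n1 -P4 bash -c 'do_file "$1"' _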

Upvotes: 2

Ole Tange

Reputation: 33748

Something like this may work if you have GNU Parallel (version >= 20191022):

do_one() {
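  # --pipe --tee feeds a full copy of the input to each of the 5 jobs;
  # {} is substituted (shell-quoted) by parallel, so $(date ...) and \$p
  # are only expanded in the shell that parallel spawns for each job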
  cat "$1" | parallel --pipe --tee sed -n '/$(date -d {}" ago" '+%Y-%m')/,\$p' "> temp_data_store/$1.{}.csv" \
    ::: "2 year" "1 year" "6 month" "3 month" "1 month"
}
export -f do_one
parallel -j1 do_one ::: *.csv

If your disks are fast: Remove -j1.

Upvotes: 1

ceving

Reputation: 23881

You read each CSV five times. It would be better to read each CSV only once.

You extract the same data multiple times. Apart from the largest one, each extract is a subset of another:

  • The 1 month extract is a subset of the 3 months, 6 months, 1 year and 2 years extracts.
  • The 3 months extract is a subset of the 6 months, 1 year and 2 years extracts.
  • The 6 months extract is a subset of the 1 year and 2 years extracts.
  • The 1 year extract is a subset of the 2 years extract.

This means every line in "1year.csv" is also in "2years.csv", so it is sufficient to extract "1year.csv" from "2years.csv" rather than from the original file. You can cascade the searches, from the largest range down to the smallest, with tee.

The following assumes that the contents of your files are ordered chronologically; it replaces the five sed calls inside your for k in *.csv loop. (I simplified the quoting a bit.)

sed -n "/$(date -d '1 month ago' '+%Y-%m')/,\$p" "${k}" |
tee temp_data_store/${k}.1month.csv |
sed -n "/$(date -d '3 month ago' '+%Y-%m')/,\$p" |
tee temp_data_store/${k}.3months.csv |
sed -n "/$(date -d '6 month ago' '+%Y-%m')/,\$p" |
tee temp_data_store/${k}.6months.csv |
sed -n "/$(date -d '1 year ago' '+%Y-%m')/,\$p" |
tee temp_data_store/${k}.1year.csv |
sed -n "/$(date -d '2 year ago' '+%Y-%m')/,\$p" > temp_data_store/${k}.2years.csv

Upvotes: 2
