Reputation: 1813
I'm working on looping over hundreds of thousands of CSV files to generate more files from them. The requirement is to extract previous 1 month, 3 month, month, 1 year & 2 years of data from every file & generate new files from them.
I've written the below script which gets the job done but is super slow. This script will need to be run quite frequently which makes my life cumbersome. Is there a better way to achieve the outcome I'm after or possibly enhance the performance of this script please?
for k in *.csv; do
sed -n '/'"$(date -d "2 year ago" '+%Y-%m')"'/,$p' ${k} > temp_data_store/${k}.2years.csv
sed -n '/'"$(date -d "1 year ago" '+%Y-%m')"'/,$p' ${k} > temp_data_store/${k}.1year.csv
sed -n '/'"$(date -d "6 month ago" '+%Y-%m')"'/,$p' ${k} > temp_data_store/${k}.6months.csv
sed -n '/'"$(date -d "3 month ago" '+%Y-%m')"'/,$p' ${k} > temp_data_store/${k}.3months.csv
sed -n '/'"$(date -d "1 month ago" '+%Y-%m')"'/,$p' ${k} > temp_data_store/${k}.1month.csv
done
Upvotes: 1
Views: 480
Reputation: 35556
Current performance-related issues:
date
5x times (necessary) for each input file (unnecessary) => make the 5x date
calls prior to the for k in *.csv
loop [NOTE: the overhead for repeated date
calls will pale in comparison to the repeated reads of the input files]Potential operational issue:
sed
is not designed for doing comparisons of data (eg, look for a string that is >=
a search pattern); consider an input file like such:
$ cat input.csv
2021-01-25
2021-03-01
If 'today' is 2021-03-14
then for the 1month
dataset the current sed
solution is:
sed '/2012-02/,$p'
But because there are no entries for 2012-02
the sed
command returns 0 rows, even though we should see the row for 2021-03-01
.
Granted, for this particular question we're looking for dates based on the month, and the application likely generated at least one row on a monthly basis, so this issue likely won't be an issue but, we need to be aware of this issue in general.
Anyhoo, back to the question at hand ...
Assumptions:
YYYY-MM-...
.csv
Sample input:
$ cat input.csv
2019-09-01,line 1
2019-10-01,line 2
2019-10-03,line 3
2019-12-01,line 4
2020-05-01,line 5
2020-10-01,line 6
2020-10-03,line 7
2020-12-01,line 8
2021-03-01,line 9
2021-04-01,line 10
2021-05-01,line 11
2021-07-01,line 12
2021-09-01,line 13
2021-09-01,line 14
2021-10-11,line 15
2021-10-12,line 16
We only need to do the date calculations once so we'll do this in bash
and prior to OP's for k in *.csv
loop:
# date as of writing this answer: 2021-10-12
$ yr2=$(date -d "2 year ago" '+%Y-%m')
$ yr1=$(date -d "1 year ago" '+%Y-%m')
$ mon6=$(date -d "6 month ago" '+%Y-%m')
$ mon3=$(date -d "3 month ago" '+%Y-%m')
$ mon1=$(date -d "1 month ago" '+%Y-%m')
$ typeset -p yr2 yr1 mon6 mon3 mon1
declare -- yr2="2019-10"
declare -- yr1="2020-10"
declare -- mon6="2021-04"
declare -- mon3="2021-07"
declare -- mon1="2021-09"
One awk
idea (replaces all of the sed
calls in OP's current for k in *.csv
loop):
# determine prefix to be used for output files ...
$ k=input.csv
$ prefix="${k//.csv/}"
$ echo "${prefix}"
input
awk -v yr2="${yr2}" \
-v yr1="${yr1}" \
-v mon6="${mon6}" \
-v mon3="${mon3}" \
-v mon1="${mon1}" \
-v prefix="${prefix}" \
-F ',' ' # define input field delimiter as comma
{ split($1,arr,"-") # date to be compared is in field #1
testdate=arr[1] "-" arr[2]
if ( testdate >= yr2 ) print $0 > prefix".2years.csv"
if ( testdate >= yr1 ) print $0 > prefix".1year.csv"
if ( testdate >= mon6 ) print $0 > prefix".6months.csv"
if ( testdate >= mon3 ) print $0 > prefix".3months.csv"
if ( testdate >= mon1 ) print $0 > prefix".1month.csv"
}
' "${k}"
NOTE: awk
can dyncially process the input filename to determine the filename prefix (see FILENAME
variable) but would still need to know the target directory name (assuming writing to a different directory from where the input file resides)
This generates the following files:
for f in "${prefix}".*.csv
do
echo "############# ${f}"
cat "${f}"
echo ""
done
############# input.2years.csv
2019-10-01,line 2
2019-10-03,line 3
2019-12-01,line 4
2020-05-01,line 5
2020-10-01,line 6
2020-10-03,line 7
2020-12-01,line 8
2021-03-01,line 9
2021-04-01,line 10
2021-05-01,line 11
2021-07-01,line 12
2021-09-01,line 13
2021-09-01,line 14
2021-10-11,line 15
2021-10-12,line 16
############# input.1year.csv
2020-10-01,line 6
2020-10-03,line 7
2020-12-01,line 8
2021-03-01,line 9
2021-04-01,line 10
2021-05-01,line 11
2021-07-01,line 12
2021-09-01,line 13
2021-09-01,line 14
2021-10-11,line 15
2021-10-12,line 16
############# input.6months.csv
2021-04-01,line 10
2021-05-01,line 11
2021-07-01,line 12
2021-09-01,line 13
2021-09-01,line 14
2021-10-11,line 15
2021-10-12,line 16
############# input.3months.csv
2021-07-01,line 12
2021-09-01,line 13
2021-09-01,line 14
2021-10-11,line 15
2021-10-12,line 16
############# input.1month.csv
2021-09-01,line 13
2021-09-01,line 14
2021-10-11,line 15
2021-10-12,line 16
Additional performance improvements:
awk
solutionawk
code could be placed in a bash
function and then called via <function_name> <input_file> &
; this can be done via bash
loop controls, parallel
, xargs
, etcUpvotes: 2
Reputation: 33748
Something like this may work if you have GNU Parallel (version >= 20191022):
do_one() {
cat "$1" | parallel --pipe --tee sed -n '/$(date -d {}" ago" '+%Y-%m')/,\$p' "> temp_data_store/$1.{}.csv" \
::: "2 year" "1 year" "6 month" "3 month" "1 month"
}
export -f do_one
parallel -j1 do_one ::: *.csv
If your disks are fast: Remove -j1
.
Upvotes: 1
Reputation: 23881
You read each CSV five times. It would be better to read each CSV only once.
You extract the same data multiple times. All but one parts are subsets of the others.
This means every line in "2years.csv" is also in "1year.csv". So it will be sufficient to extract "2years.csv" from "1year.csv". You can cascade the different searches with tee
.
The following assumes, that the contents of your files is ordered chronologically. (I simplified the quoting a bit)
sed -n "/$(date -d '1 month ago' '+%Y-%m')/,\$p" "${k}" |
tee temp_data_store/${k}.1month.csv |
sed -n "/$(date -d '3 month ago' '+%Y-%m')/,\$p" |
tee temp_data_store/${k}.3months.csv |
sed -n "/$(date -d '6 month ago' '+%Y-%m')/,\$p" |
tee temp_data_store/${k}.6months.csv |
sed -n "/$(date -d '1 year ago' '+%Y-%m')/,\$p" |
tee temp_data_store/${k}.1year.csv |
sed -n "/$(date -d '2 year ago' '+%Y-%m')/,\$p" > temp_data_store/${k}.2years.csv
Upvotes: 2