Reputation: 33
I need to read data from a file and insert into multiple files (with each file less than 3mb in size, file sizes can be different). The important thing is - records for an Agent shouldn't be split across multiple files. I am doing all of this in a While loop in a UNIX bash script.
Input.csv
Src,AgentNum,PhoneNum
DWH,Agent_1234,phone1
NULL,NULL,phone2
NULL,NULL,phone3
DWH,Agent_5678,phone1
NULL,NULL,phone2
NULL,NULL,phone3
DWH,Agent_9999,phone1
NULL,NULL,phone2
NULL,NULL,phone3
Desired Output -
Output1.csv (less than 3MB)
Src,AgentNum,PhoneNum
DWH,Agent_1234,phone1
NULL,NULL,phone2
NULL,NULL,phone3
Output2.csv (less than 3MB)
Src,AgentNum,PhoneNum
DWH,Agent_5678,phone1
NULL,NULL,phone2
NULL,NULL,phone3
DWH,Agent_9999,phone1
NULL,NULL,phone2
NULL,NULL,phone3
Bash Shell Script
#!/bin/bash
BaseFileName=$(basename $FileName | cut -d. -f1)
Header=`head -1 $FileName`
MaxFileSize=$(( 3 * 1024 * 1024 ))
sed 1d $FileName |
while read -r line
do
echo $line >> ${BaseFileName}_${FileSeq}.csv
MatchCount=`echo $line | grep -c -E '^.DWH'`
if [[ $MatchCount -eq 1 ]]
then
FileSizeBytes=`du -b ${BaseFileName}_${FileSeq}.csv | cut -f1`
if [[ $FileSizeBytes -gt $MaxFileSize ]]
then
#Add a header record to each file
sed -i "1i ${Header}" ${BaseFileName}_${FileSeq}.csv
FileSeq=$((FileSeq + 1))
fi
fi
done
It is working almost fine except 1) It is not splitting the records as expected (some records for an Agent are split across multiple files) 2) It is inserting header record only for the first output file. 3) Too slow, it took 3 minutes for a 10MB file. In reality I have a 3GB file.
Can somebody please suggest me where I am doing it wrong. Is there any much better way to handle this?
Upvotes: 3
Views: 699
Reputation: 295629
Here's a rough attempt -- it's not as fast as a pure-awk
solution would be, but it's much, much faster than what you already had:
#!/bin/bash
# two external parameters: input file name, and max size in bytes (default to 3MB)
InputFile=$1
MaxFileSize=${2:-$(( 3 * 1024 * 1024 ))}
BaseName=${InputFile%.*} # strip extension
Ext=${InputFile##*.} # store extension
FileSeq=0 # start output file at sequence 0
# redirect stdin from the input file, stdout to the first output file
exec <"$InputFile" || exit
exec >"${BaseName}.${FileSeq}.${Ext}" || exit
# read the header; copy it to the first output file, and initialize CurFileSize
IFS= read -r Header || exit
printf '%s\n' "$Header" || exit
CurFileSize=$(( ${#Header} + 1 ))
# ...then loop over our inputs, and copy appropriately
while IFS= read -r line; do
if [[ $line = DWH,* ]] && (( CurFileSize > MaxFileSize )); then
(( FileSeq++ ))
exec >"${BaseName}.${FileSeq}.${Ext}" || exit
printf '%s\n' "$Header" || exit
CurFileSize=$(( ${#Header} + 1 ))
fi
printf '%s\n' "$line" || exit
(( CurFileSize += ${#line} + 1 ))
done
Noteworthy changes:
sed
, no basename
, no du
, no grep
. Any time you write $()
or ``
, there's a very nontrivial performance cost; these constructs shouldn't be used inside a tight loop unless impossible to avoid -- and when using ksh or bash extensions to the POSIX sh standard, it's rare that they are in fact impossible to avoid.>>"$filename"
every time we want to write a line, but instead use exec >"$filename"
every time we need to start a new output file.*
with a list of files in the current directory, for instance; replacing tabs with spaces; etc). When in doubt, quote more.printf '%s\n'
is better-defined by the POSIX standard than echo
-- see the standard definition for echo
, particularly the APPLICATION USAGE section.set -e
, but there are substantial caveats to its use.Test procedure and output follow:
$ cat >input.csv <<'EOF'
Src,AgentNum,PhoneNum
DWH,Agent_1234,phone1
NULL,NULL,phone2
NULL,NULL,phone3
DWH,Agent_5678,phone1
NULL,NULL,phone2
NULL,NULL,phone3
DWH,Agent_9999,phone1
NULL,NULL,phone2
NULL,NULL,phone3
EOF
$ ./splitCSV input.csv 100 ## split at first boundary after 100 bytes
$ cat input.0.csv
Src,AgentNum,PhoneNum
DWH,Agent_1234,phone1
NULL,NULL,phone2
NULL,NULL,phone3
DWH,Agent_5678,phone1
NULL,NULL,phone2
NULL,NULL,phone3
$ cat input.1.csv
Src,AgentNum,PhoneNum
DWH,Agent_9999,phone1
NULL,NULL,phone2
NULL,NULL,phone3
Upvotes: 1