Reputation: 405
I have a file with the following pattern:
HDR1|20160101|1234|
N1|ABC|
XXX|21431415|3522352352|ITEM|
FORE|20140508|20140214|
SD|0|0039 - data|data|data|data|
SD|0|0211 - data|data|data|data|
SD|0|0039 - data|data|data|data|
SD|0|0211 - data|data|data|data|
FORE|20140508|20140214|
SD|0|0039 - data|data|data|data|
SD|0|0039 - data|data|data|data|
SD|0|0211 - data|data|data|data|
I would like to split the file based on size, but with two constraints.
The first 3 lines are the header, which must be included in every split file that I create. Each line starting with FORE is related to the SD lines below it, so I have to keep them all together.
The output should look like this:
Split File 1:
HDR1|20160101|1234|
N1|ABC|
XXX|21431415|3522352352|ITEM|
FORE|20140508|20140214|
SD|0|0039 - data|data|data|data|
SD|0|0211 - data|data|data|data|
SD|0|0039 - data|data|data|data|
SD|0|0211 - data|data|data|data|
Split File 2:
HDR1|20160101|1234|
N1|ABC|
XXX|21431415|3522352352|ITEM|
FORE|20140508|20140214|
SD|0|0039 - data|data|data|data|
SD|0|0039 - data|data|data|data|
SD|0|0211 - data|data|data|data|
I have written the pseudocode below. There can be multiple sets of FORE and SD lines, each of which I have to keep together as a set, so I have put them in a loop.
create $file
create $line_num=4
create $file_size
create $top_size=20mb
read the first 3 lines of the original file and copy them to a temphdr file
loop until the last $line_num is encountered
    append the header from temphdr to $file
    for each $record from head -$line_num (4,5,6...etc) that contains FORE| in the first field
        if the $file size is < $top_size
            append the $record to $file
            increment $line_num
            for each $record in head -$line_num that contains SD| in the first field
                append the $record to $file
                increment $line_num
            end loop
        else
            create a $file=$file+1
        fi
    end loop
end loop
Could someone let me know whether there is a more effective way to implement this with awk, sed, etc. than the high-level logic above?
Upvotes: 0
Views: 308
Reputation: 20002
The problem would be easier if each FORE line and its SD lines were joined into a single line, like
FORE|20140508|20140214|\rSD|0|0039 - data|data|data|data|\rSD|0|0211 - data|data|data|data|\rSD|0|0039 - data|data|data|data|\rSD|0|0211 - data|data|data|data|
FORE|20140508|20140214|\rSD|0|0039 - data|data|data|data|\rSD|0|0039 - data|data|data|data|\rSD|0|0211 - data|data|data|data|
First preprocess the file with awk, saving the headers in a temp file and joining each line that starts with SD onto the preceding FORE line.
Now call split -C 20m filename with any additional parameters you like.
Next use tr "\r" "\n" to break the joined records back into separate lines, and add the headers to every file.
EDIT: Preprocessing for joined lines can be done with
awk 'NR<=3   { print > "filename.head" }   # save the 3 header lines
     /^FORE/ { printf("%s%s", skipFirstNewline, $0); skipFirstNewline="\n" }
     /^SD/   { printf("\r%s", $0) }        # join SD lines to their FORE line with \r
     END     { printf "\n" }' filename
When you are checking the results, the carriage returns \r can be confusing, so temporarily replace \r with rr when you want to inspect the output.
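The three steps above can be sketched end to end as follows. This is only an illustration under some assumptions: the function name split_joined, the temporary files headers.tmp and joined.tmp, and the chunk./split. prefixes are made up here, and split -C is the GNU coreutils option.

```shell
#!/bin/sh
# Hypothetical helper: split_joined INPUT MAX_SIZE
# Writes split.1, split.2, ... into the current directory.
split_joined() {
    in=$1 max=$2

    # 1. Save the 3 header lines; join each FORE line and its SD lines with \r.
    awk 'NR<=3   { print > "headers.tmp"; next }
         /^FORE/ { printf("%s%s", sep, $0); sep = "\n" }
         /^SD/   { printf("\r%s", $0) }
         END     { printf "\n" }' "$in" > joined.tmp

    # 2. Split on line boundaries, at most MAX_SIZE bytes per chunk (GNU split).
    split -C "$max" joined.tmp chunk.

    # 3. Turn \r back into newlines and prepend the headers to every chunk.
    n=0
    for f in chunk.*; do
        n=$((n + 1))
        { cat headers.tmp; tr '\r' '\n' < "$f"; } > "split.$n"
    done
    rm -f joined.tmp headers.tmp chunk.*
}
```

It would be called as split_joined filename 20m. Note that the size limit applies to the joined records only, so each output file is larger than the limit by the size of the headers, and a single FORE group longer than the limit would still be broken up by split.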
Upvotes: 0
Reputation: 295373
Nothing nearly so complex is called for. This can be implemented in pure shell with no external commands at all (no head, awk, etc).
#!/usr/bin/env ksh
max_size=$(( 20 * 1024 * 1024 ))
# Read our three fixed header lines
headers=''
read -r line; headers+="$line"$'\n'
read -r line; headers+="$line"$'\n'
read -r line; headers+="$line"$'\n'
splitNum=1                                        # variable to track file number
splitFileName=$(printf 'split.%04d' "$splitNum")  # generate first filename
exec >"$splitFileName"                            # and redirect stdout to that file
printf '%s' "${headers}"                          # print our headers...
cur_size=$(( ${#headers} ))                       # and set cur_size to their length
while IFS= read -r line; do                       # For each line:
  # check for and manage rotation
  if [[ $line = "FORE|"* ]]; then                 # If it's a FORE...
    if (( cur_size > max_size )); then            # ...and over size: start a new file
      (( ++splitNum ))                            # increment the split number
      splitFileName=$(printf 'split.%04d' "$splitNum")  # generate a new filename
      exec >"$splitFileName"                      # redirect stdout to that file
      printf '%s' "${headers}"                    # print headers to stdout
      cur_size=$(( ${#headers} ))                 # reset size to size of headers
    fi
  fi
  # whether or not we had to do any of that:
  printf '%s\n' "$line"                           # print the line we just read
  cur_size=$(( cur_size + ${#line} + 1 ))         # and increment cur_size
done
Note that if you were porting this to bash, you might want to change splitFileName=$(printf 'split.%04d' "$splitNum") to printf -v splitFileName 'split.%04d' "$splitNum". ksh93 is smart enough to optimize away the subshell involved in the command substitution automatically; bash requires explicit syntax to avoid the overhead.
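To make the difference concrete, here is a small bash-only illustration of the two forms. The variable names viaSubst and viaPrintfV and the value 7 are made up for the demo; both assignments should yield the same filename.

```shell
#!/usr/bin/env bash
# Both forms produce the same filename; printf -v assigns directly,
# without a command substitution.
splitNum=7

viaSubst=$(printf 'split.%04d' "$splitNum")      # command substitution
printf -v viaPrintfV 'split.%04d' "$splitNum"    # bash-only: assign in place

echo "$viaSubst $viaPrintfV"
```

Since the filename is only regenerated when a new split file is started, the saving is small here; the habit matters more in loops that format a value on every iteration.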
Upvotes: 1
Reputation: 785068
You can use this awk command:
awk -F '|' 'NR<=3{
    hdr = hdr $0 RS
}
$1=="FORE"{
    close(fn)
    fn="split-" ++n
    printf "%s%s", hdr, $0 RS > fn
}
$1=="SD"{
    print > fn
}
END{close(fn)}' file
In one line:
awk -F '|' 'NR<=3{hdr = hdr $0 RS} $1=="FORE"{close(fn); fn="split-" ++n; printf "%s%s", hdr, $0 RS > fn} $1=="SD"{print > fn} END{close(fn)}' file
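The command above starts a new file for every FORE line. If several FORE groups should share a file until a size limit is reached, one possible variant is sketched below. The function name split_by_size and the max variable are assumptions for illustration, not part of the answer above, and sizes are counted with length(), so they are characters rather than bytes in a multibyte locale.

```shell
#!/bin/sh
# Hypothetical helper: split_by_size INPUT MAX_SIZE
# Starts a new split-N file at a FORE line once the current file has
# reached MAX_SIZE, so a FORE line and its SD lines always stay together.
split_by_size() {
    awk -F '|' -v max="$2" '
    NR <= 3 { hdr = hdr $0 RS; next }           # collect the 3 header lines
    $1 == "FORE" && (fn == "" || size >= max) {
        if (fn != "") close(fn)
        fn = "split-" (++n)
        printf "%s", hdr > fn                    # every file starts with the headers
        size = length(hdr)
    }
    fn != "" { print > fn; size += length($0) + 1 }
    ' "$1"
}
```

For a 20 MB limit this would be called as split_by_size file 20971520. Because the limit is only checked at FORE lines, a file can exceed it by up to the size of one FORE group.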
Upvotes: 1