user3055262

Reputation: 405

split a file based on a pattern

I have a file with the following pattern:

HDR1|20160101|1234|
N1|ABC|
XXX|21431415|3522352352|ITEM|
FORE|20140508|20140214|
SD|0|0039 - data|data|data|data|
SD|0|0211 - data|data|data|data|
SD|0|0039 - data|data|data|data|
SD|0|0211 - data|data|data|data|
FORE|20140508|20140214|
SD|0|0039 - data|data|data|data|
SD|0|0039 - data|data|data|data|
SD|0|0211 - data|data|data|data|

I would like to split the file based on size, with two constraints.

The first 3 lines are the header, which I need to include in every split file that I create. Each line starting with FORE is related to the SD lines that follow it, so I have to keep them all together in the same file.

The output should look like this:

Split File 1:

HDR1|20160101|1234|
N1|ABC|
XXX|21431415|3522352352|ITEM|
FORE|20140508|20140214|
SD|0|0039 - data|data|data|data|
SD|0|0211 - data|data|data|data|
SD|0|0039 - data|data|data|data|
SD|0|0211 - data|data|data|data|

Split File 2:

HDR1|20160101|1234|
N1|ABC|
XXX|21431415|3522352352|ITEM|
FORE|20140508|20140214|
SD|0|0039 - data|data|data|data|
SD|0|0039 - data|data|data|data|
SD|0|0211 - data|data|data|data|

I have written pseudocode, shown below. There can be multiple sets of FORE and SD lines, each of which I have to keep together as a set, so I've put them in a loop.

create $file
create $line_num=4
create $file_size
create $top_size=20mb
read the first 3 lines of the original file and copy it in a temphdr file
    Loop until last $line_num is encountered
        read the header details and Append the header from the temphdr to the $file
        for each $record starting the head -$line_num (4,5,6...etc) that contains FORE| in the first part
            if the $file size is < $top_size
                append the $record in the $file

                increment $line_num
                For each $record in head -$line_num that contains SD| in the first part
                    append the $record in the $file
                    increment $line_num 
            else
                create a $file=$file+1
            fi
        end loop
    end loop    

Could someone let me know whether there is a more effective way to implement this with awk, sed, etc. than the high-level logic above?

Upvotes: 0

Views: 308

Answers (3)

Walter A

Reputation: 20002

The problem would be easier if each FORE line and its SD lines were joined into a single line, like this:

FORE|20140508|20140214|\rSD|0|0039 - data|data|data|data|\rSD|0|0211 - data|data|data|data|\rSD|0|0039 - data|data|data|data|\rSD|0|0211 - data|data|data|data|
FORE|20140508|20140214|\rSD|0|0039 - data|data|data|data|\rSD|0|0039 - data|data|data|data|\rSD|0|0211 - data|data|data|data|

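First preprocess the file with awk, saving the headers in a temp file and joining each FORE line with the SD lines that follow it, using \r as the separator. Then call split -C 20m filename, with any additional parameters you like; -C keeps whole lines together, so a joined group is never cut in the middle as long as it fits in 20m. Finally, use tr "\r" "\n" to turn the joined lines back into separate lines, and add the headers to all files.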

EDIT: Preprocessing for joined lines can be done with

awk 'NR<=3 { print >> "filename.head" }
   /^FORE/ { printf("%s%s",skipFirstNewline, $0); skipFirstNewline="\n" }
   /^SD/ { printf("\r%s",$0) }
   END{printf "\n" }' filename

When you check the results, the carriage returns (\r) can be confusing because they are hard to see in a terminal. While testing, temporarily replace \r with a visible marker such as rr.
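The remaining steps (split, restore the line breaks, add the headers) could be put together roughly as follows. This is only a sketch: it assumes GNU split (for -C), an input file that contains no carriage returns of its own, and the function name split_joined and the .head/.joined/.part. suffixes are made up for illustration.

```shell
# Sketch of the whole pipeline described above.
# Assumptions: GNU split (-C), no \r in the input, placeholder names throughout.
split_joined() {   # usage: split_joined FILE [SIZE]   (SIZE as accepted by split -C)
  in=$1
  max=${2:-20m}

  head -n 3 "$in" > "$in.head"                 # save the three header lines

  # Join every FORE line with its SD lines, using \r as the separator
  awk 'NR <= 3 { next }
       /^FORE/ { printf("%s%s", sep, $0); sep = "\n" }
       /^SD/   { printf("\r%s", $0) }
       END     { printf "\n" }' "$in" > "$in.joined"

  # -C keeps whole lines together (a line longer than SIZE would still be cut)
  split -C "$max" "$in.joined" "$in.part."

  # Restore the line breaks and put the headers in front of every part
  for part in "$in".part.*; do
    { cat "$in.head"; tr '\r' '\n' < "$part"; } > "$part.tmp" &&
      mv "$part.tmp" "$part"
  done

  rm -f "$in.head" "$in.joined"
}
```

Calling split_joined filename would then leave filename.part.aa, filename.part.ab, and so on, each starting with the three header lines.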

Upvotes: 0

Charles Duffy

Reputation: 295373

Nothing nearly so complex is called for. This can be implemented in pure shell with no external commands at all (no head, awk, etc).

#!/usr/bin/env ksh

max_size=$(( 20 * 1024 * 1024 ))

# Read our three fixed header lines
headers=''
IFS= read -r line; headers+="$line"$'\n'               # IFS= keeps leading/trailing whitespace
IFS= read -r line; headers+="$line"$'\n'
IFS= read -r line; headers+="$line"$'\n'

splitNum=1                                             # variable to track file number
splitFileName=$(printf 'split.%04d' "$splitNum")       # generate first filename
exec >"$splitFileName"                                 # and redirect stdout to that file

printf '%s' "${headers}"                               # print our headers...
cur_size=$(( ${#headers} ))                            # and set cur_size to their length

while IFS= read -r line; do                            # For each line:
  # check for and manage rotation
  if [[ $line = "FORE|"* ]]; then                      # If it's a FORE...
    if (( cur_size > max_size )); then                 # ...and over size: start a new file
      (( ++splitNum ))                                 # increment the split number
      splitFileName=$(printf 'split.%04d' "$splitNum") # generate a new filename
      exec >"$splitFileName"                           # redirect stdout to that file
      printf '%s' "${headers}"                         # print headers to stdout
      cur_size=$(( ${#headers} ))                      # reset size to size of headers
    fi
  fi
  # whether or not we had to do any of that:
  printf '%s\n' "$line"                                # print the line we just read
  cur_size=$(( cur_size + ${#line} + 1 ))              # and increment cur_size
done

Note that if you were porting this to bash, you might want to change splitFileName=$(printf 'split.%04d' "$splitNum") to printf -v splitFileName 'split.%04d' "$splitNum". ksh93 is smart enough to optimize away the subshell involved in the command substitution automatically; bash requires explicit syntax to avoid the overhead.
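For reference, a bash port along those lines might look like the sketch below. It is not a drop-in replacement for the script above: the wrapper function split_by_size, the optional size argument, and the use of file descriptor 3 instead of redirecting stdout are assumptions made so it can be tried on small input. Like the original, it counts characters via ${#line}, which equals bytes only for single-byte data.

```shell
# Sketch of a bash port (needs bash, not plain sh).
# Reads the file on stdin and writes split.0001, split.0002, ...
# The optional argument (max size per file) is an assumption for easy testing.
split_by_size() {
  local max_size=${1:-$(( 20 * 1024 * 1024 ))}
  local headers='' line splitNum=1 splitFileName cur_size i

  for i in 1 2 3; do                                   # three fixed header lines
    IFS= read -r line || return 1
    headers+="$line"$'\n'
  done

  printf -v splitFileName 'split.%04d' "$splitNum"     # printf -v: no subshell
  exec 3>"$splitFileName"
  printf '%s' "$headers" >&3
  cur_size=${#headers}

  while IFS= read -r line; do
    # Rotate only at a FORE line, so a FORE and its SD lines stay together
    if [[ $line == "FORE|"* ]] && (( cur_size > max_size )); then
      (( ++splitNum ))
      printf -v splitFileName 'split.%04d' "$splitNum"
      exec 3>"$splitFileName"
      printf '%s' "$headers" >&3
      cur_size=${#headers}
    fi
    printf '%s\n' "$line" >&3
    (( cur_size += ${#line} + 1 ))                     # +1 for the newline
  done
  exec 3>&-
}
```

It would be called as split_by_size < file. Because the size check only happens at a FORE line, a file can end up larger than the limit by up to one FORE/SD group.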

Upvotes: 1

anubhava

Reputation: 785068

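You can use this awk command, which starts a new file (split-1, split-2, ...) at every FORE line and writes the three header lines at the top of each: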

awk -F '|' 'NR<=3{
   hdr = hdr $0 RS
}
$1=="FORE"{
   close(fn)
   fn="split-" ++n
   printf "%s%s", hdr, $0 RS > fn
}
$1=="SD"{
   print > fn
}
END{close(fn)}' file

In one line:

awk -F '|' 'NR<=3{hdr = hdr $0 RS} $1=="FORE"{close(fn); fn="split-" ++n; printf "%s%s", hdr, $0 RS > fn} $1=="SD"{print > fn} END{close(fn)}' file
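The command above starts a new file at every FORE line regardless of size. If the files should instead grow up to a size limit, as the question asks, one possible variant is sketched below. The function name split_fore_by_size and the max variable are assumptions added here, and length() counts characters, which matches bytes only for single-byte data.

```shell
# Sketch: rotate at a FORE line only once the current file has reached max.
# split_fore_by_size and "max" are illustrative names, not part of the answer above.
split_fore_by_size() {   # usage: split_fore_by_size FILE [MAX_SIZE]
  awk -F '|' -v max="${2:-20971520}" '
    NR <= 3 { hdr = hdr $0 RS; next }             # collect the three header lines
    $1 == "FORE" && (fn == "" || size >= max) {   # new file only at a group boundary
      if (fn != "") close(fn)
      fn = "split-" (++n)
      printf "%s", hdr > fn
      size = length(hdr)
    }
    fn != "" && ($1 == "FORE" || $1 == "SD") {    # write the line and track the size
      print > fn
      size += length($0) + 1                      # +1 for the newline
    }
  ' "$1"
}
```

For example, split_fore_by_size file writes split-1, split-2, and so on with the default limit of 20 MiB (20971520).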

Upvotes: 1
