Katchy

Reputation: 85

Splitting of Big File into Smaller Chunks in Shell Scripting

I need to split a big file into smaller chunks based on the last occurrence of a pattern in it, using a shell script. For example:

Sample.txt (the file is sorted on the third field, which is the field the pattern is searched in):

NORTH EAST|0004|00001|Fost|Weaather|<br/> 
NORTH EAST|0004|00001|Fost|Weaather|<br/> 
SOUTH|0003|00003|Haet|Summer|<br/> 
SOUTH|0003|00003|Haet|Summer|<br/> 
SOUTH|0003|00003|Haet|Summer|<br/> 
EAST|0007|00016|uytr|kert|<br/> 
EAST|0007|00016|uytr|kert|<br/> 
WEST|0002|00112|WERT|fersg|<br/> 
WEST|0002|00112|WERT|fersg|<br/>
SOUTHWEST|3456|01134|GDFSG|EWRER|<br/> 

"Pattern 1 = 00003 " to be searched output file must contain sample_00003.txt

NORTH EAST|0004|00001|Fost|Weaather|<br/> 
NORTH EAST|0004|00001|Fost|Weaather|<br/>
SOUTH|0003|00003|Haet|Summer|<br/> 
SOUTH|0003|00003|Haet|Summer|<br/> 
SOUTH|0003|00003|Haet|Summer|<br/> 

"Pattren 2 = 00112" to be searched output file must contain sample_00112.txt

EAST|0007|00016|uytr|kert|<br/> 
EAST|0007|00016|uytr|kert|<br/> 
WEST|0002|00112|WERT|fersg|<br/> 
WEST|0002|00112|WERT|fersg|<br/> 

I used

awk -F'|' -v pattern="00003" '$3 ~ pattern' big_file > smallfile

and grep commands, but they were very time-consuming since the file is 300+ MB in size.

Upvotes: 1

Views: 201

Answers (2)

mklement0

Reputation: 439467

Not sure if you'll find a faster tool than awk, but here's a variant that fixes your own attempt and also speeds things up a little by using string matching rather than regex matching.

It processes lookup values in a loop, and outputs everything from where the previous iteration left off through the last occurrence of the value at hand to a file named smallfile<n>, where <n> is an index starting with 1.

ndx=0; fromRow=1
for val in '00003' '00112' '|'; do  # 2 sample values to match, plus dummy value
  chunkFile="smallfile$(( ++ndx ))"
  fromRow=$(awk -F'|' -v fromRow="$fromRow" -v outFile="$chunkFile" -v val="$val" '
    NR < fromRow { next }                       # skip rows already written by previous iterations
    {
      if ($3 != val) {
        if (p) { print NR; exit }               # just past the last match: emit the next start row, stop
      } else {
        p = 1                                   # inside the block of matching rows
      }
    }
    { print > outFile }                         # copy the current row to the chunk file
  ' big_file)
done

Note that the dummy value | (which can never equal a field value) ensures that any rows remaining after the last true value to match are saved to a chunk file too.
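With the sample file above, this produces smallfile1 with the two NORTH EAST rows and the three SOUTH rows (everything through the last 00003 row), smallfile2 with the EAST and WEST rows (through the last 00112 row), and smallfile3 with the trailing SOUTHWEST row.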


Note that moving all the logic into a single awk script should be much faster, because big_file would only have to be read once:

awk -F'|' -v vals='00003|00112' '
  BEGIN { split(vals, val); outFile = "smallfile" ++ndx }   # vals is split on FS, i.e. "|"
  {
    if ($3 != val[ndx]) {
      if (p) { p = 0; close(outFile); outFile = "smallfile" ++ndx }   # last match passed: switch to the next chunk file
    } else {
      p = 1                                                 # inside the block matching the current value
    }
    print > outFile
  }
' big_file
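Closing each completed chunk file with close() keeps at most one output file open at a time, so even a long list of values won't run into the per-process open-file limit.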

Upvotes: 2

mauro

Reputation: 5950

You can try with Perl:

 perl -ne '/00003/ && print' big_file > small_file

and compare its timing with other solutions...
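Note that this prints only the lines that contain 00003 anywhere, whereas the expected sample_00003.txt also includes the preceding NORTH EAST rows. A field-exact sketch that reproduces that everything-up-through-the-last-match behavior could look like this (assuming the same sorted input; -a/-F autosplit each line on |):

 perl -F'\|' -ane '
   if ($F[2] eq "00003") { $seen = 1 }   # inside the matching block
   elsif ($seen)         { exit }        # first row past the last match: stop
   print;                                # emit everything up to that point
 ' big_file > small_file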

EDIT

Limiting my answer to tools you haven't already tried... you can also use:

sed -n '/00003/p' big_file > small_file

But I tend to believe perl will be faster. Again... I'd suggest you measure the elapsed time of the different solutions on your own.
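For instance, the shell's built-in time shows the elapsed time of each run:

 time perl -ne '/00003/ && print' big_file > small_file
 time sed -n '/00003/p' big_file > small_file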

Upvotes: 0
