Katchy

Reputation: 85

Splitting of Big File into Smaller Chunks in Shell Scripting

I need to split a big file into smaller chunks based on the last occurrence of a pattern in it, using a shell script. For example:

Sample.txt (the file is sorted on the third field, which is the field the pattern is searched in):

NORTH EAST|0004|00001|Fost|Weaather|<br/> 
NORTH EAST|0004|00001|Fost|Weaather|<br/> 
SOUTH|0003|00003|Haet|Summer|<br/> 
SOUTH|0003|00003|Haet|Summer|<br/> 
SOUTH|0003|00003|Haet|Summer|<br/> 
EAST|0007|00016|uytr|kert|<br/> 
EAST|0007|00016|uytr|kert|<br/> 
WEST|0002|00112|WERT|fersg|<br/> 
WEST|0002|00112|WERT|fersg|<br/>
SOUTHWEST|3456|01134|GDFSG|EWRER|<br/> 

"Pattern 1 = 00003 " to be searched output file must contain sample_00003.txt

NORTH EAST|0004|00001|Fost|Weaather|<br/> 
NORTH EAST|0004|00001|Fost|Weaather|<br/>
SOUTH|0003|00003|Haet|Summer|<br/> 
SOUTH|0003|00003|Haet|Summer|<br/> 
SOUTH|0003|00003|Haet|Summer|<br/> 

"Pattren 2 = 00112" to be searched output file must contain sample_00112.txt

EAST|0007|00016|uytr|kert|<br/> 
EAST|0007|00016|uytr|kert|<br/> 
WEST|0002|00112|WERT|fersg|<br/> 
WEST|0002|00112|WERT|fersg|<br/> 

I used

awk -F'|' -v pattern="00003" '$3 ~ pattern' big_file > smallfile

and grep commands, but they were very time-consuming since the file is 300+ MB in size.

Upvotes: 1

Views: 201

Answers (2)

mklement0

Reputation: 439467

Not sure if you'll find a faster tool than awk, but here's a variant that fixes your own attempt and also speeds things up a little by using string matching rather than regex matching.

It processes lookup values in a loop, and outputs everything from where the previous iteration left off through the last occurrence of the value at hand to a file named smallfile<n>, where <n> is an index starting with 1.

ndx=0; fromRow=1
for val in '00003' '00112' '|'; do  # 2 sample values to match, plus dummy value
  chunkFile="smallfile$(( ++ndx ))"
  fromRow=$(awk -F'|' -v fromRow="$fromRow" -v outFile="$chunkFile" -v val="$val" '
    NR < fromRow { next }                       # skip rows already written by previous iterations
    {
      if ($3 != val) {
        if (p) { print NR; exit }               # just past the last match: emit the next start row, stop
      } else {
        p = 1                                   # inside the block of matching rows
      }
    }
    { print > outFile }                         # copy the current row to the chunk file
  ' big_file)
done

Note that the dummy value | (which can never equal a field value) ensures that any rows remaining after the last true value to match are saved to a chunk file too.
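With the sample file above, this produces smallfile1 with the two NORTH EAST rows and the three SOUTH rows (everything through the last 00003 row), smallfile2 with the EAST and WEST rows (through the last 00112 row), and smallfile3 with the trailing SOUTHWEST row.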


Note that moving all the logic into a single awk script should be much faster, because big_file would only have to be read once:

awk -F'|' -v vals='00003|00112' '
  BEGIN { split(vals, val); outFile = "smallfile" ++ndx }   # vals is split on FS, i.e. "|"
  {
    if ($3 != val[ndx]) {
      if (p) { p = 0; close(outFile); outFile = "smallfile" ++ndx }   # last match passed: switch to the next chunk file
    } else {
      p = 1                                                 # inside the block matching the current value
    }
    print > outFile
  }
' big_file
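Closing each completed chunk file with close() keeps at most one output file open at a time, so even a long list of values won't run into the per-process open-file limit.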

Upvotes: 2

mauro

Reputation: 5950

You can try with Perl:

 perl -ne '/00003/ && print' big_file > small_file

and compare its timing with other solutions...
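Note that this prints only the lines that contain 00003 anywhere, whereas the expected sample_00003.txt also includes the preceding NORTH EAST rows. A field-exact sketch that reproduces that everything-up-through-the-last-match behavior could look like this (assuming the same sorted input; -a/-F autosplit each line on |):

 perl -F'\|' -ane '
   if ($F[2] eq "00003") { $seen = 1 }   # inside the matching block
   elsif ($seen)         { exit }        # first row past the last match: stop
   print;                                # emit everything up to that point
 ' big_file > small_file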

EDIT

Limiting my answer to tools you haven't already tried... you can also use:

sed -n '/00003/p' big_file > small_file

But I tend to believe perl will be faster. Again... I'd suggest you measure the elapsed time of the different solutions on your own.
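For instance, the shell's built-in time shows the elapsed time of each run:

 time perl -ne '/00003/ && print' big_file > small_file
 time sed -n '/00003/p' big_file > small_file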

Upvotes: 0
