user123

Reputation: 425

Splitting a file into multiple files based on size and occurrence

I have a large text file (~1 GB) that contains specific text.

example of the file content:

Is there a way in Linux to split this file into multiple files based on size and occurrence at the same time?

For example, I want to split my file into files of roughly 100 MB, but each file must begin with a specific character, and its last line must be the line that precedes the next occurrence of that character in the original file. Note that this character occurs frequently in the original file, so the sizes of the split files will stay close to the target.

Edit: You can download the txt file here: [Sample File][2]

Upvotes: 0

Views: 254

Answers (1)

hootnot

Reputation: 1014

The regexp needs a little tuning, since the result files do not match completely. Run it as `perl scriptname.pl < sample.txt` and you get chunk files.

#!/usr/bin/perl -w

use strict;
use IO::File;

# Slurp the whole input; this keeps the entire file in memory.
my $all = join('', <STDIN>);

# Capture every block of the form [IZO](...){...}\r\n}\r\n
my @pieces = ($all =~ m%([IZO]\(.*?\)\{.*?\r\n\}\r\n)%gsx);

my $n = 1;
my $FH;
foreach my $P (@pieces) {
   if ($P =~ m%^I%) {
      # Start a new chunk file whenever a piece begins with 'I'.
      undef $FH;                                    # closes the previous file
      $FH = IO::File->new(sprintf("> chunk%d", $n));
      $n++;
   }
   print $FH $P if defined $FH;   # skip any pieces before the first 'I'
}

Less memory hungry:

#!/usr/bin/env python

import sys

def split(filename, size=100, outputPrefix="xxx"):
    """Split filename into chunks of roughly `size` MB, each one
    starting with a line that begins with 'I'."""
    with open(filename) as I:
        n = 0
        FNM = "{}{}.txt"
        O = open(FNM.format(outputPrefix, n), "w")
        toWrite = size*1024*1024            # remaining budget for this chunk
        for line in I:
            toWrite -= len(line)
            # Only roll over to a new file at a line starting with 'I',
            # once the size budget of the current chunk is exhausted.
            if line[0] == 'I' and toWrite < 0:
                O.close()
                toWrite = size*1024*1024
                n += 1
                O = open(FNM.format(outputPrefix, n), "w")
            O.write(line)
        O.close()

if __name__ == "__main__":
    split(sys.argv[1])

Run it as: `python scriptname.py sample.txt`. Concatenating all of the chunk files reproduces sample.txt exactly.
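That claim is easy to check. The sketch below is an assumption on my part, not part of the answer: a hypothetical `verify` helper that relies on the `xxx{n}.txt` naming scheme used by the split script above.

```python
import glob

def verify(original, prefix="xxx"):
    """Return True if the chunk files, concatenated in numeric order,
    reproduce the original file byte-for-byte (text mode)."""
    # Sort numerically so xxx10.txt comes after xxx9.txt, not after xxx1.txt.
    chunks = sorted(glob.glob(prefix + "[0-9]*.txt"),
                    key=lambda p: int(p[len(prefix):-4]))
    joined = "".join(open(p).read() for p in chunks)
    with open(original) as f:
        return joined == f.read()
```

With no chunk files present the function returns False for any non-empty original, so a silent failure of the split script is also caught.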

Upvotes: 1
