Reputation: 425
I have a big txt file (~1 GB) that contains specific text.
Example of the file content:
Is there a way in Linux to split this file into multiple files based on size and occurrence at the same time?
For example, I want to split my file into files of 100 MB, but each file must begin with a specific character, and the last line of each file must be the line that precedes the next occurrence of that character in the original file. Note that this character occurs frequently in the original file, so the sizes of the split files will roughly match.
Edit: you can download the txt file from here: [Sample File][2]
Upvotes: 0
Views: 254
Reputation: 1014
The regexp needs a little tuning, since the result files do not match completely. Run it as perl scriptname.pl < sample.txt and you get chunk files.
#!/usr/bin/perl -w
use strict;
use IO::File;

# Slurp the whole input, then cut it into pieces at the pattern boundaries.
my $all = join('', (<STDIN>));
my (@pieces) = ($all =~ m%([IZO]\(.*?\)\{.*?\r\n\}\r\n)%gsx);
my $n = 1;
my $FH;
foreach my $P (@pieces) {
    if ($P =~ m%^I%) {
        # A piece starting with 'I' begins a new chunk file.
        undef $FH;
        $FH = IO::File->new(sprintf("> chunk%d", $n));
        $n++;
    }
    print $FH $P;
}
Less memory-hungry:
#!/usr/bin/env python
import sys

def split(filename, size=100, outputPrefix="xxx"):
    with open(filename) as I:
        n = 0
        FNM = "{}{}.txt"
        O = open(FNM.format(outputPrefix, n), "w")
        toWrite = size*1024*1024
        for line in I:
            toWrite -= len(line)
            # Start a new output file only once the size budget is spent
            # and the current line begins with the marker character 'I'.
            if line[0] == 'I' and toWrite < 0:
                O.close()
                toWrite = size*1024*1024
                n += 1
                O = open(FNM.format(outputPrefix, n), "w")
            O.write(line)
        O.close()

if __name__ == "__main__":
    split(sys.argv[1])
Use: python scriptname.py sample.txt. All output files concatenated are equal to sample.txt.
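The split logic above can be sketched and checked on a tiny in-memory example. This is only an illustration of the idea, not the answer's script: the function name, the synthetic lines, and the small size budget are my own, and the marker character 'I' is taken from the answer.

```python
def split_lines(lines, size, marker="I"):
    """Group lines into chunks of roughly `size` bytes; a new chunk may
    only begin at a line whose first character is `marker`."""
    chunks = [[]]
    remaining = size
    for line in lines:
        remaining -= len(line)
        # Open a new chunk only when the budget is spent AND the line
        # starts with the marker, so chunk boundaries never cut a record.
        if line.startswith(marker) and remaining < 0 and chunks[-1]:
            chunks.append([])
            remaining = size - len(line)
        chunks[-1].append(line)
    return ["".join(c) for c in chunks]

# Two marker-delimited records; a budget of 8 bytes forces a split
# exactly at the second 'I' line.
lines = ["I(a){\n", "body\n", "}\n", "I(b){\n", "body\n", "}\n"]
chunks = split_lines(lines, size=8)

# Concatenating the chunks reproduces the original text exactly,
# and every chunk starts with the marker.
assert "".join(chunks) == "".join(lines)
```

The same concatenation check can be run on the real scripts with cat xxx*.txt | cmp - sample.txt.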
Upvotes: 1