Reputation: 757
Okay, so I've read about different ways of doing this, but I just want to check whether there is a hidden problem with the way I've done it, or whether there is a better way (perhaps grep?).
Here is my working code:
#!/usr/bin/perl
use strict;
use warnings;
my $chapternumber;
open my $corpus, '<', "/Users/jon/Desktop/chpts/chpt1-8/Lifeprocessed.txt" or die $!;
while (my $sentence = <$corpus>)
{
    if ($sentence =~ /\~\s(\d*F*[\.I_]\w+)\s/ )
    {
        $chapternumber = $1;
        $chapternumber =~ s/\./_/;
    }
    open my $outfile, '>>', "/Users/jon/Desktop/chpts/chpt$chapternumber.txt" or die $!;
    print $outfile $sentence;
}
The file is a textbook, and I have marked new chapters like this: ~ 1.1 Organisms Have Changed over Billions of Years 1.1.
or ~ 15Intro ...
or ~ F_14
I want each marker to begin a new file, chpt1_1.txt (or chpt15Intro.txt, etc.), which ends where the next chapter delimiter is found.
One option: instead of reading line by line, perhaps read a whole block at a time, like this?
local $/ = "~";
open...
while...
next unless ($sentenceblock =~ /\~\s([\d+F][\.I_][\d\w]+)\s/);
....
Thanks a lot.
Upvotes: 2
Views: 1460
Reputation: 63974
hm.. perhaps csplit?
Save the following into a file, e.g. splitter.sh:
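# '{*}' repeats the split at every match; without it csplit splits only once.
# ('{*}' is GNU csplit syntax; other implementations may need an explicit repeat count.)
csplit -s -f tmp - '/^~ [0-9][0-9]*\./' '{*}'
for file in tmp*
do
    title=($(head -n 1 "$file"))
    # skip a piece that does not start with a chapter marker (e.g. text before the first chapter)
    [ "${title[0]}" = "~" ] || continue
    mv "$file" "chpt${title[1]//./_}.txt"
done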
and run it like this:
bash splitter.sh < book.txt
Upvotes: 1
Reputation: 3744
First, the good things:
enabled strict and warnings
using 3-arg open and lexical filehandles
checking the return value from open()
But your regex makes no sense at all.
~ is not "meta" in regexes, so it does not need escaping
. is not "meta" in a character class, so it does not need escaping
[\d+F] is equivalent to [+F\d]. What is the "F" for? The + matches a literal plus character in a character class; it does NOT mean "one or more" here.
[\.I_] what is the "I" for? What is the underscore for?
[\d\w] is equivalent to [\w] and even to just \w
Your code calls open() way more times than it needs to.
tr/// is better than s/// for working with individual characters.
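To make the regex points above concrete, here is a small sketch. The sample strings are made up for illustration and are not taken from the textbook file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Inside a character class, + is a literal plus sign, not a quantifier.
my $plus_matches = ( '+'  =~ /^[\d+F]$/ ) ? 1 : 0;   # the class accepts a literal "+"
my $two_digits   = ( '15' =~ /^[\d+F]$/ ) ? 1 : 0;   # the class matches exactly one character

# \w already includes the digits, so [\d\w] and \w accept the same characters.
my $same_class = ( 'a1_' =~ /^[\d\w]+$/ && 'a1_' =~ /^\w+$/ ) ? 1 : 0;

# s/\./_/ replaces only the first dot; tr/./_/ replaces every dot.
( my $with_s  = '1.2.3' ) =~ s/\./_/;
( my $with_tr = '1.2.3' ) =~ tr/./_/;

print "plus=$plus_matches two_digits=$two_digits same=$same_class s=$with_s tr=$with_tr\n";
```

For a marker like 1.1 the two substitutions give the same result; they only differ once an identifier contains more than one dot.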
Hopefully this will put you onto the right track:
#!/usr/bin/perl
use warnings;
use strict;
my $outfile;
while (<DATA>) {
    if ( my($chapternumber) = /^~\s([\d.]+)/) {
        $chapternumber =~ tr/./_/;
        close $outfile if $outfile;
        open $outfile, '>', "chpt$chapternumber.txt"
            or die "could not open 'chpt$chapternumber.txt' $!";
    }
    print {$outfile} $_;
}
__DATA__
~ 1.1 Organisms Have Changed over Billions of Years 1.1
stuff
about changing
organisms
~ 1.2 Chapter One, Part Two 1.2
part two
stuff is here
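The pattern above only recognizes purely numeric markers such as 1.1. If the 15Intro and F_14 style markers from the question also need to start a new file, one possible variation is sketched below. It reuses the pattern from the question's own working code with the unneeded escapes dropped; chapter_id is a hypothetical helper name chosen for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# chapter_id() is a hypothetical helper: it returns the file-name-safe chapter
# id for a marker line, or undef for an ordinary line.
sub chapter_id {
    my ($line) = @_;
    # optional digits, optional F, then one of . I _ followed by word characters
    return unless $line =~ /^~\s(\d*F*[.I_]\w+)/;
    ( my $id = $1 ) =~ tr/./_/;
    return $id;
}

for my $line (
    '~ 1.1 Organisms Have Changed over Billions of Years 1.1',
    '~ 15Intro ...',
    '~ F_14',
    'stuff',
) {
    my $id = chapter_id($line);
    print defined $id ? "chpt$id.txt\n" : "(not a chapter marker)\n";
}
```

In the loop above, the test for a new chapter could then become something like if ( defined( my $chapternumber = chapter_id($_) ) ), with the rest left unchanged.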
Upvotes: 8
Reputation: 4070
Why not just slurp in the entire contents? Then you can match against each chapter title. The /m modifier makes ^ match at the start of every line within the multi-line string, and /g makes the while loop step through every match until no more appear. See man perlre.
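#!/usr/bin/perl
use strict;
use warnings;
open my $corpus, '<', '/Users/jon/..../Lifeprocessed.txt' or die $!;
# Slurp the whole file; localizing $/ restores it when the block ends.
my $contents = do { local $/; <$corpus> };
close($corpus);
# $1 is a whole chapter (marker line up to the next marker or end of input),
# $2 is the chapter id; the pattern is the one from the question's working code.
while ( $contents =~ /^(~\s(\d*F*[.I_]\w+)\s.*?)(?=^~\s|\z)/msg ) {
    my $chapter = $1;
    ( my $chapternumber = $2 ) =~ s/\./_/;
    open my $outfile, '>>', "/Users/jon/Desktop/chpts/chpt$chapternumber.txt" or die $!;
    print $outfile $chapter;
    close $outfile;
}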
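To see what /m and /g actually change, here is a minimal in-memory sketch. The sample text is made up and stands in for the slurped file; the split at the end is just one possible way to cut the string into chapters, not the only one.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample text standing in for the slurped file.
my $contents = "~ 1.1 First\nstuff\n~ 1.2 Second\nmore stuff\n";

# Without /m, ^ matches only at the very start of the string.
my @without_m = $contents =~ /^~\s([\d.]+)/g;

# With /m, ^ matches at the start of every line, so /g finds each marker.
my @with_m = $contents =~ /^~\s([\d.]+)/mg;

# Splitting just before each line that starts a chapter gives one chunk per chapter.
my @chunks = split /^(?=~\s)/m, $contents;

print "without /m: @without_m\n";
print "with /m: @with_m\n";
print scalar(@chunks), " chunks\n";
```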
Upvotes: 0