Reputation: 259
I have a perl script that does some regex replacements on a text file, which I need to modify along the following lines: (a) I need to process the text as blocks of text, and then, depending on the presence/absence of one line different replacements need to be done. (b) I need to add text to the end of each block. (this transforms text from a transcription program to LaTeX code)
These are supposed to be two columns:
To the left is how the input looks, to the right what it should become:
ORIGINAL INPUT EXPECTED OUTCOME
# Single line blocks: label to be replaced and \xe added to en
txt@#Name Text text text \ex[exno=\spkr{Name}] \txt Text text text
\xe
nvb@#Name Text text text \ex[exno=\spkr{Name}] \nvb Text text text
\xe
# Multi-line blocks: labels to be replaced and \xe added to end
txt@#Name Text text text \ex[exno=\spkr{Name}] \txt Text text text
fte@#Name Text text text \freetr Text text text
\xe
txt@#Name Text text text \ex[exno=\spkr{Name}] \txt Text text text
SD (0.0) \silence{0.0}
\xe
txt@#Name Text text text \ex[exno=\spkr{Name}] \txt Text text text
tli@#Name Text text text \translit Text text text
fte@#Name Text text text \freetr Text text text
\xe
# Multi-line block that has the mrb@... line (must start with txt):
txt@#Name Text text text \ex[exno=\spkr{Name}] \begingl \glpreamble Text text text //
mrb@#Name Text text text \gla Text text text //
gle@#Name Text text text \glb Text text text //
fte@#Name Text text text \glft Text text text //
SD (0.0) \endgl
\silence{0.0}
\xe
# The tricky thing here is that (a) the labels get replaced differently, the txt line gets two commands, \begingl and \glpreamble, all lines have to end with // and they end with \endgl and \xe. In case there is an SD (silence duration) line then that needs to go between the \endgl and the \xe. (but not all have the SD).
Blocks are separated by an extra blank line. The first line of each block begins with a label txt@...
, nvb@...
or event
and may or may not be followed by subsequent lines starting with different labels. Each label needs to be replaced with something else, here accomplished through regexes like in the example below (plus some other replacements, this is just minimal for purpose of explanation). And then I need to mark the end of each block.
Furthermore, I need to have one conditional somewhere in there: If the block includes a line starting with an mrb@ label (like the sixth block above), different replacement patterns apply.
The following script is what I have, but it processes everything line by line. I know perl can do block by block, which should then make it possible to do the modifications, but unfortunately my skills are way too rudimentary to figure it out by myself.
#!/usr/bin/perl
use warnings;
use strict;
open my $fh_in, '<', $ARGV[0] or die "No input: $!";
open my $fh_out, '>', $ARGV[1] or die "No output: $!";
print $fh_out "\\begin{myenv}\n\n"; # begins group at beginning of file
while (<$fh_in>)
{
# general replacements for everything except if block includes a "mrb@" line:
s/^txt@#(\S*)\s+(.*)/\\ex[exno=\\spkr{$1}] \\txt $2 /g;
s/^nvb@#(\S*)\s+(.*)/\\ex[exno=\\spkr{$1}] \\txt $2 /g;
s/^tli@#\S*\s+(.*)/\\translit $1 /g;
s/^fte@#\S*\s+(.*)/\\freetr $1 /g;
s/^SD\s*\((\d*)\.(\d*)\)/\\silence{\($1\.$2\)}/g;
# after each block I need to add "\\xe"
# replacements if block includes a "mrb@" line:
s/^txt@#(\S*)\s+(.*)/\\ex[exno=\\spkr{$1}] \\begingl \\glpreamble $2 \/\/ /g;
s/^mrb@#\S*\s+(.*)/\\gla $1 \/\/ /g; #
s/^gle@#\S*\s+(.*)/\\glb $1 \/\/ /g; #
s/^fte@#\S*\s+(.*)/\\glft $1 \/\/ /g; #
s/^tli@#\S*\s+(.*)/\\translit $1 \/\/ /g; #
s/^fte@#\S*\s+(.*)/\\freetr $1 \/\/ /g; #
s/^SD\s*\((\d*)\.(\d*)\)/\\silence{\($1\.$2\)}/g;
# after each block with a "mrb@" line I need to add "\\endgl" and "\\xe"
# if there is a line starting with SD at the end of the block it needs to go between "\\endgl" and "\\xe"
print $fh_out $_;
}
print $fh_out "\\end{myenv}"; # ends group
Any help much appreciated!
Upvotes: 2
Views: 612
Reputation: 66899
The processing details are apparently complex; let's first clear up how to process blocks.
One way is to go line-by-line and accumulate lines for a block, until you get to an empty line. Then you process your block and clear the buffer, and keep going. For example
use warnings;
use strict;
use feature 'say';
sub process_block {
say "Block:"; say "\t$_" for @{$_[0]};
}
my $file = shift // die "Usage: $0 filename\n"; #/
open my $fh, '<', $file or die "Can't open $file: $!";
my @block;
while (<$fh>) {
chomp;
if (not /\S/) {
if (@block) { # the first empty line
process_block(\@block);
@block = ();
}
next;
}
push @block, $_;
}
process_block(\@block) if @block; # last block may have remained
The process_block
call after the while
loop doesn't fire for the shown sample, since there are empty lines before the end of the file so the last block gets processed inside the loop. But we need to ensure that the last block is processed when there are no empty lines at the end as well.
Inside process_block
you can now check whether @block
contains mrb@#Name
, apply other (apparently complex) conditions, run regex, and print processed lines.
Here is an example, following clarifications but still leaving out some details
use List::Util qw(any); # used to be in List::MoreUtils
sub process_block {
my @block = @{ $_[0] }; # local copy, to not change @block in caller
if ($block[0] =~ /^txt\@/ and any { /^mrb\@/ } @block) {
for (@block) {
s{^txt\@#(\S*)\s+(.*)}
{\\ex[exno=\\spkr{$1}, exnoformat=X] \\begingl \\glpreamble $2 // }g; #/
s{^mrb\@#\S*\s+(.*)}{\\gla $1 // }g;
# etc
}
if ($block[-1] =~ /^\s*SD/) {
my $SD_line = pop @block;
push @block, '\endgl', $SD_line, '\xe';
}
else {
push @block, '\endgl', '\xe';
}
}
else {
for (@block) {
s/^txt\@#(\S*)\s+(.*)/\\ex[exno=\\spkr{$1}, exnoformat=X] \\txt $2 /g;
s/^tli\@#\S*\s+(.*)/\\translit $1 /g;
# etc
}
push @block, '\xe';
}
say for @block;
say "\n"; # two lines to separate blocks
}
A note on efficiency.
This code processes each line in a block against all regex substitutions, to find the one that applies to it. The distinguishing pattern comes right at the beginning so "wrong" lines fail right away but we still run the regex engine for all checks for each line.
This may (or may not) be a problem with many regex or long blocks or if done often, and it can be optimized if it is slow. Since the list of substitutions is always the same we can build a hash with regex keyed by the distinguishing start of the pattern (as a dispatch table). For example
my %repl_non_mrb = (
'txt@' => sub { s/^txt\@#(\S*)\s+(.*)/\\ex[exno=\\spkr{$1}, exnoformat=X] \\txt $2 /g }
'tli@' => sub { s/^tli\@#\S*\s+(.*)/\\translit $1 /g },
...
);
my %repl_mrb = ( ... );
and then use it along the lines of
# For blocks without 'mrb@'
for (@block) {
# Capture key: up to # for 'txt@' (etc), up to \s for 'SD'. Other cases?
my ($key) = /^(.*?)(?:#|\s)/;
if ($key and exists $repl_non_mrb{$key}) {
$repl_non_mrb{$key}->(); # run the coderef
}
else { say "No processing key (?) for: $_" } # some error?
}
This clearly needs more (careful) work, while there are also other ways to organize those regex. But an implementation of these (fixed) regex substitutions hashed by their distinguishing patterns will surely improve on the O(NM) complexity of always running all regex on each line.
Another way is what you inquire about
I know perl can do block by block
what can be done by setting the $/
variable. It sets what is then used as the separator between input records. If you set it to \n\n
here you get a block served for each read, in a string
open my $fh, '<', $file or die "Can't open $file: $!";
PROCESS_FILE: {
local $/ = "\n\n";
while (my $block = <$fh>) {
chomp $block;
say "|$block|";
}
};
I put this inside a block (named PROCESS_FILE
just so) so that we can change $/
by using local. Then its previous value is restored as the block is exited and files are again read normally.
However, I don't see a benefit of doing this here since you now have a block in a scalar variable, while what you need to do seems to be line oriented. So I'd recommend the first approach.
Upvotes: 3