Reputation: 1851

Searching one file but displaying relevant content from another using Perl

Here's the situation:

I've got two versions of a novel, both in txt format. One is in the original language and the other is a Chinese or English translation. When reading the original version, I sometimes want to take a quick look at the translation of a particular sentence. What I expect is this: when I type that sentence in its original language, the corresponding sentence from the translated version is displayed right away.

Here's my approach:

My original thinking was that since Perl knows the position of the line that matches the query (I learnt this from Chris' solution to my second post), all I need to do is let Perl use that position to display the content of the other file. But then I realized that shifting from one language to another is far more complicated. A single line in one language may turn out to be two or even three lines in another, and the difference builds up. Then I figured brian's solution to my third question would be useful again. A paragraph in one language is likely to stay one paragraph when translated, so I can let Perl treat a paragraph as a line. I've come up with the following code.

Here's my code:

#! perl

use warnings; use strict; 
use autodie; 
my $n;

my $file1 = "c:/FR.txt";
my $file2 = "c:/EN.txt";

print "INPUT YOUR QUERY:";
chomp(my $query=<STDIN>);

open my $fr,'<', $file1;
{
    local $/="\n\n"; #learnt from brian's solution to [my 3rd question][1]

    my @fr = <$fr>;
    close $fr;

    for (0 .. $#fr) {   #learnt from Chris' solution to [my 2nd question][2]
        if ($fr[$_] =~ /$query/i){
            $n = $_;
        }
    }
}

open my $eng,'<',$file2;
{
    local $/="\n\n";
    my @eng = <$eng>;
    close $eng;
    print $eng[$n];
}

Questions are here:

1: Is this a good approach to the problem?

2: When no match is found, I receive a warning message saying something like "Use of uninitialized value". Well, it's technical and yes, I know what it means. But is it possible to change this message to something like "Oops, no match is found"?

The test files are something like:

file1

Chapitre premier

Une petite ville

La petite ville de Verrières peut passer pour l’une des plus jolies de la
Franche-Comté....Espagnols, et maintenant ruinées.

Verrières est abrité du ... 
depuis la chute de Napoléon
 ...de presque toutes les maisons de Verrières.

à peine entre-t-on dans la ville ...
...
Eh ! elle est à M. le maire.

file2

CHAPTER 1

A Small Town

The small town of Verrieres may be regarded as one of the most
attractive....and now in
ruins.

Verrieres is sheltered ... since the fall of Napoleon, has led to the refacing
of almost all the houses in Verrieres.

No sooner has one entered the town ...Eh! It belongs to the Mayor.

If "La petite ville de" is searched, the output on screen should be:

The small town of Verrieres may be regarded as one of the most
attractive....and now in
ruins.

Thanks, as always, for any comments whatsoever :)

UPDATE1

Thanks for all the help!

Question 2 can now be solved with a few minor modifications, as Chris suggested:

if(defined $n) {
  open my $eng,'<',$file2;
  { local $/="\n\n";
    my @eng = <$eng>;
    close $eng;
    print $eng[$n];
  }
} else {
  print "Oops, no match found!\n";
}

UPDATE2

Chris' code should run much faster than mine when dealing with a huge file.

Upvotes: 1

Answers (4)

FMc

Reputation: 42421

Here's a different approach for you to consider:

use strict;
use warnings;
use File::Slurp qw(read_file);

my %para = map { $_ => Read_paragraphs("$_.txt") } qw(FR EN);

my $query = 'La petite ville de';
my @matches = 
    map  { $para{EN}[$_] }
    grep { $para{FR}[$_] =~ /$query/ }
    0 .. @{$para{FR}} - 1
;

print $_, "\n" for @matches;

sub Read_paragraphs {
    return [split /\n{2,}/, read_file(shift)];
}

Upvotes: 1

Chris Lutz

Reputation: 75469

To avoid the warning, you have to check whether or not $n is defined():

if(defined $n) {
  open my $eng,'<',$file2;
  { local $/="\n\n";
    <$eng> while --$n;
    print scalar <$eng>;
    close $eng;
  }
} else {
  print "No match found!\n";
}

I also rewrote the part that reads English. Rather than reading the entire file in and only using one line of it, it reads in $n - 1 lines and throws them away, and then prints the next line it reads (for real this time). This should have the same effect, but with a lower memory impact on large files. (If it doesn't, it's probably an off-by-one error because I'm tired.)

EDIT: It turns out this introduced a subtle bug. Your code to find the matching line does the same thing: slurps the file into an array, then finds the array index that matches. Let's convert this code to read line-by-line so that we don't get huge memory consumption issues:

open my $fr,'<', $file1;
{ local $/="\n\n"; 
  while(<$fr>) {
    $n = $. if /$query/i;
  }
}

I think you understand most of that: while(<$fr>) reads line-by-line from $fr and sets each line to $_ for the loop iteration, /$query/i will implicitly match against $_ (which is what we want), but you're probably curious about this little bugger: $n = $.. From perldoc perlvar:

HANDLE->input_line_number(EXPR)

$INPUT_LINE_NUMBER

$NR

$.

Current line number for the last filehandle accessed.

Each filehandle in Perl counts the number of lines that have been read from it. (Depending on the value of $/ , Perl's idea of what constitutes a line may not match yours.) When a line is read from a filehandle (via readline() or <> ), or when tell() or seek() is called on it, $. becomes an alias to the line counter for that filehandle.

You can adjust the counter by assigning to $. , but this will not actually move the seek pointer. Localizing $. will not localize the filehandle's line count. Instead, it will localize perl's notion of which filehandle $. is currently aliased to.

$. is reset when the filehandle is closed, but not when an open filehandle is reopened without an intervening close(). For more details, see "I/O Operators" in perlop. Because <> never does an explicit close, line numbers increase across ARGV files (but see examples in eof).

You can also use HANDLE->input_line_number(EXPR) to access the line counter for a given filehandle without having to worry about which handle you last accessed.

(Mnemonic: many programs use "." to mean the current line number.)

So if we found a match in your third paragraph, $. would be 3. As a general recommendation, read through the perlvar page every once in a while. There are some gems in there, and even if you don't understand what everything is for, you'll get it on a reread.
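Putting the two pieces together, a minimal self-contained sketch might look like the one below. The sub names find_paragraph_number and nth_paragraph are made up for illustration, in-memory filehandles stand in for FR.txt and EN.txt so the snippet runs on its own, and the \Q...\E quoting of the query is an extra assumption (it treats the query as literal text rather than as a regex). Like the loop above, it keeps the last matching paragraph.

```perl
use strict;
use warnings;

# Hypothetical helper: return the 1-based number of the last paragraph
# that matches $query, or undef if nothing matches.
sub find_paragraph_number {
    my ($fh, $query) = @_;
    local $/ = "\n\n";                       # read a paragraph at a time
    my $n;
    while (my $para = <$fh>) {
        $n = $. if $para =~ /\Q$query\E/i;   # $. counts paragraphs here
    }
    return $n;
}

# Hypothetical helper: return paragraph number $n (1-based) from $fh,
# or undef if the file has fewer than $n paragraphs.
sub nth_paragraph {
    my ($fh, $n) = @_;
    local $/ = "\n\n";
    my $para;
    $para = <$fh> for 1 .. $n;               # discard all but the nth read
    return $para;
}

# Toy data standing in for the two files (an assumption for this demo).
my $fr_text = "Chapitre premier\n\nUne petite ville\n\nLa petite ville de Verrieres ...\n";
my $en_text = "CHAPTER 1\n\nA Small Town\n\nThe small town of Verrieres ...\n";

open my $fr, '<', \$fr_text or die $!;
open my $en, '<', \$en_text or die $!;

my $n = find_paragraph_number($fr, 'La petite ville de');
print defined $n ? nth_paragraph($en, $n) : "No match found!\n";
```

Note that $. counts from 1, which is what the --$n skip loop in the earlier snippet expects; a 0-based array index would not work there.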

However, the final thing I have to say is that mobrule's advice about explicitly storing paragraph information is probably the best way to go. I might shy away from a homemade format, but I understand if XML or something is a little too heavyweight for your purposes. (Just know that your purposes are likely to expand greatly if you're not careful.)

Upvotes: 2

mob

Reputation: 118685

Just from the data entry vantage point, splitting the files on double newlines seems like an accident (or an embarrassing off-by-one error) waiting to happen. If the concepts of chapter and paragraph are the same in the two translations, you'd be safer including that information in the file. Something like ...

FR.txt --  :i,j,k  ==> Chapter i, Paragraph j, Sentence/Clause k
------
:1,1,0
Chapitre premier
:1,1,1
Une petite ville
...

EN.txt
------
...
:1,1,4
No sooner has one entered the town ...Eh! It belongs to the Mayor.
...

When you iterate through the French file, you keep track of the last piece of index information you saw when you found the right text, then you look for the same index information in the English file and print out the text that follows.
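A rough sketch of that lookup, assuming each index line looks exactly like ":i,j,k" and everything up to the next index line is that entry's text. The sub names read_indexed and lookup_translation are invented for this example, and the in-memory strings stand in for the two files.

```perl
use strict;
use warnings;

# Hypothetical helper: parse the ":i,j,k" format into a list of
# [index, text] pairs in file order. Lines before the first index
# line (such as a header) are ignored.
sub read_indexed {
    my ($fh) = @_;
    my (@entries, $current);
    while (my $line = <$fh>) {
        if ($line =~ /^:(\d+,\d+,\d+)\s*$/) {
            $current = [$1, ''];
            push @entries, $current;
        }
        elsif ($current) {
            $current->[1] .= $line;
        }
    }
    return \@entries;
}

# Hypothetical helper: return the target text whose index equals that of
# the first source entry matching $query, or undef if there is none.
sub lookup_translation {
    my ($source, $target, $query) = @_;
    my %target_by_index;
    $target_by_index{ $_->[0] } = $_->[1] for @$target;
    for my $entry (@$source) {
        return $target_by_index{ $entry->[0] }
            if $entry->[1] =~ /\Q$query\E/i;   # query treated as literal text
    }
    return;
}

# Toy data in the format above (an assumption for this demo).
my $fr_data = ":1,1,0\nChapitre premier\n:1,1,1\nUne petite ville\n";
my $en_data = ":1,1,0\nCHAPTER 1\n:1,1,1\nA Small Town\n";
open my $fr_fh, '<', \$fr_data or die $!;
open my $en_fh, '<', \$en_data or die $!;

my $hit = lookup_translation(read_indexed($fr_fh), read_indexed($en_fh), 'petite ville');
print defined $hit ? $hit : "No match found!\n";
```

Because the English side is keyed by index rather than by position, an extra blank line in either file no longer shifts every later paragraph.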

In addition to making you less vulnerable to input errors (typing an extra newline somewhere), this approach gives you additional ways to organize the data. Maybe someday you will sort the French text alphabetically to let you find text faster, while keeping the English text ordered by index to look up text by index. Maybe in the future you will retrieve this data out of a database.

To answer your second question, it is possible to massage your warning messages, but it's not something that beginners usually try to do. It involves installing a __WARN__ handler. The perldoc for warn gives a gentle enough introduction to the concept. For your application it might look something like:

$SIG{__WARN__} = sub {
    my $msg = shift;
    if ($msg =~ /Use of uninitialized value/) {
        warn "Oops! No value was found.\n";      # ok to call "warn" inside handler
    } else {
        warn $msg;
    }
};

Upvotes: 2

user181548

Reputation:

(This is an answer to part 1 of the question only)

I've actually made a working "translated text search". I just used percentage offsets into the file. This worked for short texts but quickly broke down for texts of any real length.

my $offset = $offset_of_passage_in_text1 * length ($text2)/length ($text1);

The margin of error compared to the length of the text gets bigger and bigger. For a whole book, I don't think that approach has much hope.
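For what it's worth, the proportional guess can be sketched as below. The sub name guess_passage and the $window parameter are made up for illustration, and the toy strings are not real text. It only shows the arithmetic, not a fix for the drift.

```perl
use strict;
use warnings;

# Hypothetical helper: scale the character offset of $query in $text1 to
# the length of $text2, and return $window characters around that guess.
# Returns undef if $query does not occur in $text1.
sub guess_passage {
    my ($text1, $text2, $query, $window) = @_;
    my $pos = index($text1, $query);
    return if $pos < 0 or !length $text1;
    my $offset = int($pos * length($text2) / length($text1));
    my $start  = $offset - int($window / 2);   # centre the window on the guess
    $start = 0 if $start < 0;
    return substr($text2, $start, $window);
}

# Toy strings (an assumption for this demo).
my $original    = "aaaa bbbb cccc";
my $translation = "AAAAAAAA BBBBBBBB CCCCCCCC";
my $guess = guess_passage($original, $translation, 'cccc', 8);
print defined $guess ? "$guess\n" : "No match found!\n";
```

The window has to grow with the text, since the guess is only as good as the assumption that both versions expand at a constant rate.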

One suggestion is to send the second language text to Google translate or just bung it through some kind of s/(\w+)/$dictionary{$1}/ substitution, then search for key words in the translated text to locate the likely position of the translation.

Here is a rough sketch of the code to make this work:

open my $dictionary_file, "<:utf8", "name_of_file_containing_English_and_Chinese"
    or die $!;
my %dictionary;
while (<$dictionary_file>) {
     my ($english, $chinese) = split;
     $dictionary{$english} = $chinese;
}
close $dictionary_file or die $!;
my $crude_translation = $english_text;
$crude_translation =~ s/(\w+)/$dictionary{$1}/g;

I haven't tested this. The last line doesn't attempt to catch errors caused by words which are not in the dictionary.
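One possible way to handle the missing words is to leave them untouched. The following is only a sketch: the sub name crude_translate is invented, and the two-entry dictionary is toy data rather than a real English-Chinese word list.

```perl
use strict;
use warnings;

# Hypothetical helper: replace each word that has a dictionary entry and
# keep unknown words as they are, instead of substituting undef.
sub crude_translate {
    my ($text, $dict) = @_;
    $text =~ s/(\w+)/exists $dict->{$1} ? $dict->{$1} : $1/ge;
    return $text;
}

my %dictionary = (small => 'petite', town => 'ville');   # toy entries only
print crude_translate('A small town', \%dictionary), "\n";   # prints "A petite ville"
```

The /e modifier evaluates the replacement as Perl code, which is what makes the exists check possible inside the substitution.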

Upvotes: 2
