How to find position of a word by using a counter?

Question

I am currently working on code that changes certain words to their Shakespearean equivalents. I have to extract the sentences that contain those words and print them to another file. I also had to remove .START from the beginning of each file.

First I split the text of each file on spaces, so now I have the words. Next, I checked each word against a hash. The hash keys and values come from a tab-delimited file structured as OldEng/ModernEng (lc_Shakespeare_lexicon.txt). Right now, I'm trying to figure out how to find the exact position of each modern English word that is found and change it to the Shakespearean word, then find the sentences with the changed words and print them to a different file. Most of the code is finished except for this last part. Here is my code so far:

#!/usr/bin/perl -w
use diagnostics;
use strict;

#Declare variables
my $counter=();
my %hash=();
my $conv1=();
my $conv2=();
my $ssph=();
my @text=();
my $key=();
my $value=();
my $conversion=();
my @rmv=();
my $splits=();
my $words=();
my @word=();
my $vals=();
my $existingdir='/home/nelly/Desktop';
my @file='Sentences.txt'; 
my $eng_words=();
my $results=();
my $storage=();

#Open file to tab delimited words

open (FILE,"<", "lc_shakespeare_lexicon.txt") or die "could not open lc_shakespeare_lexicon.txt\n";

#split words by tabs 

while (<FILE>){ 
    chomp($_);
    ($value, $key)= (split(/\t/), $_);
    $hash{$value}=$key; 
}   

#open directory to Shakespearean files

my $dir="/home/nelly/Desktop/input"; 
opendir(DIR,$dir) or die "can't opendir Shakespeare_input.tar.gz";
#Use grep to get WSJ file and store into an array

my @array= grep {/WSJ/} readdir(DIR);

#store file in a scalar
foreach my $file(@array){

    #open files inside of input

    open (DATA,"<", "/home/nelly/Desktop/input/$file") or die "could not open $file\n";
    #loop through each file

    while (<DATA>){
        @text=$_;
        chomp(@text);
    #Remove .START
    @rmv=grep(!/.START/, @text);

foreach $splits(@rmv){
    #split data into separate words
    @word=(split(/ /, $splits));
    #Loop through each word and replace with Shakespearean word that exists
    $counter=0;

foreach $words(@word){
        if (exists $hash{$words}){
            $eng_words= $hash{$words};
            $results=$counter;
            print "$counter\n";
            $counter++;

#create a new directory and store senteces with Shakespearean words in new file called "Sentences.txt"
mkdir $existingdir unless -d $existingdir; 
open my $FILE, ">>", "$existingdir/@file", or die "Can't open $existingdir/conversion.txt'\n";
#print $FILE "@words\n";

close ($FILE);

                }           
            }
        }
    }   
}

close (FILE);
close (DIR);

Borodin · Accepted Answer

Natural language processing is very hard to get right except in trivial cases. For instance, it is difficult to define exactly what is meant by a word or a sentence, and it is awkward to distinguish between a single quote and an apostrophe when both are represented by the U+0027 "apostrophe" character (').

Without any example data it is difficult to write a reliable solution, but the program below should be reasonably close.

Please note the following:

use warnings is preferable to -w on the shebang line
A program should contain as few comments as possible as long as it is comprehensible. Too many comments just make the program bigger and harder to grasp without adding any new information. The choice of identifiers should make the code mostly self documenting
I believe use diagnostics to be unnecessary. Most messages are fairly self-explanatory, and diagnostics can produce large amounts of unnecessary output
Because you are opening multiple files it is more concise to use autodie which will avoid the need to explicitly test every open call for success
It is much better to use lexical file handles, such as open my $fh ... instead of global ones, like open FH .... For one thing a lexical file handle will be implicitly closed when it goes out of scope, which helps to tidy up the program a lot by making explicit close calls unnecessary
I have removed all of the variable declarations from the top of the program except those that are non-empty. Declaring each variable where it is first used instead is considered to be best practice, as it aids debugging and assists the writing of clean code
The program lower-cases the original word using lc before checking to see if there is a matching entry in the hash. If a translation is found, then the new word is capitalised using ucfirst if the original word started with a capital letter
I have written a regular expression that will take the next sentence from the beginning of the string $contents. But this is one of the things that I can't get right without sample data, and there may well be problems, for instance, with sentences that end with a closing quotation mark or a closing parenthesis
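
To make the last point concrete, here is a minimal sketch that applies the same sentence pattern used in the program to a short string. The sample text is invented for illustration and is not taken from the WSJ input files.

```perl
use strict;
use warnings;

# Hypothetical sample text, invented for this illustration
my $text = 'He said hello. Mr. Smith left! Is it so?';

# Same pattern as in the program: a minimal run of characters ending in
# . ? or ! that is followed by whitespace and a capital letter, or by
# the end of the string
my @sentences = $text =~ / \s* (.+?[.?!]) (?= \s+ [A-Z] | \s* \z ) /gx;

print "[$_]\n" for @sentences;
```

With this input the pattern treats "Mr." as a sentence of its own, because the abbreviation is followed by whitespace and a capital letter. Abbreviations are therefore another likely source of wrong splits, alongside closing quotation marks and parentheses.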

use strict;
use warnings;
use autodie;

my $lexicon      = 'lc_shakespeare_lexicon.txt';
my $dir          = '/home/nelly/Desktop/input';
my $existing_dir = '/home/nelly/Desktop';
my $sentences    = 'Sentences.txt';

my %lexicon = do {
  open my ($fh), '<', $lexicon;
  local $/;
  reverse(<$fh> =~ /[^\t\r\n]+/g);
};

my @files = do {
  opendir my ($dh), $dir;
  grep /WSJ/, readdir $dh;
};

for my $file (@files) {

  my $contents = do {
    open my $fh, '<', "$dir/$file";
    join '', grep { not /\A\.START/ } <$fh>;
  };

  # Change any CR or LF to a space, and reduce multiple spaces to single spaces
  $contents =~ tr/\r\n/  /;
  $contents =~ s/ {2,}/ /g;

  # Find and process each sentence
  while ( $contents =~ / \s* (.+?[.?!]) (?= \s+ [A-Z] | \s* \z ) /gx ) {
    my $sentence = $1;
    my @words    = split ' ', $sentence;
    my $changed;

    for my $word (@words) {
      my $eng_word = $lexicon{lc $word};
      next unless defined $eng_word;    # avoids an "uninitialized value" warning from ucfirst
      $eng_word = ucfirst $eng_word if $word =~ /\A[A-Z]/;
      $word = $eng_word;
      ++$changed;
    }

    if ($changed) {
      mkdir $existing_dir unless -d $existing_dir;
      open my $out_fh, '>>', "$existing_dir/$sentences";
      print $out_fh "@words\n";
    }
  }
}
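
On the original question of finding the position of each word: in the question's loop, $counter is incremented only inside the if block, so it appears to count matches rather than positions. One way around this is to loop over the array indices instead of the words, so that the index itself is the position. The following is only a sketch under stated assumptions: the two-entry lexicon and the sample sentence are hypothetical, and, as in the program above, a word with punctuation attached (such as "hat.") would not match a lexicon entry.

```perl
use strict;
use warnings;

# Hypothetical lexicon (modern => Shakespearean), for illustration only
my %lexicon = ( you => 'thou', your => 'thy' );

# Hypothetical sample sentence
my $sentence = 'I think you lost your hat.';
my @words    = split ' ', $sentence;
my @positions;

# Iterate over indices so that $i is the zero-based position of each word
for my $i ( 0 .. $#words ) {
    my $replacement = $lexicon{ lc $words[$i] };
    next unless defined $replacement;
    $words[$i] = $replacement;
    push @positions, $i;
}

print "Changed positions: @positions\n";
print "@words\n";
```

Here @positions ends up holding 2 and 4, and the rewritten sentence is "I think thou lost thy hat." If one-based positions are wanted, push $i + 1 instead.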
