Johan Wikström

Reputation: 4239

Read text file in Perl word by word instead of line by line

I have a big (300 kB) text file containing words delimited by spaces. Now I want to open this file and process every word in it one by one.

The problem is that Perl reads the file line by line (i.e. the entire file at once), which gives me strange results. I know the normal way is to do something like this:

open($inFile, 'tagged.txt') or die $!;
$_ = <$inFile>;
@splitted = split(' ',$_);
print $#splitted;

But this gives me a faulty word count (too large array?).

Is it possible to read the text file word by word instead?

Upvotes: 6

Views: 19037

Answers (4)

cur4so

Reputation: 1820

300K doesn't seem to be big, so you may try:

my $text=`cat t.txt` or die $!;
my @words = split /\s+/, $text;
foreach my $word (@words) {
  # process $word here
}

or a slightly modified version of squiguy's solution:

use strict;
use warnings;

my @words;
open (my $inFile, '<', 'tagged.txt') or die $!;

while (<$inFile>) {
  # split ' ' skips leading whitespace, so no empty first field
  push @words, split ' ';
}
close ($inFile);
foreach my $word (@words) {
  # process $word here
}

Upvotes: 1

Borodin

Reputation: 126722

It's unclear what your input file looks like, but you imply that it contains just a single line composed of many "words".

300KB is far from a "big text file". You should read it in its entirety and pull the words from there one by one. This program demonstrates how:

use strict;
use warnings;

my $data = do {
  open my $fh, '<', 'data.txt' or die $!;
  local $/;
  <$fh>;
};

my $count = 0;
while ($data =~ /(\S+)/g ) {
  my $word = $1;
  ++$count;
  printf "%2d: %s\n", $count, $word;
}

output

 1: alpha
 2: beta
 3: gamma
 4: delta
 5: epsilon

Without more explanation of what a "faulty word count" might be it is very hard to help, but it is certain that the problem isn't because of the size of your array: if there was a problem there then Perl would raise an exception and die.

But if you are comparing the result with the statistics from a word processor, then it is probably because the definition of "word" is different. For instance, the word processor may consider a hyphenated word to be two words.
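
To illustrate how the definition of "word" changes the count, here is a minimal sketch. The sample string and the two helper names (`count_by_whitespace`, `count_by_word_chars`) are made up for illustration, and neither rule is necessarily what any particular word processor uses.

```perl
use strict;
use warnings;

# Count maximal runs of non-whitespace characters.
sub count_by_whitespace {
    my ($text) = @_;
    my $count = () = $text =~ /\S+/g;
    return $count;
}

# Count maximal runs of word characters, so a hyphen splits a word.
sub count_by_word_chars {
    my ($text) = @_;
    my $count = () = $text =~ /\w+/g;
    return $count;
}

my $sample = "a well-known state-of-the-art method";   # hypothetical input
printf "whitespace-delimited: %d\n", count_by_whitespace($sample);
printf "word-character runs:  %d\n", count_by_word_chars($sample);
```

For this sample the first rule finds 4 words and the second finds 8, so a mismatch of that kind does not mean either count is broken.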

Upvotes: 2

RobEarl

Reputation: 7912

To read the file one word at a time, change the input record separator ($/) to a space:

local $/ = ' ';

Example:

#!/usr/bin/perl
use strict;
use warnings;

use feature 'say';

{
    local $/ = ' ';

    while (<DATA>) {
        say;
    }
}

__DATA__
one two three four five

Output:

one 
two 
three 
four 
five
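
Note that each record keeps its trailing separator, newlines are not treated as separators, and two adjacent spaces produce a record with no word in it. A sketch that tidies this up, assuming words may be separated by any whitespace (the `words_from_handle` name and the in-memory sample are illustrative only):

```perl
use strict;
use warnings;

# Read space-delimited records from a filehandle and return clean words.
sub words_from_handle {
    my ($fh) = @_;
    local $/ = ' ';                 # record separator is a single space
    my @words;
    while (my $record = <$fh>) {
        # A record may still contain newlines or tabs, and holds no word
        # at all when two spaces are adjacent, so split on whitespace
        # rather than just chomping.
        push @words, split ' ', $record;
    }
    return @words;
}

# Hypothetical sample data, opened as an in-memory file.
my $sample = "one two  three\nfour five\n";
open my $fh, '<', \$sample or die $!;
print "$_\n" for words_from_handle($fh);
close $fh;
```

This still reads only up to the next space at a time, so memory use stays small as long as the file does contain spaces.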

Upvotes: 4

squiguy

Reputation: 33370

Instead of reading it in one fell swoop, try the line-by-line approach, which is also easier on your machine's memory (although 300 KB isn't too large for modern computers).

use strict;
use warnings;

my @words;
open (my $inFile, '<', 'tagged.txt') or die $!;

while (<$inFile>) {
  chomp;
  @words = split(' ');
  foreach my $word (@words) {
    # process $word here
  }
}

close ($inFile);
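
If all you need is a word count, the same loop can accumulate it without keeping the words. This is only a sketch: `count_words` and the in-memory sample are hypothetical. Two details from the question's code may matter here: `$_ = <$inFile>` reads a single line, and `$#splitted` is the last index of the array, which is one less than the number of elements.

```perl
use strict;
use warnings;

# Count whitespace-delimited words on a filehandle, one line at a time.
sub count_words {
    my ($fh) = @_;
    my $total = 0;
    while (my $line = <$fh>) {
        my @words = split ' ', $line;
        $total += scalar @words;    # number of elements, not $#words
    }
    return $total;
}

# Hypothetical sample data; replace with a handle on the real file.
my $sample = "alpha beta\ngamma delta epsilon\n";
open my $fh, '<', \$sample or die $!;
print count_words($fh), "\n";
close $fh;
```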

Upvotes: 5
