YukiMiku

Reputation: 31

How to stop the original string being counted in a character frequency calculation in Perl

I'm having trouble calculating character frequencies for a list of strings read from a text file using Perl. I want to count how many times each character occurs in each line separately. My code does that, but it also seems to count the original string as a single character.

Here is my code:

sub cal_frequency{
    while (my @row = <$fd>){
        chomp;
        for (my $i=0; $i<=147; $i++){
            my %count;
            print "$row[$i]\n";
            foreach my $str (split //, $row[$i]) {
                $count{$str}++;
            }
            foreach my $str (sort keys %count) {
                printf "%-31s %s\n", $str, $count{$str} ;
            }
        }
    }
}

This is the output I get.

HMHHMSHHHNHHMEHFFHHHHHHHDHDE
                               1
D                               2
E                               2
F                               2
H                               17
M                               3
N                               1
S                               1
LLLWLFLLWLWLLWWLLLLFLLLLWLFW
                               1
F                               3
L                               18
W                               7
TTTTTDTTTTTTTTTTSTTTTSSTTATT
                               1
A                               1
D                               1
...
...
...

Can anyone help me figure out what's wrong with the code?

Upvotes: 3

Views: 62

Answers (2)

Borodin

Reputation: 126742

it calculates the original string as a single character too.

If you're talking about the lines in your output like

HMHHMSHHHNHHMEHFFHHHHHHHDHDE

then that's because you print each line with

print "$row[$i]\n"

Here's how I would code your problem. In particular, I'm passing a file name to the subroutine, and using a global regex match to extract a list of all non-space characters, so that chomp is unnecessary.

I'm wondering whether you really want a dump of frequencies for each line in the input, or whether all the lines should be taken together. This code treats each line separately, as your own code does. I've printed a separator line between the blocks of output to make them distinct.

use strict;
use warnings 'all';

cal_frequency('cal_freq.txt');

sub cal_frequency {

    my ($file) = @_;

    open my $fh, '<', $file or die qq{Unable to open "$file" for input: $!};

    while ( <$fh> ) {

        my %count;

        ++$count{$_} for /\S/g;

        printf "%-2s %s\n", $_, $count{$_} for sort keys %count;

        print "---\n";
    }
}

output

D  2
E  2
F  2
H  17
M  3
N  1
S  1
---
F  3
L  18
W  7
---
A  1
D  1
S  3
T  23
---
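
If the lines should instead be taken together, one possible sketch keeps a single hash outside the loop. The sub name total_frequency and the two sample lines are made up for illustration, and the sub takes a filehandle so that it can be tried on an in-memory string without a data file.

```perl
use strict;
use warnings;

# Hypothetical variant: count non-space characters across the whole input.
sub total_frequency {
    my ($fh) = @_;

    my %count;    # declared once, outside the loop, so counts accumulate

    while ( my $line = <$fh> ) {
        ++$count{$_} for $line =~ /\S/g;
    }

    return \%count;
}

# Read from an in-memory string so the sketch runs without a data file.
my $data = "HMH\nLLW\n";
open my $fh, '<', \$data or die "Unable to open in-memory handle: $!";

my $count = total_frequency($fh);
printf "%-2s %s\n", $_, $count->{$_} for sort keys %$count;
```

The only real difference from the per-line version is where %count is declared.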

Upvotes: 0

simbabque

Reputation: 54333

You are chomping the wrong variable. You are reading your input into @row, but then you just call chomp; without an argument. That chomps $_, not @row.

while (my @row = <$fd>){
    chomp @row;
    # ...
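
As a small sketch of the difference, here is chomp applied to a list, using two made-up strings in place of the file contents:

```perl
use strict;
use warnings;

# Two hypothetical input lines, each still carrying its trailing newline.
my @row = ( "HMH\n", "LLW\n" );

# chomp with an explicit list argument strips the newline from every element.
# A bare `chomp;` would act on $_ instead and leave @row untouched.
chomp @row;

print length($_), "\n" for @row;    # each element is now 3 characters long
```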

You could have found that out yourself. If you output the actual content of %count for each of your input lines, you'll see that there is a newline \n that's being counted once. So this first line you are seeing

HMHHMSHHHNHHMEHFFHHHHHHHDHDE
                               1

does not mean the full string was counted once. The key with a count of 1 is the line break at the end of the string. Try this. I suggest you lower the loop bound so that only the first row or two are processed and you're not flooded with output.

for ( my $i = 0; $i <= 1; $i++ ) {

    # ...
    foreach my $str ( sort keys %count ) {
        printf "%-31s %s\n", $str, $count{$str};
    }
    use Data::Dumper;
    print Dumper \%count;
}

You'll now see this:

$VAR1 = {
          'A' => 1,
          'S' => 3,
          'T' => 23,
          '
' => 1,

And there is the linebreak.
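
The newline key is easier to spot if Data::Dumper is told to print control characters as escapes. A minimal sketch, assuming a short made-up input line; $Data::Dumper::Useqq and $Data::Dumper::Sortkeys are standard configuration variables of that module:

```perl
use strict;
use warnings;
use Data::Dumper;

# Print control characters as escapes such as "\n", and sort the keys
# so the output is stable from run to run.
$Data::Dumper::Useqq    = 1;
$Data::Dumper::Sortkeys = 1;

# Hypothetical input line that has not been chomped.
my $row = "TTA\n";

my %count;
$count{$_}++ for split //, $row;

# The newline now shows up as a visible "\n" key in the dump.
print Dumper \%count;
```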


Your code is a bit odd. You're doing a lot of unnecessary things. I'll go through them so that you can simplify your code.

Your while loop only gets run once because you're slurping the whole file into @row at once.

sub cal_frequency {
    while ( my @row = <DATA> ) {
        print "while iteration\n";

I omitted the printf from the output here.

while iteration
HMHHMSHHHNHHMEHFFHHHHHHHDHDE
LLLWLFLLWLWLLWWLLLLFLLLLWLFW
TTTTTDTTTTTTTTTTSTTTTSSTTATT

As you can see, there is only one while iteration. If you have a very long file, it's smarter to read it line by line.

while ( my $row = <DATA> ) {
    chomp $row;
    # ...
}

Once you do that, your C-style for loop becomes obsolete. In fact, that loop coupled the program to the input because you had the number of lines of input hard-coded there. Without this loop, the program can deal with arbitrarily long files (like the three lines you showed us and that I am using).

It's also good style to declare variables as late as possible and as close as possible to where you need them. I've moved my %count and changed indentation.

sub cal_frequency {
    while ( my $row = <DATA> ) {
        chomp $row;
        print "$row\n";

        my %count;
        foreach my $str ( split //, $row ) {
            $count{$str}++;
        }
        foreach my $str ( sort keys %count ) {
            printf "%-31s %s\n", $str, $count{$str};
        }
    }
}

It's now much clearer what's going on.

In your code you are using a lexical filehandle. That's great. But it seems to be created outside of the subroutine. It would be better to pass it as an argument.

Finally, the choice of the variable name $str in your two foreach loops is confusing. When I see "string", I think of words or other things that are long. But here you specifically want to break your string (the row) into characters. Those are strings with a length of one. So name them characters.

Here's the final program.

use strict;
use warnings;

sub cal_frequency {
    my ($fh) = @_;

    while ( my $row = <$fh> ) {
        chomp $row;
        print "$row\n";

        my %count;
        foreach my $chr ( split //, $row ) {
            $count{$chr}++;
        }
        foreach my $chr ( sort keys %count ) {
            printf "%-31s %s\n", $chr, $count{$chr};
        }
    }
}

cal_frequency(\*DATA);

__DATA__
HMHHMSHHHNHHMEHFFHHHHHHHDHDE
LLLWLFLLWLWLLWWLLLLFLLLLWLFW
TTTTTDTTTTTTTTTTSTTTTSSTTATT

And the output.

HMHHMSHHHNHHMEHFFHHHHHHHDHDE
D                               2
E                               2
F                               2
H                               17
M                               3
N                               1
S                               1
LLLWLFLLWLWLLWWLLLLFLLLLWLFW
F                               3
L                               18
W                               7
TTTTTDTTTTTTTTTTSTTTTSSTTATT
A                               1
D                               1
S                               3
T                               23
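
The same sub also works with a lexical filehandle opened on a file instead of DATA. This is only a sketch: it writes two made-up lines to a temporary file with File::Temp so that it runs on its own, where in practice you would open your real input file. The sub is repeated in a slightly shorter form to keep the example self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Same logic as the final program, repeated so this sketch is self-contained.
sub cal_frequency {
    my ($fh) = @_;

    while ( my $row = <$fh> ) {
        chomp $row;
        print "$row\n";

        my %count;
        $count{$_}++ for split //, $row;
        printf "%-31s %s\n", $_, $count{$_} for sort keys %count;
    }
}

# Create a throwaway sample file; this stands in for your real input.
my ( $tmp, $file ) = tempfile( UNLINK => 1 );
print {$tmp} "HMH\nLLW\n";
close $tmp or die "Unable to close $file: $!";

# The caller opens the file and hands the lexical handle to the sub.
open my $fh, '<', $file or die "Unable to open $file: $!";
cal_frequency($fh);
close $fh;
```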

Upvotes: 8
