Hancy

Reputation: 563

An awk script bug I can't fix

I have two files. domain.txt contains some domains:

facebook.com
google.com
yahoo.com

site.txt contains some sites under those domains, each with its URL number:

music.google.com  2
image.google.com  3
music.facebook.com  8
image.facebook.com  4
map.yahoo.com   4
new.yahoo.com   7

I want to select the sites whose URL number is bigger than the average URL number of their domain. For example, the average URL number of google.com is (2+3)/2=2.5, so image.google.com will be picked.

I wrote an awk script like this:

BEGIN {
        #read all domains into memory
        while(getline dom < "./domain.txt" > 0){
                domain[dom]=0;
        }

        #count URLs number and sites number under each domain
        for (dom in domain){
                sitenumber=0;

                close("./site.txt")
                while(getline < "./site.txt" >0){
                        if(match($1,"."dom"$")){
                                domain[dom]+=$2;
                                sitenumber++;
                                printf("%s\n",$0) >> "./sitesunderdomain";
                        } 
                }

                avgsitenumber = domain[dom]/sitenumber;
                system("cat ./sitesunderdomain") #test output

                close("./sitesunderdomain")
                while(getline < "./sitesunderdomain" >0){ #loop A
                        print "why1" #test output
                        if($2>=avgsitenumber){
                                print "why2"  #testoutput
                                print $0,avgsitenumber>>"./result"
                        }
                }
                system("> ./sitesunderdomain")
        }#for
}

Then I ran the awk script in bash and got this output:

music.facebook.com  8
image.facebook.com 4
why1
why2
why1
music.google.com   2
image.google.com  3
map.yahoo.com  4
news.yahoo.com  7

And ./result was:

music.facebook.com  8  6

But I expected the output to be:

music.facebook.com  8
image.facebook.com 4
why1
why2
why1
music.google.com   2
image.google.com  3
why1
why2
why1
map.yahoo.com  4
news.yahoo.com  7
why1
why2
why1

And the ./result should be:

music.facebook.com  8  6
image.google.com  3  2.5
news.yahoo.com  7  5.5

It seems that at the loop A position, getline returns 0 when dom is google.com or yahoo.com. Why?

Upvotes: 0

Views: 156

Answers (2)

ghoti

Reputation: 46856

I'm having trouble understanding your script. There's no need to manually open files like that; awk takes care of that by itself. If your code can be fixed, I'm not the one to do it.

Here's what I came up with instead:

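#!/usr/bin/awk -f

# Main rule: runs once per input line (hostname in $1, URL count in $2).
{
  # Strip the leading label to get the domain: music.google.com -> google.com.
  domain=$1; sub(/^[a-z]*\./, "", domain);
  # Keep a per-domain total and count; the mean is derived in END.
  # (This avoids reading and incrementing count[domain] in one expression.)
  sum[domain]+=$2; count[domain]++;
  score[$1]=$2;
}

END {
  printf("%7s\t%6s\t%s\n", "score", "mean", "domain");
  for (hostname in score) {
    domain=hostname; sub(/^[a-z]*\./, "", domain);
    mean=sum[domain]/count[domain];
    # Print only hosts whose URL count is above their domain's mean.
    if (score[hostname] > mean) {
      printf("%6d\t%6.2f\t%s\n", score[hostname], mean, hostname);
    }
  }
}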

When I run it against your data, I get the following results:

  score   mean  domain
     3    2.50  image.google.com
     8    6.00  music.facebook.com
     7    5.50  new.yahoo.com

Is that the output you're expecting?
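The script above derives each domain by stripping the first label and never reads domain.txt. If the domain list should be honored, a two-file variant is one option. The following is only a sketch under assumptions: the temporary directory, the variable names (listed, sum, cnt, parent, urls) and the final sort are illustrative, and every site is assumed to have exactly one label in front of its domain.

```shell
#!/bin/sh
# Sample data copied from the question, written to a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' facebook.com google.com yahoo.com > "$tmp/domain.txt"
printf '%s\n' 'music.google.com 2' 'image.google.com 3' \
              'music.facebook.com 8' 'image.facebook.com 4' \
              'map.yahoo.com 4' 'new.yahoo.com 7' > "$tmp/site.txt"

result=$(awk '
  # First file (domain.txt): remember each listed domain.
  NR == FNR { listed[$1] = 1; next }

  # Second file (site.txt): strip the first label to get the parent domain
  # (assumption: exactly one label in front of the domain).
  {
    dom = $1
    sub(/^[^.]+\./, "", dom)
    if (dom in listed) {
      sum[dom] += $2        # total URL number per domain
      cnt[dom]++            # number of sites per domain
      parent[$1] = dom
      urls[$1] = $2 + 0     # force a numeric value
    }
  }

  # All input is read: print the sites that are above their domain average.
  END {
    for (site in urls) {
      dom = parent[site]
      avg = sum[dom] / cnt[dom]
      if (urls[site] > avg)
        print site, urls[site], avg
    }
  }
' "$tmp/domain.txt" "$tmp/site.txt" | sort)

printf '%s\n' "$result"
rm -rf "$tmp"
```

The order of a for-in loop is unspecified in awk, which is why the sketch pipes the result through sort. Sites whose parent domain is not listed in domain.txt are skipped.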

Upvotes: 1

Birei

Reputation: 36272

Your code is a mess. That is not the way to work with awk. Awk automatically opens and reads your files line by line for you; it's not your job to do that with getline, which is for special cases only.

First of all:

close("./site.txt")
while(getline < "./site-test" >0){

./site-test? Your file is site.txt. It died in my test.

Second: There is no need to create files when you can reuse data directly from RAM, like with arrays.

Third: I don't like your code at all, but to fix it, close your ./sitesunderdomain temp file between reading it with getline and truncating it with system("> ./sitesunderdomain"), like this:

## NOT here.
##close("./sitesunderdomain")

while(getline < "./sitesunderdomain" >0){ #loop A
        print "why1" #test output
        if($2>=avgsitenumber){
                print "why2"  #testoutput
                print $0,avgsitenumber>>"./result"
        }
}

## Better here between the read and the write.
close("./sitesunderdomain")

system("> ./sitesunderdomain")

Now run the script like:

awk -f myscript.awk domain.txt site.txt

And check output:

cat result

With the following result:

music.facebook.com  8 6
image.google.com  3 2.5
new.yahoo.com   7 5.5
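The reason the position of close() matters in the third point can be seen in isolation: once getline has read a file to the end, further getline calls on the same name keep returning 0 until close() is called on that name. The sketch below is illustrative only; the scratch file and the variable names (f, line, n) are assumptions, not part of the original script.

```shell
#!/bin/sh
# Scratch file with two sample lines from the question.
tmp=$(mktemp -d)
printf '%s\n' 'music.google.com 2' 'image.google.com 3' > "$tmp/sites"

result=$(awk -v f="$tmp/sites" '
  BEGIN {
    # First pass: read the file to the end.
    n = 0
    while ((getline line < f) > 0) n++
    print "first pass read", n, "lines"

    # Without close(), the stream is still open and sitting at end-of-file,
    # so the loop body never runs.
    n = 0
    while ((getline line < f) > 0) n++
    print "second pass without close read", n, "lines"

    # close() discards the stream; the next getline reopens f from the top.
    close(f)
    n = 0
    while ((getline line < f) > 0) n++
    print "third pass after close read", n, "lines"
  }
')

printf '%s\n' "$result"
rm -rf "$tmp"
```

The parentheses around the getline expression are a precaution: written as getline < file > 0, the grouping of the redirection and the comparison is easy to misread.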

Upvotes: 2
