Reputation: 563
I have two files. domain.txt contains some domains:
facebook.com
google.com
yahoo.com
site.txt contains some sites under those domains, and their URL numbers:
music.google.com 2
image.google.com 3
music.facebook.com 8
image.facebook.com 4
map.yahoo.com 4
new.yahoo.com 7
I want to select the sites whose URL number is bigger than the average URL number of their domain. For example, the average URL number of google.com is (2+3)/2=2.5, so image.google.com will be picked.
I wrote an awk script like this:
BEGIN {
    #read all domains into memory
    while(getline dom < "./domain.txt" > 0){
        domain[dom]=0;
    }
    #count URLs number and sites number under each domain
    for (dom in domain){
        sitenumber=0;
        close("./site.txt")
        while(getline < "./site.txt" >0){
            if(match($1,"."dom"$")){
                domain[dom]+=$2;
                sitenumber++;
                printf("%s\n",$0) >> "./sitesunderdomain";
            }
        }
        avgsitenumber = domain[dom]/sitenumber;
        system("cat ./sitesunderdomain") #test output
        close("./sitesunderdomain")
        while(getline < "./sitesunderdomain" >0){ #loop A
            print "why1" #test output
            if($2>=avgsitenumber){
                print "why2" #testoutput
                print $0,avgsitenumber>>"./result"
            }
        }
        system("> ./sitesunderdomain")
    }#for
}
Then I ran the awk script in bash and got this output:
music.facebook.com 8
image.facebook.com 4
why1
why2
why1
music.google.com 2
image.google.com 3
map.yahoo.com 4
news.yahoo.com 7
And ./result was:
music.facebook.com 8 6
But I expected the output to be:
music.facebook.com 8
image.facebook.com 4
why1
why2
why1
music.google.com 2
image.google.com 3
why1
why2
why1
map.yahoo.com 4
news.yahoo.com 7
why1
why2
why1
And the ./result should be:
music.facebook.com 8 6
image.google.com 3 2.5
news.yahoo.com 7 5.5
It seems that at the loop A position, getline returns 0 when dom is google.com or yahoo.com.
Why?
Upvotes: 0
Views: 156
Reputation: 46856
I'm having trouble understanding your script. There's no need to manually open files like that; awk takes care of that by itself. If your code can be fixed, I'm not the one to do it.
Here's what I came up with instead:
#!/usr/bin/awk -f
{
    domain=$1; sub(/^[a-z]*\./, "", domain);
    mean[domain]=(mean[domain]*count[domain]+$2)/++count[domain];
    score[$1]=$2;
}
END {
    printf("%7s\t%6s\t%s\n", "score", "mean", "domain");
    for (hostname in score) {
        domain=hostname; sub(/^[a-z]*\./, "", domain);
        if (score[hostname] > mean[domain]) {
            printf("%6d\t%6.2f\t%s\n", score[hostname], mean[domain], hostname);
        }
    }
}
When I run it against your data, I get the following results:
score mean domain
3 2.50 image.google.com
8 6.00 music.facebook.com
7 5.50 new.yahoo.com
Is that the output you're expecting?
Upvotes: 1
Reputation: 36272
Your code is a mess. That is not the way to work with awk. Awk automatically opens and reads your files line by line for you; it's not your job to do that with getline. That is for special cases only.
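A minimal sketch of that implicit read loop, assuming a scratch directory and cut-down copies of the two input files (the sample lines are taken from the question, the directory is hypothetical). No getline is involved: awk opens each file named on the command line and runs the rule once per line.

```shell
# Build small sample inputs in a scratch directory (hypothetical location).
tmpdir=$(mktemp -d)
printf '%s\n' facebook.com google.com > "$tmpdir/domain.txt"
printf '%s\n' 'music.google.com 2' > "$tmpdir/site.txt"

# awk opens domain.txt and then site.txt on its own.
# FILENAME is the file being read, FNR the line number within that file.
listing=$(cd "$tmpdir" && awk '{ print FILENAME ":" FNR ": " $0 }' domain.txt site.txt)

printf '%s\n' "$listing"
rm -rf "$tmpdir"
```

Each output line shows which file and which line awk handed to the rule.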
First of all:
close("./site.txt")
while(getline < "./site-test" >0){
./site-test? Your file is site.txt. It died in my test.
Second: There is no need to create files when you can reuse data directly from RAM, like with arrays.
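As a rough sketch of what keeping everything in arrays could look like (this is not the asker's script; the array names known, site, urls, dom, total and count are made up for illustration, and it uses a strict "bigger than" comparison as described in the question):

```shell
# Recreate the sample data from the question in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' facebook.com google.com yahoo.com > "$tmpdir/domain.txt"
printf '%s\n' 'music.google.com 2' 'image.google.com 3' \
    'music.facebook.com 8' 'image.facebook.com 4' \
    'map.yahoo.com 4' 'new.yahoo.com 7' > "$tmpdir/site.txt"

result=$(awk '
    NR == FNR { known[$1] = 1; next }     # first file (domain.txt): remember the domains
    {
        parent = $1
        sub(/^[^.]*\./, "", parent)       # music.google.com -> google.com
        if (!(parent in known)) next      # skip sites outside the listed domains
        n++
        site[n] = $1; urls[n] = $2; dom[n] = parent
        total[parent] += $2               # running sum of URL numbers per domain
        count[parent]++                   # number of sites per domain
    }
    END {
        # Everything is in memory now; no temp file is needed.
        for (i = 1; i <= n; i++) {
            avg = total[dom[i]] / count[dom[i]]
            if (urls[i] > avg) print site[i], urls[i], avg
        }
    }
' "$tmpdir/domain.txt" "$tmpdir/site.txt")

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

The sites are stored under a numeric index so the END block can print them in input order; a plain for-in loop over an awk array has no guaranteed order.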
Third: I don't like your code at all, but to fix it, close your ./sitesunderdomain temp file between reading it with getline and the deletion in system("> ./sitesunderdomain"), like:
## NOT here.
##close("./sitesunderdomain")
while(getline < "./sitesunderdomain" >0){ #loop A
    print "why1" #test output
    if($2>=avgsitenumber){
        print "why2" #testoutput
        print $0,avgsitenumber>>"./result"
    }
}
## Better here between the read and the write.
close("./sitesunderdomain")
system("> ./sitesunderdomain")
Now run the script like:
awk -f myscript.awk domain.txt site.txt
And check output:
cat result
With the following result:
music.facebook.com 8 6
image.google.com 3 2.5
new.yahoo.com 7 5.5
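The effect behind that fix can be reproduced in isolation. This is only a sketch under my own assumptions (scratch.txt is a hypothetical file, and the counters are made-up names): once getline has read a file to the end, awk keeps the stream open at end-of-file, so another read returns 0 until close() is called on that name.

```shell
# A two-line scratch file (hypothetical name) to read repeatedly.
tmpdir=$(mktemp -d)
scratch="$tmpdir/scratch.txt"
printf '%s\n' 'image.google.com 3' 'music.google.com 2' > "$scratch"

report=$(awk -v f="$scratch" '
BEGIN {
    while ((getline line < f) > 0) first++      # reads both lines, stops at end of file
    while ((getline line < f) > 0) again++      # stream is still open at EOF: reads nothing
    close(f)                                    # drop the stream
    while ((getline line < f) > 0) reopened++   # file is reopened and read from the start
    print first + 0, again + 0, reopened + 0    # + 0 prints 0 for a counter never incremented
}')

printf '%s\n' "$report"
rm -rf "$tmpdir"
```

The middle count stays at 0, which matches the symptom at loop A for the second and third domains.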
Upvotes: 2