Reputation: 719
I have a folder, my_folder, which contains over 800 files named myfile_*.dat, where * is the unique ID for each file. Each file contains a set of repeated fields, but the one I am interested in is the <rating> field. Lines of this field look like <rating>n, where n is the rating score. I have a script which sums up all of the ratings per file, but now I must divide each sum by the number of lines matching <rating>n in order to obtain an average rating per file. Here is my script:
dir=$1
cd "$dir" || exit 1
grep -P -o '(?<=<rating>).*' * |awk -F: '{A[$1]+=$2;next}END{for(i in A){print i,A[i]}}'|sort -nr -k2
I figure I could use grep -c '<rating>' myfile_*.dat to count the matching lines in each file and then divide each sum by that count, but I do not know where to put this in my script. Any suggestions are appreciated.
My script takes the folder name as an argument in the command line.
INPUT FILE
<Overall Rating>
<Avg. Price>$155
<URL>
<Author>Jeter5
<Content>I hope we're not disappointed! We enjoyed New Orleans...
<Date>Dec 19, 2008
<No. Reader>-1
<No. Helpful>-1
<rating>4
<Value>-1
<Rooms>3
<Location>5
<Cleanliness>3
<Check in / front desk>5
<Service>5
<Business service>5
<Author>...
repeat fields again...
Upvotes: 1
Views: 117
Reputation: 42736
Just set up another array, L, to track the count of items per file:
grep -P -o '(?<=<rating>).*' * |
awk -F: '{A[$1]+=$2;L[$1]++;next}END{for(i in A){print i,A[i],A[i]/L[i]}}' |
sort -nr -k2
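Since grep -P is not part of POSIX grep, the same sum-and-count idea can also be done in awk alone. Here is a minimal sketch, assuming every rating line starts with <rating> followed by a number; the function name avg_ratings is made up for illustration, and the output is sorted by the average (third column) rather than by the sum:

```shell
#!/bin/sh
# Sketch only: per-file sum and average of <rating> lines using awk alone.
# Assumption: each rating line looks exactly like "<rating>N" with numeric N.
# avg_ratings is a hypothetical helper name, not part of the original script.
avg_ratings() {
    # $1 is the folder that holds the myfile_*.dat files.
    # The subshell keeps the cd from affecting the caller.
    (
        cd "$1" || exit 1
        awk '
            /^<rating>/ {
                sub(/^<rating>/, "")   # strip the tag, leaving only the score
                sum[FILENAME] += $0    # running total per file
                cnt[FILENAME]++        # number of rating lines per file
            }
            END {
                # print: file name, sum, average
                for (f in sum)
                    printf "%s %s %.2f\n", f, sum[f], sum[f] / cnt[f]
            }
        ' myfile_*.dat | sort -k3,3nr
    )
}
```

Call it as avg_ratings my_folder. Files without any <rating> line never enter the arrays, so there is no division by zero.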
Upvotes: 2