Reputation: 151
I have a TSV file that contains search phrases from different regions of the world. The phrases are grouped by region and sorted by descending frequency.
The first column is the number of times the phrase was searched in that region.
The second column is the actual search phrase.
The third column is the region that the web search was made in (e.g. US_VA == Virginia, USA).
10 shoes US_MA
9 boot US_MA
4 coat US_MA
12 hat US_TX
20 bathing suit US_CA
18 shorts US_CA
15 t shirt US_CA
10 sandals US_CA
In a bash script, I'd like to trim the file down so that it contains only the top two most popular searches for each region.
For example, the output should be something like:
10 shoes US_MA
9 boot US_MA
12 hat US_TX
20 bathing suit US_CA
18 shorts US_CA
I figure the solution involves some awk, but I can't quite work it out.
Upvotes: 0
Views: 44
Reputation: 8781
Another awk approach:
awk '{c=$NF; if (p != c) {print; t=1} else {if (t < 2) print; t++} p=c}' file
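The state machine may be easier to follow spelled out across lines (same logic, just reformatted with comments):
awk '{
    c = $NF               # region code is the last field
    if (p != c) {         # region changed: always print its first row
        print; t = 1
    } else {
        if (t < 2) print  # same region: print until two rows have been emitted
        t++
    }
    p = c                 # remember the region for the next row
}' file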
With the given input:
$ cat alec.txt
10 shoes US_MA
9 boot US_MA
4 coat US_MA
12 hat US_TX
20 bathing suit US_CA
18 shorts US_CA
15 t shirt US_CA
10 sandals US_CA
$ awk '{c=$NF; if (p != c) {print; t=1} else {if (t < 2) print; t++} p=c}' alec.txt
10 shoes US_MA
9 boot US_MA
12 hat US_TX
20 bathing suit US_CA
18 shorts US_CA
$
Upvotes: 0
Reputation: 247172
The answer is surprisingly tiny:
awk '++count[$NF] < 3' file.tsv
This relies on the rows within each region appearing in descending order of frequency, as described. Since count[] is keyed by the region field, each region is counted independently, so the regions don't even have to be grouped together.
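If the condensed form looks cryptic: in awk, a bare condition is a pattern with no action, and the default action for a matching line is to print it. An equivalent long form of the same one-liner:
awk '{
    count[$NF]++          # rows seen so far for this region
    if (count[$NF] < 3)   # keep only the first two rows per region
        print
}' file.tsv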
To pass the limit in as a parameter:
n=2
awk -v limit="$n" '++count[$NF] <= limit' file.tsv
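For example, assuming file.tsv holds the sample rows from the question, n=1 keeps just the single most popular search per region:
n=1
awk -v limit="$n" '++count[$NF] <= limit' file.tsv
which should print:
10 shoes US_MA
12 hat US_TX
20 bathing suit US_CA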
Upvotes: 4