Reputation: 151
I have a TSV file that contains search phrases from different regions of the world. The phrases are grouped by region and sorted by descending frequency.
The first column is the number of times the phrase was searched in that region.
The second column is the actual search phrase.
The third column is the region that the web search was made in (e.g. US_VA == Virginia, USA).
10 shoes US_MA
9 boot US_MA
4 coat US_MA
12 hat US_TX
20 bathing suit US_CA
18 shorts US_CA
15 t shirt US_CA
10 sandals US_CA
In a bash script, I'd like to trim the file down so that it contains only the top two most popular searches for each region.
For example, the output should be something like:
10 shoes US_MA
9 boot US_MA
12 hat US_TX
20 bathing suit US_CA
18 shorts US_CA
I figure the solution involves some awk, but I can't quite work it out.
Upvotes: 0
Views: 44
Reputation: 8781
Another awk approach:
awk '{c=$NF; if (p != c) {print; t=1} else {if (t < 2) print; t++} p=c}' file
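The state machine may be easier to follow spelled out across lines (same logic, just reformatted with comments):
awk '{
    c = $NF               # region code is the last field
    if (p != c) {         # region changed: always print its first row
        print; t = 1
    } else {
        if (t < 2) print  # same region: print until two rows have been emitted
        t++
    }
    p = c                 # remember the region for the next row
}' file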
With the given input:
$ cat alec.txt
10 shoes US_MA
9 boot US_MA
4 coat US_MA
12 hat US_TX
20 bathing suit US_CA
18 shorts US_CA
15 t shirt US_CA
10 sandals US_CA
$ awk '{c=$NF; if (p != c) {print; t=1} else {if (t < 2) print; t++} p=c}' alec.txt
10 shoes US_MA
9 boot US_MA
12 hat US_TX
20 bathing suit US_CA
18 shorts US_CA
$
Upvotes: 0
Reputation: 247172
The answer is surprisingly tiny:
awk '++count[$NF] < 3' file.tsv
This relies on the rows within each region appearing in descending order of frequency, as described. Since count[] is keyed by the region field, each region is counted independently, so the regions don't even have to be grouped together.
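If the condensed form looks cryptic: in awk, a bare condition is a pattern with no action, and the default action for a matching line is to print it. An equivalent long form of the same one-liner:
awk '{
    count[$NF]++          # rows seen so far for this region
    if (count[$NF] < 3)   # keep only the first two rows per region
        print
}' file.tsv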
To pass the limit in as a parameter:
n=2
awk -v limit="$n" '++count[$NF] <= limit' file.tsv
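For example, assuming file.tsv holds the sample rows from the question, n=1 keeps just the single most popular search per region:
n=1
awk -v limit="$n" '++count[$NF] <= limit' file.tsv
which should print:
10 shoes US_MA
12 hat US_TX
20 bathing suit US_CA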
Upvotes: 4