alec

Reputation: 151

Delete line after nth occurrence of a value in a TSV

I have a TSV file that contains search phrases from different regions of the world. The phrases are grouped by region and sorted by descending frequency.

The third column is the region in which the web search was made (e.g. US_VA == Virginia, USA).

The second column represents the actual search phrase.

The first column represents the number of times the phrase was searched in that region.

10  shoes   US_MA
9   boot    US_MA
4   coat    US_MA
12  hat US_TX
20  bathing suit    US_CA
18  shorts  US_CA
15  t shirt US_CA
10  sandals US_CA

In a bash script, I'd like to trim the file down so that it contains only the top two most popular searches for each region.

For example, the output should look something like:

10  shoes   US_MA
9   boot    US_MA
12  hat US_TX
20  bathing suit    US_CA
18  shorts  US_CA

I figure that the solution involves some awk but I can't quite figure it out.

Upvotes: 0

Views: 44

Answers (2)

stack0114106

Reputation: 8781

Another awk approach:

awk ' {c=$NF; if(p!=c) { print ;t=1 } else { if(t<2) print ;t++ } p=c } ' file

With the given input:

$ cat alec.txt
10  shoes   US_MA
9   boot    US_MA
4   coat    US_MA
12  hat US_TX
20  bathing suit    US_CA
18  shorts  US_CA
15  t shirt US_CA
10  sandals US_CA

$ awk ' {c=$NF; if(p!=c) { print ;t=1 } else { if(t<2) print ;t++ } p=c } ' alec.txt
10  shoes   US_MA
9   boot    US_MA
12  hat US_TX
20  bathing suit    US_CA
18  shorts  US_CA

$
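For readers who find the one-liner dense, here is a minimal sketch of the same logic spread over several lines with comments. The function name `top2_grouped` and the variable names are only illustrative. Like the one-liner, it assumes that all rows of a region sit next to each other in the file.

```shell
# Hypothetical helper: same logic as the one-liner above, just spelled out.
# Assumes rows of the same region are contiguous in the input file ($1).
top2_grouped() {
  awk '
    {
      region = $NF              # last whitespace-separated field
      if (region != prev) {     # first row of a new region block
        print
        seen = 1
      } else {
        if (seen < 2) print     # second row of the current block
        seen++
      }
      prev = region
    }
  ' "$1"
}
```

Calling `top2_grouped alec.txt` should print the same five lines as the session above.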

Upvotes: 0

glenn jackman

Reputation: 247172

The answer is surprisingly tiny:

awk '++count[$NF] < 3' file.tsv

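count[$NF] tallies the lines seen so far for each region, and a line is printed only while that tally is below 3. This relies on the rows within each region being sorted by descending frequency, as described in the question.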
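If the input were not already sorted, one option would be to sort it first and then apply the same counting filter. The sketch below assumes the columns are separated by real tab characters and that the region is always the third column; the function name `top_n_per_region` is made up for illustration. Note that the output comes back ordered by region name rather than in the original file order.

```shell
# Hypothetical helper. Assumption: tab-separated columns (count, phrase, region).
# $1 = max rows to keep per region, $2 = path to the TSV file
top_n_per_region() {
  tab=$(printf '\t')
  # Sort by region (column 3), then by count (column 1), numeric descending.
  sort -t "$tab" -k3,3 -k1,1nr "$2" |
    awk -F '\t' -v limit="$1" '++count[$3] <= limit'
}
```

Usage would be something like `top_n_per_region 2 file.tsv > trimmed.tsv`.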

To send the limit as a parameter:

n=2
awk -v limit="$n" '++count[$NF] <= limit' file.tsv

Upvotes: 4
