theGreenCabbage

Reputation: 4845

Output the maximum duplicated lines in bash

I have an output of the following format in bash, from a script I wrote that returns the number of duplicate file names and the file name itself within a particular directory.

 19 prob561493
 19 prob564972
 19 prob561564
 11 prob561965
  8 prob562172
  7 prob564449
  6 prob564155
  6 prob562925
  6 prob562739

Using output | head -n1, I can get the first entry of the above output, 19 prob561493. However, I also want to print the other problems that share the same maximum number of duplicates, so in this case the final output should look like this:

  19 prob561493
  19 prob564972
  19 prob561564

I tried cut -d" " | uniq -c to first extract the count from the output and then show only the unique results, but that returned ALL of the duplicate results.

How can I print only the duplicated maximum duplication lines?

Upvotes: 0

Views: 291

Answers (5)

ghoti

Reputation: 46856

You asked how to do this in bash. I have to say that awk may provide the clearest method to achieve what you want:

awk 'NR==1{n=$1} $1==n{print;next} {exit}'

This takes the count from the first field of the first line, prints every line whose first field matches it, and exits as soon as the field stops matching. It assumes the input is already sorted with the largest count first.
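For example, with the listing from the question piped straight in (the script name below is just a placeholder), it prints only the tied top entries:

# Hypothetical usage; "your_script" stands in for whatever produces the
# sorted count/name listing shown in the question.
your_script | awk 'NR==1{n=$1} $1==n{print;next} {exit}'
# Expected output for the sample data:
#  19 prob561493
#  19 prob564972
#  19 prob561564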

But the task can still be handled in bash (or even just shell) alone, without spawning extra commands or subshells.

#!/bin/sh

lastcount=""
while read -r count data; do
  # Stop as soon as the count differs from the first (maximum) count seen
  if [ -n "$lastcount" ] && [ "$count" != "$lastcount" ]; then
    break
  fi
  printf "%3d %s\n" "$count" "$data"
  lastcount=$count
done
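A sketch of how it might be run, assuming the loop above is saved as maxdup.sh (a made-up file name) and fed the same listing:

# Hypothetical invocation; maxdup.sh and your_script are assumed names.
your_script | sh maxdup.sh
# Expected output for the sample data:
#  19 prob561493
#  19 prob564972
#  19 prob561564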

There are zillions of ways you can achieve this.

Upvotes: 1

Nir Alfasi

Reputation: 53535

Use awk to extract the '19', then grep with a regex to get the lines containing 19 as a whole word (\b is a word boundary). Assuming your file name is "output":

grep -E "$(head -n1 output | awk '{print $1}')\b" output
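For the sample listing, the command substitution expands to 19, so the call above is equivalent to the following (assuming the listing is saved in a file named output):

# Equivalent command once $(...) has expanded, with the sample data in "output":
grep -E "19\b" output
# Expected output:
#  19 prob561493
#  19 prob564972
#  19 prob561564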

Upvotes: 0

anubhava

Reputation: 785316

You can use this awk:

awk 'NR==FNR{if ($1>max) max=$1; next} $1==max' file file
19 prob561493
19 prob564972
19 prob561564

In the 1st pass we get the maximum value of $1 and store it in the variable max; in the 2nd pass we print all the records whose first field equals max.
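Since the data here comes from a script rather than an existing file, it would need to be written to a file first so awk can read it twice; a sketch, with made-up names:

# Hypothetical usage; "your_script" and counts.txt are assumed names.
your_script > counts.txt
awk 'NR==FNR{if ($1>max) max=$1; next} $1==max' counts.txt counts.txt

Unlike the approaches that assume sorted input, this two-pass version also works when the listing is not sorted.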

Upvotes: 0

Pankrates

Reputation: 3094

Assuming the file is sorted numerically on the first column, you can use awk for this in the following way:

awk 'NR==1 {max=$1} {if($1==max){print $0}}'

This grabs the first field of the first line and stores it in the variable max; only lines whose first field matches this number are printed.
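Piped directly from the script's output, it would look like this (the script name is a placeholder); note that this variant reads the whole input rather than stopping at the first non-matching line, which is harmless for a listing of this size:

# Hypothetical usage; "your_script" stands in for the command producing the listing.
your_script | awk 'NR==1 {max=$1} {if($1==max){print $0}}'
# Expected output for the sample data:
#  19 prob561493
#  19 prob564972
#  19 prob561564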

Upvotes: 1

gturri

Reputation: 14609

You may first retrieve the maximum number of occurrences, and then grep the file for it:

NB=$(head -n1 error.dat | awk '{print $1}')
egrep "^ *$NB " error.dat

Here egrep means that grep should interpret the pattern as an extended regex; ^ represents the beginning of a line, and the " *" allows for the leading blanks that pad the count.
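With the sample listing saved as error.dat, NB would expand to 19, so the second command reduces to:

# Equivalent command once $NB has expanded (sample data assumed to be in error.dat):
egrep "^ *19 " error.dat
# Expected output:
#  19 prob561493
#  19 prob564972
#  19 prob561564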

Upvotes: 0
