user839145

Reputation: 7381

Find duplicate lines in a file and count how many times each line was duplicated?

Suppose I have a file similar to the following:

123 
123 
234 
234 
123 
345

I would like to find how many times '123' was duplicated, how many times '234' was duplicated, etc. So ideally, the output would be like:

123  3 
234  2 
345  1

Upvotes: 731

Views: 791831

Answers (8)

jubilatious1

Reputation: 2341

Using Raku (formerly known as Perl_6)

awk-like syntax:

~$ raku -ne 'BEGIN my %dups; %dups{$_}++; END for %dups.kv -> $k,$v {put $k => $v};'  file

#OR:

~$ raku -ne 'BEGIN my %dups; %dups{$_}++; END put .key => .value for %dups;'  file    

Above are answers written in Raku, a member of the Perl family of programming languages. Raku is Unicode-ready by default and features a clean regex syntax.

Here an awk-like syntax is invoked with the -ne (non-autoprinting, linewise) command-line flags. We BEGIN by declaring the %dups hash. Each line gets loaded into the $_ topic variable, so in the body of the main loop %dups{$_}++ adds each line to the hash as a key, with the (post-incremented) number of times it has been seen as the value. At the END, after all lines have been read, the %dups hash is output as tab-separated key/value pairs.

Sample Input:

123 
123 
234 
234 
123 
345

Sample Output:

123 3
345 1
234 2

NOTE 1: The question asks for duplicates, but the OP's example output includes lines that have only been seen once (hence not actually duplicated). If you really want duplicates only, add a clause such as if $v > 1 (appended to the end of the first answer above) or if .value > 1 (inserted in the middle of the second answer above).
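For example, a minimal sketch of the duplicates-only variant (the first one-liner above with the if $v > 1 clause appended):

~$ raku -ne 'BEGIN my %dups; %dups{$_}++; END for %dups.kv -> $k,$v {put $k => $v if $v > 1};'  file

For the sample input this prints only the 123 and 234 lines with their counts (hash order may vary).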

NOTE 2: Sometimes text lines can be a little sloppy (leading/trailing whitespace, varying capitalization), yet you still want them counted under one key. In that case you can clean up the text using various Raku routines, for example %dups{$_.trim.lc}++ in the main body of the loop.
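A sketch of that normalized variant, with the clean-up applied to the key inside the same one-liner (whether trim/lc is appropriate depends on your data):

~$ raku -ne 'BEGIN my %dups; %dups{$_.trim.lc}++; END for %dups.kv -> $k,$v {put $k => $v};'  file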

https://raku.org

Upvotes: 0

vineel

Reputation: 3683

On Windows, using Windows PowerShell, I used the command below to achieve this:

Get-Content .\file.txt | Group-Object | Select Name, Count

Also, we can use the Where-Object cmdlet to filter the result to duplicate lines only:

Get-Content .\file.txt | Group-Object | Where-Object { $_.Count -gt 1 } | Select Name, Count

Upvotes: 19

Mohammed Nazim

Reputation: 203

To find duplicate counts with each line printed before its count (matching the OP's desired format), use this command:

sort filename | uniq -c | awk '{print $2, $1}'

Upvotes: 19

Andrea

Reputation: 12964

This will print duplicate lines only, with counts:

sort FILE | uniq -cd

or, with GNU long options (on Linux):

sort FILE | uniq --count --repeated

On BSD and OS X you have to use grep to filter out unique lines:

sort FILE | uniq -c | grep -v '^ *1 '

For the given example, the result would be:

  3 123
  2 234

If you want to print counts for all lines including those that appear only once:

sort FILE | uniq -c

or, with GNU long options (on Linux):

sort FILE | uniq --count

For the given input, the output is:

  3 123
  2 234
  1 345

In order to sort the output with the most frequent lines on top, you can do the following (to get all results):

sort FILE | uniq -c | sort -nr

or, to get only duplicate lines, most frequent first:

sort FILE | uniq -cd | sort -nr

On OS X and BSD, the final one becomes:

sort FILE | uniq -c | grep -v '^ *1 ' | sort -nr

Upvotes: 587

αғsнιη

Reputation: 2771

Via awk:

awk '{dups[$1]++} END{for (num in dups) {print num,dups[num]}}' data

In the awk command dups[$1]++, the variable $1 holds the entire contents of column 1, and the square brackets are array access. So, for the first column of each line in the data file, the node of the array named dups is incremented.

At the end, we loop over the dups array with num as the variable and print the saved numbers first, followed by their duplicate counts from dups[num].

Note that your input file has trailing spaces on some lines; if you clean those up, you can use $0 in place of $1 in the command above :)
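For example, a sketch that does the clean-up inside awk itself, stripping trailing whitespace and then counting whole lines via $0 (not the exact command from this answer):

awk '{sub(/[ \t]+$/, ""); dups[$0]++} END{for (num in dups) print num, dups[num]}' data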

Upvotes: 39

wonk0

Reputation: 13982

Assuming there is one number per line:

sort <file> | uniq -c

You can use the more verbose --count flag too with the GNU version, e.g., on Linux:

sort <file> | uniq --count

Upvotes: 1032

kenorb

Reputation: 166795

To find and count duplicate lines in multiple files, you can try the following command:

sort <files> | uniq -c | sort -nr

or:

cat <files> | sort | uniq -c | sort -nr

Upvotes: 77

Marc B

Reputation: 360792

Assuming you've got access to a standard Unix shell and/or cygwin environment:

tr -s ' ' '\n' < yourfile | sort | uniq -d -c
       ^--space char

Basically: convert all space characters to line breaks, then sort the translated output and feed it to uniq to count the duplicate lines.

Upvotes: 7
