user8861568

Sort a txt file, find duplicates, but also print the line numbers they were found on

Given a txt file, that has the following values:

123 
123 
234 
234 
123 
345    

I use

sort FILE | uniq -cd

to get the count of each duplicated value. But how can I also output the rows where each value was found?

Output:

123  3 0;1;4
234  2 2;3

The row count is zero based, thus the above numbers.
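
For reference, a minimal reproduction of the current pipeline (with the sample data written to FILE) shows that it reports the count of each duplicated value, but not the rows:

```shell
# Recreate the sample input and run the existing pipeline.
# uniq -cd prints only duplicated lines, each prefixed with its count.
printf '123\n123\n234\n234\n123\n345\n' > FILE
sort FILE | uniq -cd
```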

Upvotes: 1

Views: 75

Answers (3)

Cyrus

Reputation: 88601

awk '
{
  frequency[$1]++
  if (line[$1]=="")
  {
    line[$1]=NR-1
  }
  else
  {
    line[$1]=line[$1]";"NR-1
  }
}
END{
  for (j in frequency)
    if (frequency[j]>1)
      print j, frequency[j], line[j]
}' file

$1: content of first column

NR: current line number

Output:

234 2 2;3
123 3 0;1;4
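
Note that for (j in frequency) visits keys in an unspecified order, which is why the output above is not sorted; piping through sort restores the order. A sketch of the same approach, with the sample data written to file:

```shell
# Same logic in condensed form; sort -k1,1n orders by the first (numeric) column.
printf '123\n123\n234\n234\n123\n345\n' > file
awk '{ frequency[$1]++
       line[$1] = (line[$1] == "" ? NR-1 : line[$1] ";" NR-1) }
     END { for (j in frequency) if (frequency[j] > 1) print j, frequency[j], line[j] }' file |
sort -k1,1n
```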

Upvotes: 0

Paulo Scardine

Reputation: 77251

I know the question is tagged awk/sed, but for the sake of comparison, look how much more verbose the Python version is:

import sys

dictionary = {}
for i, line in enumerate(sys.stdin):
    dictionary.setdefault(line.strip(), []).append(str(i))

for value, line_numbers in dictionary.items():
    print(value, len(line_numbers), ";".join(line_numbers))

Testing:

$ python script.py < FILE
123 3 0;1;4
234 2 2;3
345 1 5
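
Unlike the desired output in the question, this also prints non-duplicates (the 345 1 5 line); a small filter on the count column removes them (a sketch, using awk purely as a column filter on the output shown above):

```shell
# Keep only rows whose second field (the count) is greater than 1.
printf '123 3 0;1;4\n234 2 2;3\n345 1 5\n' | awk '$2 > 1'
```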

Upvotes: 0

RomanPerekhrest

Reputation: 92854

awk solution:

awk '{ a[$1]=($1 in a? a[$1]";":"")(NR-1); cnt[$1]++ }
     END{ for(i in a) if(a[i]~/;/) { print i,cnt[i],a[i] } }' file
  • a[$1]=($1 in a? a[$1]";":"")(NR-1) - accumulates the row numbers (starting from 0) for each grouped value $1, concatenating multiple occurrences with ;

  • cnt[$1]++ - counts the number of occurrences of each value


The output:

123 3 0;1;4
234 2 2;3

Upvotes: 1
