Madoc Comadrin

Reputation: 568

Finding location of duplicates from text

I have data formatted like this:

1;string1
2;string2
...
n;stringn

The first column is an id number and the second contains a text string. Text strings may contain numbers, letters, and characters such as /.()?!. Id numbers are equal to line numbers. I am trying to find duplicates among these text strings. I am looking to get output like this:

String of id 1 is duplicated on lines/ids 4,6,7
String of id 2 is duplicated on lines/ids 11,25

So far I have done this using Awk command:

awk '/String of text/ {print FNR}' targetfile

and manually replaced the search string for each text string in my file. As the data sets are now larger, this is getting impractical. Can my Awk command be improved so that it automatically tests each text string in the file against the other strings and outputs the information I am seeking? I thought of using a for loop for this, but could not figure out how to make it work.

I could just as well use a tool other than Awk, if there is a better solution. My system is Ubuntu 14.04.

Upvotes: 0

Views: 29

Answers (1)

Wintermute

Reputation: 44023

Put this (explanation in the comments):

{ seen[$2] = seen[$2] $1 " " }               # remember where you saw strings
                                             # as string of numbers

END {                                        # in the end
  for(s in seen) {                           # for all strings you saw
    split(seen[s], nums, " ");               # split apart the line numbers again

    if(length(nums) > 1) {                   # if you saw it more than once
      line = s " is duplicated on lines";    # build the output line
      for(i = 1; i <= length(nums); ++i) {   # with all the line numbers where you 
        line = line " " nums[i]              # saw it
      }
      print line                             # and print the line
    }
  }
}

into a file, say foo.awk, and run awk -F \; -f foo.awk filename

You can also put it on one line like this:

awk -F \; '{ seen[$2] = seen[$2] $1 " " } END { for(s in seen) { split(seen[s], nums, " "); if(length(nums) > 1) { line = s " is duplicated on lines"; for(i = 1; i <= length(nums); ++i) { line = line " " nums[i] } print line } } }' filename

...but it's long enough that I'd use a file instead.
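To illustrate, here is the one-liner run against a small sample file (the file name `data.txt` and its contents are made up for the demonstration); strings seen only once produce no output:

```shell
# Build a small test file in the question's id;string format.
cat > data.txt <<'EOF'
1;foo
2;bar
3;foo
4;baz
5;bar
6;foo
EOF

# Collect the line numbers for each string, then report strings seen
# more than once. Note: length(array) works in gawk and mawk; with a
# strictly POSIX awk you would keep the count returned by split().
awk -F \; '{ seen[$2] = seen[$2] $1 " " }
END {
  for (s in seen) {
    split(seen[s], nums, " ")
    if (length(nums) > 1) {
      line = s " is duplicated on lines"
      for (i = 1; i <= length(nums); ++i) line = line " " nums[i]
      print line
    }
  }
}' data.txt
```

This prints lines such as `foo is duplicated on lines 1 3 6` and `bar is duplicated on lines 2 5` (the order of the report lines is unspecified, since `for (s in seen)` iterates in arbitrary order); `baz` appears only once and is not reported.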

Upvotes: 1
