Reputation: 568
I have data formatted like this:
1;string1
2;string2
...
n;stringn
The first column is an id-number and the second contains a text string. The strings may contain numbers, letters, and characters such as /.()?!. Id-numbers are equal to line numbers. I am trying to find duplicates among these text strings, and I am looking to get information like this:
String of id 1 is duplicated on lines/ids 4,6,7
String of id 2 is duplicated on lines/ids 11,25
So far I have done this with an Awk command:
awk '/String of text/ {print FNR}' targetfile
And manually replaced the search string for each text string in my file. Now that the data sets are larger, this is getting impractical. Can my Awk command be improved so that it automatically tests each text string in the file against the others and outputs the information I am seeking? I thought of using a for-loop for this, but could not figure out how to make it work.
I could use a tool other than Awk just as well, if there is a better solution. My system is Ubuntu 14.04.
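(For what it's worth, the manual approach above can be automated directly in the shell. This is only a rough sketch of the idea, assuming Bash and a placeholder filename data.txt; it rescans the file once per duplicated string, so it scales worse than a single-pass Awk solution:

# list each string that occurs more than once, then look up its line
# numbers the same way the manual grep/awk approach did.
# assumes strings contain no newlines; ids equal line numbers.
cut -d';' -f2- data.txt | sort | uniq -d | while IFS= read -r s; do
    ids=$(grep -nFx -e "$s" <(cut -d';' -f2- data.txt) | cut -d: -f1 | paste -sd' ')
    printf '%s is duplicated on lines %s\n' "$s" "$ids"
done
)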
Upvotes: 0
Views: 29
Reputation: 44023
Put this (explanation in the comments):
{ seen[$2] = seen[$2] $1 " " }                  # remember where you saw each string,
                                                # as a string of line numbers

END {                                           # in the end,
    for (s in seen) {                           # for all strings you saw,
        n = split(seen[s], nums, " ")           # split apart the line numbers again
        if (n > 1) {                            # if you saw it more than once,
            line = s " is duplicated on lines"  # build the output line
            for (i = 1; i <= n; ++i) {          # with all the line numbers where
                line = line " " nums[i]         # you saw it,
            }
            print line                          # and print the line
        }
    }
}
into a file, say foo.awk, and run awk -F \; -f foo.awk filename
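For example, with a small hypothetical test file (note that the script prints the duplicated string itself rather than its first id, and that for (s in seen) visits the strings in an unspecified order, so the output lines may come out in any order):

$ cat data.txt
1;foo
2;bar
3;foo
4;baz
5;bar
$ awk -F \; -f foo.awk data.txt
foo is duplicated on lines 1 3
bar is duplicated on lines 2 5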
You can also put it on one line like this:
awk -F \; '{ seen[$2] = seen[$2] $1 " " } END { for (s in seen) { n = split(seen[s], nums, " "); if (n > 1) { line = s " is duplicated on lines"; for (i = 1; i <= n; ++i) { line = line " " nums[i] } print line } } }' filename
...but it's long enough that I'd use a file instead.
Upvotes: 1