July Elizabeth

Reputation: 159

How to print full name of the duplicate values from a text file?

I have a file similar to the following.

$ ls -1 *.ts | sort -V

media_w1805555829_b1344100_sleng_2197.ts
media_w1805555829_b1344100_sleng_2198.ts
media_w1805555829_b1344100_sleng_2199.ts
media_w1805555829_b1344100_sleng_2200.ts
media_w1501256294_b1344100_sleng_2199.ts
media_w1501256294_b1344100_sleng_2200.ts
media_w1501256294_b1344100_sleng_2201.ts
media_w1501256294_b1344100_sleng_2202.ts

This will print duplicate lines:

$ ls -1 *.ts | sort -V | grep -oP '.*_\K.*(?=\.ts)' | sort | uniq -d | sed 's/^/DUPLICATE---:> /'

DUPLICATE---:> 2199
DUPLICATE---:> 2200

I want the output:

DUPLICATE---:> media_w1805555829_b1344100_sleng_2199.ts
DUPLICATE---:> media_w1805555829_b1344100_sleng_2200.ts
DUPLICATE---:> media_w1501256294_b1344100_sleng_2199.ts
DUPLICATE---:> media_w1501256294_b1344100_sleng_2200.ts
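One way to get there with the tools already in the pipeline is to collect the duplicated suffixes first and then match the full names against them. A minimal sketch, using the example file names above (it recreates them in a temporary directory):

```shell
# Recreate the example files in a scratch directory.
cd "$(mktemp -d)"
touch media_w1805555829_b1344100_sleng_{2197..2200}.ts \
      media_w1501256294_b1344100_sleng_{2199..2202}.ts

# 1) Extract the numeric suffixes and keep only those seen more than once.
ls -1 *.ts | grep -oP '_\K[0-9]+(?=\.ts)' | sort | uniq -d > dups.txt

# 2) Turn each duplicated suffix into an anchored pattern (e.g. _2199\.ts$)
#    and print the full names that match it.
ls -1 *.ts | grep -f <(sed 's/^/_/; s/$/\\.ts$/' dups.txt) \
           | sed 's/^/DUPLICATE---:> /'
```

Anchoring the patterns with _ and \.ts$ avoids a suffix accidentally matching digits elsewhere in a name.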

Upvotes: 1

Views: 49

Answers (2)

Timur Shtatland

Reputation: 12465

Use this Perl one-liner:

ls -1 *.ts | perl -lne '
if (/_(\d+)\.ts$/) { $cnt{$1}++; push @files, [ $_, $1 ] }
END {
    for ( grep $cnt{$_->[1]} > 1, @files ) {
        print "DUPLICATE---:> $_->[0]"
    }
}'

This eliminates the need to sort.
The %cnt hash holds the count of each suffix (the part of the file name you want to find duplicates in). @files is an array of arrays: each element is an anonymous array with 2 elements, the file name and its suffix. Note that the push happens inside the match guard, so names without a numeric suffix are skipped rather than paired with a stale capture.
grep $cnt{$_->[1]} > 1, @files : the grep selects the elements of @files whose suffix occurs more than once.

The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.

SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches

Upvotes: 1

Raman Sailopal

Reputation: 12917

ls -1 *.ts | sort -V | awk -F'[_.]' '
    {
        map[$5] += 1
        map1[$5][$0]
    }
    END {
        for (i in map)
            if (map[i] > 1)
                for (j in map1[i])
                    print "DUPLICATE---:> " j
    }' | sort

As a one-liner:

ls -1 *.ts | sort -V | awk -F'[_.]' '{ map[$5]+=1; map1[$5][$0] } END { for (i in map) { if (map[i]>1) { for (j in map1[i]) { print "DUPLICATE---:> "j } } } }' | sort

Using awk, set the field separator to _ or . and create two arrays. The first (map) holds a count for each number in the file path. The second (map1) is a multidimensional array with the number as the first index and the complete line (file path) as the second. At the end we loop through map and check for any counts greater than one. For each such number we loop through the corresponding entries of map1 and print the lines (second index) along with the additional text. Finally we run the output through sort again to get the required ordering.
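The map1[$5][$0] construct relies on GNU awk's arrays of arrays, which plain POSIX awk lacks. A sketch of a portable variant that accumulates the full names per suffix in a single string instead (same example file names, recreated in a temporary directory):

```shell
# Recreate the example files in a scratch directory.
cd "$(mktemp -d)"
touch media_w1805555829_b1344100_sleng_{2197..2200}.ts \
      media_w1501256294_b1344100_sleng_{2199..2202}.ts

# Count each numeric suffix ($5 after splitting on _ and .) and collect the
# full names that carry it; at the end, print the names whose suffix repeats.
ls -1 *.ts | awk -F'[_.]' '
    { cnt[$5]++; names[$5] = names[$5] $0 "\n" }
    END {
        for (s in cnt)
            if (cnt[s] > 1) printf "%s", names[s]
    }' | sed 's/^/DUPLICATE---:> /' | sort
```

Storing the names newline-joined in an ordinary string array keeps the logic identical while working in mawk and BusyBox awk as well.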

Upvotes: 1
