July Elizabeth

Reputation: 159

How to print full name of the duplicate values from a text file?

I have a file similar to the following.

$ ls -1 *.ts | sort -V

media_w1805555829_b1344100_sleng_2197.ts
media_w1805555829_b1344100_sleng_2198.ts
media_w1805555829_b1344100_sleng_2199.ts
media_w1805555829_b1344100_sleng_2200.ts
media_w1501256294_b1344100_sleng_2199.ts
media_w1501256294_b1344100_sleng_2200.ts
media_w1501256294_b1344100_sleng_2201.ts
media_w1501256294_b1344100_sleng_2202.ts

This will print duplicate lines:

$ ls -1 *.ts | sort -V | grep -oP '.*_\K.*(?=\.ts)' | sort | uniq -d | sed 's/^/DUPLICATE---:> /'

DUPLICATE---:> 2199
DUPLICATE---:> 2200

I want the output:

DUPLICATE---:> media_w1805555829_b1344100_sleng_2199.ts
DUPLICATE---:> media_w1805555829_b1344100_sleng_2200.ts
DUPLICATE---:> media_w1501256294_b1344100_sleng_2199.ts
DUPLICATE---:> media_w1501256294_b1344100_sleng_2200.ts
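One way to get there with the tools already in the pipeline is to collect the duplicated suffixes first and then match the full names against them. A minimal sketch, using the example file names above (it recreates them in a temporary directory):

```shell
# Recreate the example files in a scratch directory.
cd "$(mktemp -d)"
touch media_w1805555829_b1344100_sleng_{2197..2200}.ts \
      media_w1501256294_b1344100_sleng_{2199..2202}.ts

# 1) Extract the numeric suffixes and keep only those seen more than once.
ls -1 *.ts | grep -oP '_\K[0-9]+(?=\.ts)' | sort | uniq -d > dups.txt

# 2) Turn each duplicated suffix into an anchored pattern (e.g. _2199\.ts$)
#    and print the full names that match it.
ls -1 *.ts | grep -f <(sed 's/^/_/; s/$/\\.ts$/' dups.txt) \
           | sed 's/^/DUPLICATE---:> /'
```

Anchoring the patterns with _ and \.ts$ avoids a suffix accidentally matching digits elsewhere in a name.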

Upvotes: 1

Views: 49

Answers (2)

Timur Shtatland

Reputation: 12465

Use this Perl one-liner:

ls -1 *.ts | perl -lne '
if (/_(\d+)\.ts$/) { $cnt{$1}++; push @files, [ $_, $1 ] }
END {
    for ( grep $cnt{$_->[1]} > 1, @files ) {
        print "DUPLICATE---:> $_->[0]"
    }
}'

This eliminates the need to sort.
The %cnt hash holds the count of each suffix (the part of the file name you want to find duplicates in). @files is an array of arrays: each element is an anonymous array with 2 elements, the file name and its suffix. Note that the push happens inside the match guard, so names without a numeric suffix are skipped rather than paired with a stale capture.
grep $cnt{$_->[1]} > 1, @files : the grep selects the elements of @files whose suffix occurs more than once.

The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.

SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches

Upvotes: 1

Raman Sailopal

Reputation: 12917

ls -1 *.ts | sort -V | awk -F'[_.]' '
    {
        map[$5] += 1
        map1[$5][$0]
    }
    END {
        for (i in map)
            if (map[i] > 1)
                for (j in map1[i])
                    print "DUPLICATE---:> " j
    }' | sort

As a one-liner:

ls -1 *.ts | sort -V | awk -F'[_.]' '{ map[$5]+=1; map1[$5][$0] } END { for (i in map) { if (map[i]>1) { for (j in map1[i]) { print "DUPLICATE---:> "j } } } }' | sort

Using awk, set the field separator to _ or . and create two arrays. The first (map) holds a count for each number in the file path. The second (map1) is a multidimensional array with the number as the first index and the complete line (file path) as the second. At the end we loop through map and check for any counts greater than one. For each such number we loop through the corresponding entries of map1 and print the lines (second index) along with the additional text. Finally we run the output through sort again to get the required ordering.
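The map1[$5][$0] construct relies on GNU awk's arrays of arrays, which plain POSIX awk lacks. A sketch of a portable variant that accumulates the full names per suffix in a single string instead (same example file names, recreated in a temporary directory):

```shell
# Recreate the example files in a scratch directory.
cd "$(mktemp -d)"
touch media_w1805555829_b1344100_sleng_{2197..2200}.ts \
      media_w1501256294_b1344100_sleng_{2199..2202}.ts

# Count each numeric suffix ($5 after splitting on _ and .) and collect the
# full names that carry it; at the end, print the names whose suffix repeats.
ls -1 *.ts | awk -F'[_.]' '
    { cnt[$5]++; names[$5] = names[$5] $0 "\n" }
    END {
        for (s in cnt)
            if (cnt[s] > 1) printf "%s", names[s]
    }' | sed 's/^/DUPLICATE---:> /' | sort
```

Storing the names newline-joined in an ordinary string array keeps the logic identical while working in mawk and BusyBox awk as well.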

Upvotes: 1
