How to find common parts in a paths with perl?

Question

Having several paths, like:

1: /abc/def/some/common/part/xyz/file1.ext
2: /other/path/to/7433/qwe/some/common/part/anotherfile.ext
3: /misc/path/7433/qwe/some/common/part/filexx.ext
4: /2443/totally/different/path/file9988.ext
5: /abc/another/same/path/to/ppp/thisfile.ext
6: /deep1/deep2/another/same/path/to/diffone/filename.ext

I need find the common parts - each possible ones, eg. in the above if possible to found common parts:

 /some/common/part/ - in the paths 1,2,3
 /another/same/path/to/ - in the 5,6
 /path/to/ - in the 2,5,6
 /path/ - 2,3,4,5,6

etc..

I simply absoulutely haven't any idea how to solve this - what approach is good one

string based - somewhat find common parts of a string
list based - splitting all path into lists and somewhat compare arrays for common elements
tree-graph - somewhat find a common parts of a graph
other?

When i get some direction how to solve this problem, I'm (probably) able code it myself - so don't want free programmming service - but need some guiding how to start.

I'm sure than here is already some CPAN module what could help me, but I'm really have'nt idea how to find the right useful module from the list of 30k modules for the above problem. :(

EDIT - For what i need this:

Having approx. 200k files, in 10k directories and many of them "belong together", like:

/u/some/path/project1/subprojct/file1
/u/backup/of/work/date/project1/subproject/file2
/u/backup_of_backup/of/work/date/project1/subproject/file2
/u/new/addtions/to/projec1/subproject/file3

The files are dirrerent kind (pdf, images, doc, txt and so on), several are identical (like above file2 - easy to filter with Digest::MD5), but the only way "group them together" is based on "common parts" of a path - e.g. "project1/subproject" and so on..

Another files HAS the same MD5, so can filter out duplicates, but they are in different trees, like

/u/path/some/file
/u/path/lastest_project/menu/file
/u/path/jquery/menu/file
/u/path/example/solution/jquery/menu/file

so, the files are the same, (identical md5) but need somewhat move one copy to the right place (and delete others) and need somewhat determine the "most used" common paths, and collect tags... (old path elements are tags)

The idea behind is:

if the same md5 files are mostly stored under some common path - I can make a decision where to move one copy...

And it is more complicated, but for explanation is enough the above ;)

Simply need lowering the entropy on my HDD ;)

dms · Accepted Answer

There is some discussion about finding the longest common consecutive substrings in this thread: http://www.nntp.perl.org/group/perl.fwp/2002/02/msg1662.html

The "winner" appears to be the following code, but there are a few other things in there you could try:

#!/usr/bin/perl
use strict;
use warnings;

sub lcs {

    my $this = shift;
    my $that = shift;

    my $str = join "\0", $this, $that;
    my $len = 1;
    my $lcs;
    while ($str =~ m{ ([^\0]{$len,}) (?= [^\0]* \0 [^\0]*? \1 ) }xg) {
        $lcs = $1;
        $len = 1 + length($1);
    }

    if ($len == 1) { print("No common substring
"); }
    else {
        print("Longest common substring of length $len: "");
        print("$lcs");
        print(""
");
    }
}

Keep in mind you would have to adjust it a little bit to account for the fact that you only want entire subdirectories that match... ie, change if ($len == 1) to something like if ($len == 1 or $lcs !~ /^// or $lcs !~ //$/)

You would also have to add some bookkeeping to keep track of which ones match. When I ran this code on your examples above it also found the /abc/ match in lines 1 & 5.

One thing that may or may not be a problem is that the following two lines:

/abc/another/same/path/to/ppp/thisfile.ext
/abc/another/different/path/to/ppp/otherfile.ext

Would match on:

/abc/another/

But not on:

/path/to/ppp/

But -- here's the bad news -- you will have to do O(n^2) comparisons with n=200,000 files. That could take an obscene amount of time.

Another solution would be to go through each path in your list, add all of its possible directory paths as keys to a hash and push the file itself to the hash (so that the value is an array of files that have this path in it). Something like this:

use strict;
use warnings;
my %links;

open my $fh, "<", 'filename' or die "Can't open $!";
while (my $line = <$fh>) {
    chomp($line);
    my @dirs = split ///, $line;
    for my $i (0..$#dirs) {
        if ($i == $#dirs) {
            push(@{ $links{$dirs[$i]} }, $line);
        }
        for my $j ($i+1..$#dirs) {
            push(@{ $links{join("/",@dirs[$i..$j])} }, $line);
            #PROCESS THIS if length of array is > 1
        }
    }
}

Of course, this would take an obscene amount of memory. With 200,000 files to process, you might have a hard time no matter what you try, but maybe you can break it up into more manageable chunks. Hopefully, this will give you a starting point.

How to find common parts in a paths with perl?

Answers (2)

Related Questions