kobame

Reputation: 5856

How to find common parts in paths with Perl?

Having several paths, like:

1: /abc/def/some/common/part/xyz/file1.ext
2: /other/path/to/7433/qwe/some/common/part/anotherfile.ext
3: /misc/path/7433/qwe/some/common/part/filexx.ext
4: /2443/totally/different/path/file9988.ext
5: /abc/another/same/path/to/ppp/thisfile.ext
6: /deep1/deep2/another/same/path/to/diffone/filename.ext

I need to find the common parts - every possible one, e.g. in the above the common parts that can be found are:

 /some/common/part/ - in paths 1,2,3
 /another/same/path/to/ - in 5,6
 /path/to/ - in 2,5,6
 /path/ - in 2,3,4,5,6

etc.

I simply have absolutely no idea how to solve this - what approach would be a good one?

Once I get some direction on how to solve this problem, I'm (probably) able to code it myself - so I don't want a free programming service - but I need some guidance on how to start.

I'm sure there is already some CPAN module that could help me, but I really have no idea how to find the right module for the above problem in the list of 30k modules. :(

EDIT - What I need this for:

Having approx. 200k files, in 10k directories and many of them "belong together", like:

/u/some/path/project1/subprojct/file1
/u/backup/of/work/date/project1/subproject/file2
/u/backup_of_backup/of/work/date/project1/subproject/file2
/u/new/addtions/to/projec1/subproject/file3

The files are of different kinds (pdf, images, doc, txt and so on), several are identical (like file2 above - easy to filter with Digest::MD5), but the only way to "group them together" is based on the "common parts" of a path - e.g. "project1/subproject" and so on.

Other files have the same MD5, so I can filter out the duplicates, but they are in different trees, like

/u/path/some/file
/u/path/lastest_project/menu/file
/u/path/jquery/menu/file
/u/path/example/solution/jquery/menu/file

so the files are the same (identical MD5), but I need to somehow move one copy to the right place (and delete the others), and somehow determine the "most used" common paths and collect tags... (old path elements are tags)
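The duplicate-filtering step mentioned above could look like this minimal sketch. Real code would digest file contents with Digest::MD5->new->addfile($fh); here a hand-built path => content hash stands in so the idea is self-contained, and the paths are taken from the example above:

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Stand-in for reading real files: path => content.
my %content_of = (
    '/u/path/some/file'                 => 'same bytes',
    '/u/path/lastest_project/menu/file' => 'same bytes',
    '/u/path/jquery/menu/file'          => 'same bytes',
    '/u/other/unique/file'              => 'different bytes',
);

# Group paths by the digest of their content.
my %by_digest;
for my $path (sort keys %content_of) {
    push @{ $by_digest{ md5_hex( $content_of{$path} ) } }, $path;
}

# Any group with more than one path is a set of identical files.
for my $digest (sort keys %by_digest) {
    my @same = @{ $by_digest{$digest} };
    print "duplicates: @same\n" if @same > 1;
}
```

With real files you would open each one, binmode the handle, and use Digest::MD5->new->addfile($fh)->hexdigest instead of md5_hex on a string.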

The idea behind it is more complicated, but for explanation the above is enough ;)

Simply put, I need to lower the entropy on my HDD ;)

Upvotes: 4

Views: 845

Answers (2)

shawnhcorey

Reputation: 3601

To solve this problem, you need the correct data structure. A hash that counts the partial paths works well:

use strict;
use warnings;
use File::Spec;

my %Count_of;

while( my $line = <DATA> ){
  chomp $line;
  my @names = File::Spec->splitdir( $line );

  # remove the file name
  pop @names;

  # if absolute path, remove the empty name at the start
  shift @names while @names && !length $names[0];

  # don't count blank lines
  next unless @names;

  # move two cursors through the names,
  # and count the partial paths
  # created from one to the other
  for my $i ( 0 .. $#names ){
    for my $j ( $i .. $#names ){
      my $partial_path = File::Spec->catdir( @names[ $i .. $j ] );
      $Count_of{ $partial_path } ++;
    }
  }
}

# now display the results
for my $path ( sort { $Count_of{$b} <=> $Count_of{$a} || $a cmp $b } keys %Count_of ){

  # skip if singleton.
  next if $Count_of{ $path } <= 1;

  printf "%3d : %s\n", $Count_of{ $path }, $path;
}


__DATA__
/abc/def/some/common/part/xyz/file1.ext
/other/path/to/7433/qwe/some/common/part/anotherfile.ext
/misc/path/7433/qwe/some/common/part/filexx.ext
/2443/totally/different/path/file9988.ext
/abc/another/same/path/to/ppp/thisfile.ext
/deep1/deep2/another/same/path/to/diffone/filename.ext

Upvotes: 2

dms

Reputation: 817

There is some discussion about finding the longest common consecutive substrings in this thread: http://www.nntp.perl.org/group/perl.fwp/2002/02/msg1662.html

The "winner" appears to be the following code, but there are a few other things in there you could try:

#!/usr/bin/perl
use strict;
use warnings;

sub lcs {

    my $this = shift;
    my $that = shift;

    my $str = join "\0", $this, $that;
    my $len = 1;
    my $lcs;
    while ($str =~ m{ ([^\0]{$len,}) (?= [^\0]* \0 [^\0]*? \1 ) }xg) {
        $lcs = $1;
        $len = 1 + length($1);
    }

    if ($len == 1) { print("No common substring\n"); }
    else {
        print("Longest common substring of length $len: \"");
        print("$lcs");
        print("\"\n");
    }
}

Keep in mind you would have to adjust it a little bit to account for the fact that you only want matches on entire subdirectories - i.e., change if ($len == 1) to something like if ($len == 1 or $lcs !~ /^\// or $lcs !~ /\/$/)

You would also have to add some bookkeeping to keep track of which ones match. When I ran this code on your examples above it also found the /abc/ match in lines 1 & 5.
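Applied to the routine above, the suggested adjustment might look like this sketch. The name lcs_dirs is a hypothetical variant that returns the match instead of printing it, and rejects results that do not span whole directory components:

```perl
use strict;
use warnings;

# Same regex search as above, but the result is only accepted when it
# starts and ends with a slash, i.e. spans entire subdirectories.
sub lcs_dirs {
    my ($this, $that) = @_;

    my $str = join "\0", $this, $that;
    my $len = 1;
    my $lcs;
    while ($str =~ m{ ([^\0]{$len,}) (?= [^\0]* \0 [^\0]*? \1 ) }xg) {
        $lcs = $1;
        $len = 1 + length($1);
    }

    # reject no match, or a match not aligned on directory boundaries
    return undef if $len == 1 or $lcs !~ m{^/} or $lcs !~ m{/$};
    return $lcs;
}

print lcs_dirs(
    '/abc/def/some/common/part/xyz/file1.ext',
    '/misc/path/7433/qwe/some/common/part/filexx.ext',
), "\n";
```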

One thing that may or may not be a problem is that the following two lines:

/abc/another/same/path/to/ppp/thisfile.ext
/abc/another/different/path/to/ppp/otherfile.ext

Would match on:

/abc/another/

But not on:

/path/to/ppp/

But -- here's the bad news -- you will have to do O(n^2) comparisons with n=200,000 files. That could take an obscene amount of time.

Another solution would be to go through each path in your list, add all of its possible directory paths as keys to a hash and push the file itself to the hash (so that the value is an array of files that have this path in it). Something like this:

use strict;
use warnings;
my %links;

open my $fh, "<", 'filename' or die "Can't open: $!";
while (my $line = <$fh>) {
    chomp($line);
    my @dirs = split /\//, $line;
    pop @dirs;                                   # drop the file name itself
    shift @dirs while @dirs && !length $dirs[0]; # drop the empty leading element of an absolute path
    for my $i (0..$#dirs) {
        for my $j ($i..$#dirs) {
            push(@{ $links{join("/",@dirs[$i..$j])} }, $line);
            #PROCESS THIS if length of array is > 1
        }
    }
}

Of course, this would take an obscene amount of memory. With 200,000 files to process, you might have a hard time no matter what you try, but maybe you can break it up into more manageable chunks. Hopefully, this will give you a starting point.
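The processing hinted at in the #PROCESS THIS comment could, for instance, report every partial path shared by more than one file. A minimal sketch, using a hand-built %links in place of the one the loop above fills in:

```perl
use strict;
use warnings;

# Stand-in for the %links hash built above:
# partial path => array of files whose path contains it.
my %links = (
    'some/common/part' => [
        '/abc/def/some/common/part/xyz/file1.ext',
        '/misc/path/7433/qwe/some/common/part/filexx.ext',
    ],
    'abc/def' => [ '/abc/def/some/common/part/xyz/file1.ext' ],
);

# Report only the partial paths shared by more than one file.
for my $partial (sort keys %links) {
    my @files = @{ $links{$partial} };
    next unless @files > 1;
    printf "%3d files share %s\n", scalar @files, $partial;
}
```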

Upvotes: 2
