Reputation: 292
Please I am a little new to this field so pardon me if the question sound trivial or basic.
I have a group of dataset(Bag of words to be specific) and I need to generate a proximity matrix by using their edit distance from each other to find and generate the proximity matrix .
I am however quite confused how I will keep track of my data/strings in the matrix. I need the proximity matrix for the purpose of clustering.
Or How generally do you approach this kinds of problem in the field. I am using perl and R to implement this.
Here is a typical code in perl I have written that reads from a text file containing my bag of words
use strict ;
use warnings ;
use Text::Levenshtein qw(distance) ;
main(@ARGV);
sub main
{
my @TokenDistances ;
my $Tokenfile = 'TokenDistinct.txt';
my @Token ;
my $AppendingCount = 0 ;
my @Tokencompare ;
my %Levcount = ();
open (FH ,"< $Tokenfile" ) or die ("Error opening file . $!");
while(<FH>)
{
chomp $_;
$_ =~ s/^(\s+)$//g;
push (@Token , $_ );
}
close(FH);
@Tokencompare = @Token ;
foreach my $tokenWord(@Tokencompare)
{
my $lengthoffile = scalar @Tokencompare;
my $i = 0 ;
chomp $tokenWord ;
#@TokenDistances = levDistance($tokenWord , \@Tokencompare );
for($i = 0 ; $i < $lengthoffile ;$i++)
{
if(scalar @TokenDistances == scalar @Tokencompare)
{
print "Yipeeeeeeeeeeeeeeeeeeeee\n";
}
chomp $tokenWord ;
chomp $Tokencompare[$i];
#print $tokenWord. " {$Tokencompare[$i]} " . " $TokenDistances[$i] " . "\n";
#$Levcount{$tokenWord}{$Tokencompare[$i]} = $TokenDistances[$i];
$Levcount{$tokenWord}{$Tokencompare[$i]} = levDistance($tokenWord , $Tokencompare[$i] );
}
StoreSortedValues ( \%Levcount ,\$tokenWord , \$AppendingCount);
$AppendingCount++;
%Levcount = () ;
}
# %Levcount = ();
}
sub levDistance
{
my $string1 = shift ;
#my @StringList = @{(shift)};
my $string2 = shift ;
return distance($string1 , $string2);
}
sub StoreSortedValues {
my $Levcount = shift;
my $tokenWordTopMost = ${(shift)} ;
my $j = ${(shift)};
my @ListToken;
my $Tokenfile = 'LevResult.txt';
if($j == 0 )
{
open (FH ,"> $Tokenfile" ) or die ("Error opening file . $!");
}
else
{
open (FH ,">> $Tokenfile" ) or die ("Error opening file . $!");
}
print $tokenWordTopMost;
my %tokenWordMaster = %{$Levcount->{$tokenWordTopMost}};
@ListToken = sort { $tokenWordMaster{$a} cmp $tokenWordMaster{$b} } keys %tokenWordMaster;
#@ListToken = keys %tokenWordMaster;
print FH "-------------------------- " . $tokenWordTopMost . "-------------------------------------\n";
#print FH map {"$_ \t=> $tokenWordMaster{$_} \n "} @ListToken;
foreach my $tokey (@ListToken)
{
print FH "$tokey=>\t" . $tokenWordMaster{$tokey} . "\n"
}
close(FH) or die ("Error Closing File. $!");
}
the problem is how can I represent the proximity matrix from this and still be able to keep track of which comparison represent which in my matrix.
Upvotes: 6
Views: 2412
Reputation: 179578
In the RecordLinkage
package there is the levenshteinDist
function, which is one way of calculating an edit distance between strings.
install.packages("RecordLinkage")
library(RecordLinkage)
Set up some data:
fruit <- c("Apple", "Apricot", "Avocado", "Banana", "Bilberry", "Blackberry",
"Blackcurrant", "Blueberry", "Currant", "Cherry")
Now create a matrix consisting of zeros to reserve memory for the distance table. Then use nested for
loops to calculate the individual distances. We end with a matrix with a row and a column for each fruit. Thus we can rename the columns and rows to be identical to the original vector.
fdist <- matrix(rep(0, length(fruit)^2), ncol=length(fruit))
for(i in seq_along(fruit)){
for(j in seq_along(fruit)){
fdist[i, j] <- levenshteinDist(fruit[i], fruit[j])
}
}
rownames(fdist) <- colnames(fdist) <- fruit
The results:
fdist
Apple Apricot Avocado Banana Bilberry Blackberry Blackcurrant
Apple 0 5 6 6 7 9 12
Apricot 5 0 6 7 8 10 10
Avocado 6 6 0 6 8 9 10
Banana 6 7 6 0 7 8 8
Bilberry 7 8 8 7 0 4 9
Blackberry 9 10 9 8 4 0 5
Blackcurrant 12 10 10 8 9 5 0
Blueberry 8 9 9 8 3 3 8
Currant 7 5 6 5 8 10 6
Cherry 6 7 7 6 4 6 10
Upvotes: 7
Reputation: 20570
The proximity or similarity (or dissimilarity) matrix is just a table that stores the similarity score for pairs of objects. So, if you have N objects, then the R code can be simMat <- matrix(nrow = N, ncol = N)
, and then each entry, (i,j), of simMat
indicates the similarity between item i and item j.
In R, you can use several packages, including vwr
, to calculate the Levenshtein edit distance.
You may also find this Wikibook to be of interest: http://en.wikibooks.org/wiki/R_Programming/Text_Processing
Upvotes: 2