Reputation: 99
I have written a method to implement the Kullback-Leibler divergence in Java. I have used log base 2 and I am not sure whether that is correct or whether I should use log base 10. I am using this method to measure the divergence between two text units (each of a different length).
My problem is that I don't get the desired divergence measure.
For example, for the two text units "Free Ringtones" and "Free Ringtones for your Mobile Phone from PremieRingtones.com",
I should get a divergence of 0.25 (according to my project references), but I get a divergence of 2.0 with log base 2 and 1.38 with log base 10.
I am also unsure what value to substitute for a zero value in the denominator. Please help with a clear explanation, with some examples if possible, and links to where I can get more details.
This is my code snippet:
public Double calculateKLD(List<String> values, List<String> value2)
{
    // Count token occurrences in the first text unit
    Map<String, Integer> map = new HashMap<String, Integer>();
    // Count token occurrences in the second text unit
    Map<String, Integer> map2 = new HashMap<String, Integer>();
    for (String sequence : values)
    {
        if (!map.containsKey(sequence))
        {
            map.put(sequence, 0);
        }
        map.put(sequence, map.get(sequence) + 1);
    }
    for (String sequence : value2)
    {
        if (!map2.containsKey(sequence))
        {
            map2.put(sequence, 0);
        }
        map2.put(sequence, map2.get(sequence) + 1);
    }
    Double result = 0.0;
    Double frequency2 = 0.0;
    for (String sequence : map.keySet())
    {
        // Relative frequency of the token in the first text unit
        Double frequency1 = (double) map.get(sequence) / values.size();
        System.out.println("Frequency1 " + frequency1.toString());
        if (map2.containsKey(sequence))
        {
            frequency2 = (double) map2.get(sequence) / value2.size();
        }
        // When the token is missing from map2, frequency2 keeps its previous
        // value (0.0 on the first miss, so the division below blows up)
        result += frequency1 * (Math.log(frequency1 / frequency2) / Math.log(2)); // log base 2 via change of base
    }
    return result / 2.4;
}
My input is like this:
First text unit:
list.add("Free"); list.add("Ringtones");
Second text unit:
list2.add("Free"); list2.add("Ringtones"); list2.add("for"); list2.add("your"); list2.add("Mobile"); list2.add("Phone"); list2.add("from"); list2.add("PremieRingtones.com");
Calling function:
calculateKLD(list, list2)
Upvotes: 4
Views: 3564
Reputation: 871
As a guess, you probably want to use log base e (i.e., the natural logarithm). Since K-L divergence is a statistical measure, odds are that it's defined in terms of natural logarithms.
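Note that the base of the logarithm only changes the unit of the result (base 2 gives bits, base e gives nats), so switching bases rescales the divergence by a constant factor; it will not turn 2.0 into 0.25 on its own. The zero-denominator problem is usually handled by smoothing the distributions so that q(x) is never exactly zero; the definition and conventions are covered at https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence. Below is a minimal sketch of that idea, assuming simple additive (Laplace) smoothing over the union vocabulary of the two token lists; the names klDivergence, countTokens, countOf and the smoothing constant alpha are illustrative, not from the original post:

import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Minimal sketch: KL divergence with additive (Laplace) smoothing.
// Natural log gives the result in nats; divide by Math.log(2) for bits.
public static double klDivergence(List<String> p, List<String> q)
{
    Map<String, Integer> countsP = countTokens(p);
    Map<String, Integer> countsQ = countTokens(q);

    // Union vocabulary, so both distributions cover the same set of tokens
    Set<String> vocab = new HashSet<String>(countsP.keySet());
    vocab.addAll(countsQ.keySet());

    double alpha = 1.0; // smoothing constant (illustrative choice)
    double denomP = p.size() + alpha * vocab.size();
    double denomQ = q.size() + alpha * vocab.size();

    double result = 0.0;
    for (String token : vocab)
    {
        double pX = (countOf(countsP, token) + alpha) / denomP; // smoothed p(x)
        double qX = (countOf(countsQ, token) + alpha) / denomQ; // smoothed q(x), never zero
        result += pX * Math.log(pX / qX);
    }
    return result;
}

private static Map<String, Integer> countTokens(List<String> tokens)
{
    Map<String, Integer> counts = new HashMap<String, Integer>();
    for (String t : tokens)
    {
        Integer c = counts.get(t);
        counts.put(t, c == null ? 1 : c + 1);
    }
    return counts;
}

private static int countOf(Map<String, Integer> counts, String token)
{
    Integer c = counts.get(token);
    return c == null ? 0 : c;
}

Calling klDivergence(list, list2) on the inputs above then returns a finite value; how close it comes to a reference figure like 0.25 depends on the smoothing constant and on which log base the reference used.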
Upvotes: 2