Reputation: 139
I use a Dico class to store the weight of a term and the id of the document in which it appears:
public class Dico
{
    private String m_term;    // term
    private double m_weight;  // weight of the term
    private int m_Id_doc;     // id of the document that contains the term

    public Dico(int Id_Doc, String Term, double tf_ief)
    {
        this.m_Id_doc = Id_Doc;
        this.m_term = Term;
        this.m_weight = tf_ief;
    }

    public String getTerm()
    {
        return this.m_term;
    }

    public double getWeight()
    {
        return this.m_weight;
    }

    public void setWeight(double weight)
    {
        this.m_weight = weight;
    }

    public int getDocId()
    {
        return this.m_Id_doc;
    }
}
And I use this method to calculate the final weight from a Map<String,Double> and a List<Dico>:
public List<Dico> merge_list_map(List<Dico> list, Map<String, Double> map)
{
    // in the map each term is unique, but in the list there is redundancy
    List<Dico> list_term_weight = new ArrayList<>();
    for (Map.Entry<String, Double> entrySet : map.entrySet())
    {
        String key = entrySet.getKey();
        Double value = entrySet.getValue();
        for (Dico dic : list)
        {
            String term = dic.getTerm();
            double weight = dic.getWeight();
            if (key.equals(term))
            {
                double new_weight = weight * value;
                list_term_weight.add(new Dico(dic.getDocId(), term, new_weight));
            }
        }
    }
    return list_term_weight;
}
I have 36,736 elements in the map and 1,053,914 in the list; currently this program takes a very long time to run: BUILD SUCCESSFUL (total time: 17 minutes 15 seconds).
How can I get only the terms from the list that match the terms from the map?
Upvotes: 2
Views: 180
Reputation: 6033
For the merge_list_map function to be efficient, you need to actually use the Map for what it is: an efficient data structure for key lookup.
Looping over the Map entries and searching for a match in the List, as you are doing, makes the algorithm O(N*M), where M is the size of the map and N the size of the list; with your numbers that is roughly 36,736 × 1,053,914 ≈ 3.9 × 10^10 term comparisons. That is certainly the worst you can get.
If instead you loop through the List first and, for each term, do a lookup in the Map with Map.get(String key), you get a time complexity of O(N), since a map lookup can be considered O(1).
In terms of design, and if you can use Java 8, your problem can be translated in terms of Streams:
public static List<Dico> merge_list_map(List<Dico> dico, Map<String, Double> weights) {
    List<Dico> wDico = dico.stream()
            .filter(d -> weights.containsKey(d.getTerm()))
            .map(d -> new Dico(d.getDocId(), d.getTerm(), d.getWeight() * weights.get(d.getTerm())))
            .collect(Collectors.toList());
    return wDico;
}
The new weighted list is built following a logical process:
- stream(): take the list as a stream of Dico elements
- filter(): keep only the Dico elements whose term is in the weights map
- map(): for each filtered element, create a new Dico instance with the computed weight
- collect(): collect all the new instances into a new List<Dico> with the new weights

Performance-wise, I tested it against some text, the narrative of Arthur Gordon Pym by E. A. Poe:
String text = null;
try (InputStream url = new URL("http://www.gutenberg.org/files/2149/2149-h/2149-h.htm").openStream()) {
    text = new Scanner(url, "UTF-8").useDelimiter("\\A").next();
}
String[] words = text.split("[\\p{Punct}\\s]+");
System.out.println(words.length); // => 108028
Since there are only about 100k words in the book, multiply the list by 10 for good measure (initDico() is a helper that builds the List<Dico> from the words):
List<Dico> dico = initDico(words);
List<Dico> bigDico = new ArrayList<>(10 * dico.size());
for (int i = 0; i < 10; i++) {
    bigDico.addAll(dico);
}
System.out.println(bigDico.size()); // 1080280
Build the weights map using all the words (initWeights() builds a frequency map of the words in the book):
Map<String, Double> weights = initWeights(words);
System.out.println(weights.size()); // 9449 distinct words
Then the test of merging the 1M words against the map of weights:
long start = System.currentTimeMillis();
List<Dico> wDico = merge_list_map(bigDico, weights);
long end = System.currentTimeMillis();
System.out.println("===== Elapsed time (ms): "+(end-start));
// => 105 ms
The weights map is significantly smaller than yours, but that should not affect the timing, since the lookup operations run in quasi-constant time.
This is no serious benchmark for the function, but it already shows that merge_list_map() should take well under 1 s (loading the text and building the list and the map are not part of the function).
Just to complete the exercise, following are the initialisation methods used in the test above:
private static List<Dico> initDico(String[] terms) {
    List<Dico> dico = Arrays.stream(terms)
            .map(String::toLowerCase)
            .map(s -> new Dico(0, s, 1.0)) // the doc id is irrelevant for this timing test
            .collect(Collectors.toList());
    return dico;
}

// the weight of a word is its frequency * 1000
private static Map<String, Double> initWeights(String[] terms) {
    Map<String, Long> wfreq = termFreq(terms);
    long total = wfreq.values().stream().reduce(0L, Long::sum);
    return wfreq.entrySet().stream()
            .collect(Collectors.toMap(Map.Entry::getKey, e -> 1000.0 * e.getValue() / total));
}

private static Map<String, Long> termFreq(String[] terms) {
    Map<String, Long> wfreq = Arrays.stream(terms)
            .map(String::toLowerCase)
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    return wfreq;
}
Upvotes: 1
Reputation: 191
Use the lookup functionality of the Map, as Adam pointed out, and use HashMap as the implementation of Map; HashMap lookup complexity is O(1). This should result in increased performance.
Upvotes: 0
Reputation: 36723
You can use the lookup functionality of the Map, i.e. Map.get(), given that your map maps terms to weights. This should bring a significant performance improvement. The only difference is that the output list is in the same order as the input list, rather than in the order the keys occur in the weighting Map.
public List<Dico> merge_list_map(List<Dico> list, Map<String, Double> map)
{
    // in the map each term is unique, but in the list there is redundancy
    List<Dico> list_term_weight = new ArrayList<>();
    for (Dico dic : list)
    {
        String term = dic.getTerm();
        double weight = dic.getWeight();
        Double value = map.get(term); // <== fetch weight from Map
        if (value != null)
        {
            double new_weight = weight * value;
            list_term_weight.add(new Dico(dic.getDocId(), term, new_weight));
        }
    }
    return list_term_weight;
}
List<Dico> list = Arrays.asList(new Dico(1, "foo", 1), new Dico(2, "bar", 2), new Dico(3, "baz", 3));
Map<String, Double> weights = new HashMap<String, Double>();
weights.put("foo", 2d);
weights.put("bar", 3d);
System.out.println(merge_list_map(list, weights));
Output
[Dico [m_term=foo, m_weight=2.0, m_Id_doc=1], Dico [m_term=bar, m_weight=6.0, m_Id_doc=2]]
List<Dico> list = new ArrayList<Dico>();
Map<String, Double> weights = new HashMap<String, Double>();
for (int i = 0; i < 1e4; i++) {
    list.add(new Dico(i, "foo-" + i, i));
    if (i % 3 == 0) {
        weights.put("foo-" + i, (double) i); // <== every 3rd has a weight
    }
}
long t0 = System.currentTimeMillis();
// merge_list_map_original is the nested-loop version from the question;
// merge_list_map_fast is the Map.get() version above
List<Dico> result1 = merge_list_map_original(list, weights);
long t1 = System.currentTimeMillis();
List<Dico> result2 = merge_list_map_fast(list, weights);
long t2 = System.currentTimeMillis();
System.out.println(String.format("Original: %d ms", t1 - t0));
System.out.println(String.format("Fast: %d ms", t2 - t1));
// prove the results are equivalent, just in a different order;
// requires Dico to have hashCode()/equals() - used the Eclipse default generator
System.out.println(new HashSet<Dico>(result1).equals(new HashSet<Dico>(result2)));
Output
Original: 1005 ms
Fast: 16 ms <=== loads quicker
true
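The printed Dico [m_term=..., ...] output and the HashSet comparison above assume that Dico also overrides toString(), equals() and hashCode(), which the class in the question does not define. A minimal sketch of what could be added to Dico (using java.util.Objects for brevity rather than the exact Eclipse-generated code):
@Override
public String toString() {
    return "Dico [m_term=" + m_term + ", m_weight=" + m_weight + ", m_Id_doc=" + m_Id_doc + "]";
}

@Override
public int hashCode() {
    // requires java.util.Objects (Java 7+)
    return Objects.hash(m_term, m_weight, m_Id_doc);
}

@Override
public boolean equals(Object obj) {
    if (this == obj) return true;
    if (!(obj instanceof Dico)) return false;
    Dico other = (Dico) obj;
    return m_Id_doc == other.m_Id_doc
            && Double.compare(m_weight, other.m_weight) == 0
            && Objects.equals(m_term, other.m_term);
}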
Upvotes: 4
Reputation: 5168
Also, check the initialization of the Map (http://docs.oracle.com/javase/7/docs/api/java/util/HashMap.html). Rehashing the map is costly in terms of performance.
As a general rule, the default load factor (.75) offers a good tradeoff between time and space costs. Higher values decrease the space overhead but increase the lookup cost (reflected in most of the operations of the HashMap class, including get and put). The expected number of entries in the map and its load factor should be taken into account when setting its initial capacity, so as to minimize the number of rehash operations. If the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash operations will ever occur.
If many mappings are to be stored in a HashMap instance, creating it with a sufficiently large capacity will allow the mappings to be stored more efficiently than letting it perform automatic rehashing as needed to grow the table.
If you know, or can approximate, the number of elements that you will put in the map, you can create your Map like this:
Map<String, Double> foo = new HashMap<String, Double>(maxSize * 2);
In my experience, you can increase your performance by a factor of 2 or more.
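For instance, with the roughly 36,736 distinct terms from the question, a capacity derived from the default load factor of 0.75 guarantees that no rehash ever happens (a sketch; expectedTerms is just an illustrative name):
int expectedTerms = 36736; // distinct terms expected in the map
// capacity > expectedTerms / loadFactor, so the table never needs to be resized
Map<String, Double> weights = new HashMap<>((int) (expectedTerms / 0.75f) + 1);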
Upvotes: 1
Reputation: 335
You could use the contains() method of the list; that way you avoid the second for loop. Even though contains() has O(n) complexity, you should see a small improvement. Of course, remember to re-implement equals(). Otherwise you should use a second Map, as both suggested; see the sketch below.
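A sketch of the "second Map" idea: index the list by term once, then walk the weights map and only touch the matching entries (indexByTerm and merged are illustrative names, not from the original code):
// build an index: term -> all Dico entries with that term (one pass over the list)
Map<String, List<Dico>> indexByTerm = new HashMap<>();
for (Dico dic : list) {
    indexByTerm.computeIfAbsent(dic.getTerm(), k -> new ArrayList<>()).add(dic);
}

// then merge using direct lookups instead of scanning the whole list per term
List<Dico> merged = new ArrayList<>();
for (Map.Entry<String, Double> e : map.entrySet()) {
    List<Dico> matches = indexByTerm.get(e.getKey());
    if (matches != null) {
        for (Dico dic : matches) {
            merged.add(new Dico(dic.getDocId(), dic.getTerm(), dic.getWeight() * e.getValue()));
        }
    }
}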
Upvotes: 0