tommy

Reputation: 139

Increase speed of composition from list and map

I use a Dico class to store the weight of a term and the id of the document in which it appears:

public class Dico 
{
   private String m_term; // term
   private double m_weight; // weight of term
   private int m_Id_doc; // id of doc that contain term

   public Dico(int Id_Doc, String Term, double tf_ief)
   {
      this.m_Id_doc = Id_Doc;
      this.m_term = Term;
      this.m_weight = tf_ief;
   }
   public String getTerm()
   {
      return this.m_term;
   }

   public double getWeight()
   {
     return this.m_weight;
   }

   public void setWeight(double weight)
   {
     this.m_weight= weight;
   }

   public int getDocId()
   {
     return this.m_Id_doc;
   }                
}

And I use this method to calculate the final weight from a Map<String,Double> and a List<Dico>:

public List<Dico> merge_list_map(List<Dico> list, Map<String, Double> map)
{
    // in map each term is unique, but in list I have redundancy

    List<Dico> list_term_weight = new ArrayList<>();

    for (Map.Entry<String, Double> entrySet : map.entrySet())
    {
        String key = entrySet.getKey();
        Double value = entrySet.getValue();

        for (Dico dic : list)
        {
            String term = dic.getTerm();
            double weight = dic.getWeight();

            if (key.equals(term))
            {
                double new_weight = weight * value;
                list_term_weight.add(new Dico(dic.getDocId(), term, new_weight));
            }
        }
    }
    return list_term_weight;
}

I have 36736 elements in the map and 1053914 in the list, and currently this program takes a long time to run: BUILD SUCCESSFUL (total time: 17 minutes 15 seconds).

How can I get only the terms from the list that are equal to a term from the map?

Upvotes: 2

Views: 180

Answers (5)

T.Gounelle

Reputation: 6033

In order for the merge_list_map function to be efficient, you need to actually use the Map for what it is: an efficient data structure for key lookup. The way you are doing it, looping over the Map entries and looking for a match in the List, the algorithm is O(N*M), where M is the size of the map and N is the size of the list. That is certainly the worst you can get.

If you loop first through the List and then, for each term, do a lookup in the Map with Map.get(String key), you will get a time complexity of O(N), since a map lookup can be considered O(1).
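
As a minimal sketch (reusing the Dico class from the question; the method name is illustrative), that lookup-based loop could look like this:

// One pass over the list, one O(1) map lookup per element
public static List<Dico> mergeByLookup(List<Dico> list, Map<String, Double> map) {
    List<Dico> result = new ArrayList<>();
    for (Dico dic : list) {
        Double value = map.get(dic.getTerm());   // null when the term has no weight in the map
        if (value != null) {
            result.add(new Dico(dic.getDocId(), dic.getTerm(), dic.getWeight() * value));
        }
    }
    return result;
}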

In terms of design, and if you can use Java 8, your problem can be translated into Streams:

public static List<Dico> merge_list_map(List<Dico> dico, Map<String, Double> weights) {
    List<Dico> wDico = dico.stream()
            .filter(d -> weights.containsKey(d.getTerm()))
            .map(d -> new Dico(d.getDocId(), d.getTerm(), d.getWeight() * weights.get(d.getTerm())))
            .collect(Collectors.toList());
    return wDico;
}

The new weighted list is built following a logical process:

  1. stream(): take the list as a stream of Dico elements
  2. filter(): keep only the Dico elements whose term is in the weights map
  3. map(): for each filtered element, create a new Dico() instance with the computed weight.
  4. collect(): collect all the new instances in a new list
  5. return the new list that contains the filtered Dico with the new weight.

Performance-wise, I tested it against some text, The Narrative of Arthur Gordon Pym by E. A. Poe:

String text = null;
try (InputStream url = new URL("http://www.gutenberg.org/files/2149/2149-h/2149-h.htm").openStream())  {
    text = new Scanner(url, "UTF-8").useDelimiter("\\A").next();    
}
String[] words = text.split("[\\p{Punct}\\s]+");
System.out.println(words.length); // => 108028

Since there are only about 100k words in the book, for good measure the list is simply repeated 10 times (initDico() is a helper that builds the List<Dico> from the words):

List<Dico> dico = initDico(words);
List<Dico> bigDico = new ArrayList<>(10*dico.size());
for (int i = 0; i < 10; i++) {
    bigDico.addAll(dico);
}
System.out.println(bigDico.size()); // 1080280

Build the weights map, using all words (initWeights() builds a frequency map of the words in the book):

Map<String, Double> weights = initWeights(words);
System.out.println(weights.size()); // 9449 distinct words

Then the test of merging the 1M words against the map of weights:

long start = System.currentTimeMillis();
List<Dico> wDico = merge_list_map(bigDico, weights);
long end = System.currentTimeMillis();
System.out.println("===== Elapsed time (ms): "+(end-start)); 
// => 105 ms

The weights map is significantly smaller than yours, but that should not impact the timing, since the lookup operations are in quasi-constant time.

This is no serious benchmark for the function, but it already shows that merge_list_map() should take well under 1 s (loading and building the list and the map are not part of the measured function).


Just to complete the exercise, the following are the initialisation methods used in the test above:

private static List<Dico> initDico(String[] terms) {
    List<Dico> dico = Arrays.stream(terms)
            .map(String::toLowerCase)
            .map(s -> new Dico(0, s, 1.0))   // doc id is irrelevant for this test
            .collect(Collectors.toList());
    return dico;
}

// weight of a word is the frequency*1000
private static Map<String, Double> initWeights(String[] terms) {
    Map<String, Long> wfreq = termFreq(terms);
    long total = wfreq.values().stream().reduce(0L, Long::sum);
    return wfreq.entrySet().stream()
            .collect(Collectors.toMap(Map.Entry::getKey, e -> 1000.0 * e.getValue() / total));
}

private static Map<String, Long> termFreq(String[] terms) {
    // requires: import static java.util.stream.Collectors.groupingBy;
    //           import static java.util.stream.Collectors.counting;
    Map<String, Long> wfreq = Arrays.stream(terms)
            .map(String::toLowerCase)
            .collect(groupingBy(Function.identity(), counting()));
    return wfreq;
}

Upvotes: 1

Gustavo Zago Basilio

Reputation: 191

Use the lookup functionality of the Map, as Adam pointed out, and use HashMap as the Map implementation - HashMap lookup is O(1) on average. This should result in increased performance.
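
If the weights happen to be held in a non-hash Map (for example a TreeMap), a minimal sketch of copying them into a HashMap once before the merge (variable names are illustrative):

Map<String, Double> sortedWeights = new TreeMap<>();              // e.g. an existing non-hash map of term -> weight
Map<String, Double> hashWeights = new HashMap<>(sortedWeights);   // copy once, then do every lookup against the HashMap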

Upvotes: 0

Adam

Reputation: 36723

You can use the lookup functionality of the Map, i.e. Map.get(), given that your map maps terms to weights. This should give a significant performance improvement. The only difference is that the output list is in the same order as the input list, rather than the order in which the keys occur in the weighting Map.

public List<Dico> merge_list_map(List<Dico> list, Map<String, Double> map)
{
    // in map each term is unique but in list i have redundancy
    List<Dico> list_term_weight = new ArrayList<>();

    for (Dico dic : list)
    {
        String term = dic.getTerm();
        double weight = dic.getWeight();

        Double value = map.get(term);  // <== fetch weight from Map
        if (value != null)
        {
            double new_weight = weight * value;

            list_term_weight.add(new Dico(dic.getDocId(), term, new_weight));

        }
    }
    return list_term_weight;
}

Basic test

List<Dico> list = Arrays.asList(new Dico(1, "foo", 1), new Dico(2, "bar", 2), new Dico(3, "baz", 3));
Map<String, Double> weights = new HashMap<String, Double>();
weights.put("foo", 2d);
weights.put("bar", 3d);
System.out.println(merge_list_map(list, weights));

Output

[Dico [m_term=foo, m_weight=2.0, m_Id_doc=1], Dico [m_term=bar, m_weight=6.0, m_Id_doc=2]]

Timing test - 10,000 elements

List<Dico> list = new ArrayList<Dico>();
Map<String, Double> weights = new HashMap<String, Double>();
for (int i = 0; i < 1e4; i++) {
    list.add(new Dico(i, "foo-" + i, i));
    if (i % 3 == 0) {
        weights.put("foo-" + i, (double) i);  // <== every 3rd has a weight
    }
}

long t0 = System.currentTimeMillis();
List<Dico> result1 = merge_list_map_original(list, weights);
long t1 = System.currentTimeMillis();
List<Dico> result2 = merge_list_map_fast(list, weights);
long t2 = System.currentTimeMillis();

System.out.println(String.format("Original: %d ms", t1 - t0));
System.out.println(String.format("Fast:     %d ms", t2 - t1));

// prove results equivalent, just different order
// requires the Dico class to have hashCode()/equals() - used the Eclipse default generator (a sketch follows the output below)
System.out.println(new HashSet<Dico>(result1).equals(new HashSet<Dico>(result2)));

Output

Original: 1005 ms
Fast:     16 ms  <=== loads quicker
true
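
The set comparison above relies on Dico overriding equals() and hashCode(), and the printed output implies a toString(); a minimal sketch of what those overrides could look like (assumed additions, not part of the original class):

// Inside the Dico class; requires import java.util.Objects;
@Override
public boolean equals(Object o) {
    if (this == o) return true;
    if (!(o instanceof Dico)) return false;
    Dico other = (Dico) o;
    return m_Id_doc == other.m_Id_doc
            && Double.compare(m_weight, other.m_weight) == 0
            && Objects.equals(m_term, other.m_term);
}

@Override
public int hashCode() {
    return Objects.hash(m_term, m_weight, m_Id_doc);
}

@Override
public String toString() {
    return "Dico [m_term=" + m_term + ", m_weight=" + m_weight + ", m_Id_doc=" + m_Id_doc + "]";
}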

Upvotes: 4

JFPicard

Reputation: 5168

Also, check the initialization of the Map (http://docs.oracle.com/javase/7/docs/api/java/util/HashMap.html). Rehashing the map is costly in terms of performance.

As a general rule, the default load factor (.75) offers a good tradeoff between time and space costs. Higher values decrease the space overhead but increase the lookup cost (reflected in most of the operations of the HashMap class, including get and put). The expected number of entries in the map and its load factor should be taken into account when setting its initial capacity, so as to minimize the number of rehash operations. If the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash operations will ever occur.

If many mappings are to be stored in a HashMap instance, creating it with a sufficiently large capacity will allow the mappings to be stored more efficiently than letting it perform automatic rehashing as needed to grow the table.

If you know, or have an approximation of, the number of elements that you will put in the map, you can create your Map like this:

Map<String, Double> foo = new HashMap<String, Double>(maxSize * 2);

In my experience, you can increase your performance by a factor of 2 or more.
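
For example, a sketch of pre-sizing the map from an expected entry count using the load-factor rule quoted above (36736 is the map size from the question, used here as a placeholder):

// Pick a capacity greater than expectedEntries / loadFactor so no rehash ever occurs
int expectedEntries = 36736;                                 // expected number of distinct terms
int initialCapacity = (int) (expectedEntries / 0.75f) + 1;   // default load factor is 0.75
Map<String, Double> weights = new HashMap<>(initialCapacity);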

Upvotes: 1

Fabrizio Morello

Reputation: 335

You could use the contains() method of the list; that way you avoid the second for loop. Even though contains() has O(n) complexity, you should see a small improvement. Of course, remember to re-implement equals(). Otherwise you should use a second Map, as others suggested.

Upvotes: 0
