mossaab
mossaab

Reputation: 1834

Duplicate values stored in HashMap

I have a dictionary as a text file mapping from 2M words to 50k words. I load this file into memory as HashMap<String, String> by reading the file line by line, splitting on a separator and invoking myMap.put(line[0], line[1]). The size of the text file is 45MB, while the HashMap uses 350MB of the heap. My goal is to reduce memory usage without harming lookup speed. myMap.values().size() returns 2M instead of 50k, suggesting that the values are stored as duplicates. Is there a way to make identical values point to the same String object?

Map<String, String> dict = new HashMap<>();
try (FileReader fr = new FileReader(FILE);
        BufferedReader br = new BufferedReader(fr)) {
    String line;
    while ((line = br.readLine()) != null) {
        String key_value[] = line.split(":");
        dict.put(key_value[0], key_value[1].intern());
    }
} catch (Exception e) {
    e.printStackTrace();
}

Upvotes: 1

Views: 1985

Answers (2)

joe776
joe776

Reputation: 1116

You can use String.intern() on the values to make them all point to the same instance. But this has other problems like using the PermGenSpace, which is not garbage collected pre-Java 1.7. You would call it like this: myMap.put(line[0], line[1].intern()).

Maybe a map based on a trie is more efficient, but I haven't used that, yet. Also depends on the nature of your Strings. The more alike your keys are, the more space the trie can save.

http://code.google.com/p/trie-map/

Also see Dukeling's answer concerning keys().size() and values().size() and the use of another map to avoid duplicate values.

Upvotes: 2

Bernhard Barker
Bernhard Barker

Reputation: 55639

Regardless of whether or not duplicates point to the same objects, there will still need to be references to these objects, so size should still return the size with duplicates included.

A simple example showing this.

If you want duplicates to point to the same objects, you'll have to do this outside of the HashMap or hope the optimizer takes care of it.

Alternatives to String.intern() as joe776 suggested are possibly with a self-written collection extending some Set (since Set doesn't have a Object get(Object) method) or another HashMap (having objects point to themselves) that allows you to get a reference to the common object.

Upvotes: 5

Related Questions