Tamojit Chatterjee
Tamojit Chatterjee

Reputation: 161

Optimizing space usage in a trie structure in a java program

I am trying to implement a trie structure in Java with 203675 words for a text editor.

Before, I was using ArrayList to store the words and that was taking 90 megabytes of space. So I want to use a trie to minimize space consumption.

Here is what I have so far, but now space consumption is 250 megabytes. What is the reason for this increase?

package TextEditor;

import java.io.*;
import java.util.*;
import javax.swing.JOptionPane;

class Vertex {
    int words;
    Map<Character, Vertex> child;
    public Vertex() {
        words = 0;
        child = new HashMap<>();
    }
}
class Trie {
    private Vertex root;
    private InputStream openFile;
    private OutputStream openWriteFile;
    private BufferedReader readFile;
    private BufferedWriter writeFile;
    public Trie() {
        root = new Vertex();
    }
    public Trie(String path) {
         try {
            root = new Vertex();
            openFile = getClass().getResourceAsStream(path);
            readFile = new BufferedReader( new InputStreamReader(openFile));
            String in = readFile.readLine();
                    while(readFile.ready()) {
                        this.insert(in);
                    try {
                        in = readFile.readLine();
                    } catch (IOException ex) {
                        JOptionPane.showMessageDialog(null, 
                            "TRIE CONSTRUCTION ERROR!!!!");
                    }
                    }
        } catch (IOException ex) {
            JOptionPane.showMessageDialog(null, 
                "TRIE CONSTRUCTION ERROR!!!!");
        }
    }
    private void addWord(Vertex vertex, String s, int i) {
        try {
        if(i>=s.length()) {
            vertex.words += 1;
            return;
        }
        char ind  = s.charAt(i);
        if(!vertex.child.containsKey(ind)) {
            vertex.child.put(ind, new Vertex());
        }
    addWord(vertex.child.get(ind), s, i+1);
        } catch(Exception e) {
            e.printStackTrace();
            System.exit(1);
        }
    }
    final void insert(String s) {
        addWord(root, s.toLowerCase(), 0);
    }
    private void DFS(Vertex v, String s, ArrayList list, 
        boolean store, String startsWith, int ind) {
    if(v != null && v.words != 0) {
            if(!store) {
                System.out.println(s);
            }
            else {
                if(s.length() >= startsWith.length()) {
                    list.add(s);
                }
            }
        }
        for (Map.Entry<Character, Vertex> entry : v.child.entrySet()) {
            Character c = entry.getKey();
            if((startsWith ==  null) || (ind>=startsWith.length()) || 
                (startsWith.charAt(ind) == c)) {
                    DFS(v.child.get(c), s + c, list, store, startsWith, ind+1);
             }
        }
    }
    public void Print() {
        DFS(root, new  String(""), null, false, null, 0);
    }
    ArrayList<String> getAsList(String startsWith) {
        ArrayList ret = new ArrayList();
        DFS(root, new  String(""), ret, true, startsWith, 0);
        return ret;
    }
    int count(Vertex  vertex, String s, int i) {
    if(i >= s.length()) {
            return vertex.words;
        }
    if(!vertex.child.containsKey(s.charAt(i))) {
            return 0;
        }
        return count(vertex.child.get(s.charAt(i)), s, i+1);
    }
    int count(String s) {   
        return count(root, s, 0);
    }
}

Is there a working example of a trie structure I can use?

Upvotes: 2

Views: 1820

Answers (2)

3xil3
3xil3

Reputation: 1586

First of all, separate the datastructure (your trie) from any code filling it. It just needs to hold the data in a structured form, and provide some basic functionality, that's it. Filling it should happen outside that datastructure itself so you can properly handle the streams. There is not a single good reason to have your trie fill itself by giving a path as a param. To clarify my first point - pulling the filling out of the trie: currently the streams gobble up a lot of memory inside the trie because they are held in private variables and the streams are never closed or destroyed. which means you keep the the file loaded in memory on top of the filled datastructure. Otherwise garbage collection can clean up those items just like using the arraylist.

Please don't reinvent the wheel and use a basic implementation such as the following. Get it working with this basic setup, and worry about improving it later.

public class Trie {

    private Map<String, Node> roots = new HashMap<>();

    public Trie() {}

    public Trie(List<String> argInitialWords) {
            for (String word:argInitialWords) {
                    addWord(word);
            }
    }

    public void addWord(String argWord) {
            addWord(argWord.toCharArray());
    }

    public void addWord(char[] argWord) {
            Node currentNode = null;

            if (!roots.containsKey(Character.toString(argWord[0]))) {
                    roots.put(Character.toString(argWord[0]), new Node(argWord[0], "" + argWord[0]));
            }

            currentNode = roots.get(Character.toString(argWord[0]));

            for (int i = 1; i < argWord.length; i++) {
                    if (currentNode.getChild(argWord[i]) == null) {
                            currentNode.addChild(new Node(argWord[i], currentNode.getValue() + argWord[i]));
                    }

                    currentNode = currentNode.getChild(argWord[i]);
            }

            currentNode.setIsWord(true);
    }

    public boolean containsPrefix(String argPrefix) {
            return contains(argPrefix.toCharArray(), false);
    }

    public boolean containsWord(String argWord) {
            return contains(argWord.toCharArray(), true);
    }

    public Node getWord(String argString) {
            Node node = getNode(argString.toCharArray());
            return node != null && node.isWord() ? node : null;
    }

    public Node getPrefix(String argString) {
            return getNode(argString.toCharArray());
    }

    @Override
    public String toString() {
            return roots.toString();
    }

    private boolean contains(char[] argString, boolean argIsWord) {
            Node node = getNode(argString);
            return (node != null && node.isWord() && argIsWord) || (!argIsWord && node != null);
    }

    private Node getNode(char[] argString) {
            Node currentNode = roots.get(Character.toString(argString[0]));

            for (int i = 1; i < argString.length && currentNode != null; i++) {
                    currentNode = currentNode.getChild(argString[i]);

                    if (currentNode == null) {
                            return null;
                    }
            }

            return currentNode;
    }
}

public class Node {

    private final Character ch;
    private final String value;
    private Map<String, Node> children = new HashMap<>();
    private boolean isValidWord;

    public Node(char argChar, String argValue) {
            ch = argChar;
            value = argValue;
    }

    public boolean addChild(Node argChild) {
            if (children.containsKey(Character.toString(argChild.getChar()))) {
                    return false;
            }

            children.put(Character.toString(argChild.getChar()), argChild);
            return true;
    }

    public boolean containsChildValue(char c) {
            return children.containsKey(Character.toString(c));
    }

    public String getValue() {
            return value.toString();
    }

    public char getChar() {
            return ch;
    }

    public Node getChild(char c) {
            return children.get(Character.toString(c));
    }

    public boolean isWord() {
            return isValidWord;
    }

    public void setIsWord(boolean argIsWord) {
            isValidWord = argIsWord;

    }

    public String toString() {
            return value;
    }

}

If you are considering memory usage improvements (at the cost of performance) you can do it in the following ways (seperate or combined)

  • by switching the object Character to it's primitive char form, this will save you the overhead of the bytes used for the object aswell as any internal private variables
  • by switching the value argument of a Node to type char[], you will save yourself another String object in each node
  • by implementing trie compression and merging the common branches. This will eliminate the need for a bunch of nodes. How many nodes will be spared will depend on the actual content entry and the simularity between the entered words. The more simular words are, the less the trie can be compressed and less nodes will be spared. And thus less memory will be freed up
  • by switching the hashmap implementation to a more memory friendly implementation (at the cost of lookup and insertion speed). The thing that would work best is a datastructure which wouldn't take more memory than it needs for holding the keys. For example: if a node is known to hold exactly 3 keys, an array of length 3 would be best for that node in terms of memory consumption. In practise, a sortedSet with a low start capacity should work better than a hashmap in terms of memory consumption because you skip the need for holding hashcodes but yet works easier to insert and search in than an array.

In general a well implemented trie, and I stress the well implemented should be about equal to the memory consumption of the 90Mb for the same dataset you are entering in it although it will depend entirely on the actual dataset.

If you managed to put together a list of words where most words aren't prefixes of any other word. Your memory usage will be far greater than with an ArrayList because you need way more nodes to represent the same thing.

If you really want to save some memory for a true random dataset, you should have a look at Burst tries, another viable alternative could be the patricia trie.

Upvotes: 0

Your use of the word "space" is ambiguous. Based on your description, it sounds like you're talking about the heap. If so, the reason for the increased memory usage is that a data structure like a trie actually does take up extra memory to store its references between nodes. An ArrayList just packs everything in, one String reference after another, and it doesn't have any additional information beyond how long the array is. A trie has lots more bookkeeping to specify the relationships between all of the nodes.

In particular, the HashMap in each vertex is going to be extremely expensive; the Sun implementation allocates enough space for a 16-entry map by default, and that requires storage for the map's own memory-allocation record, hashCodes (32-bit ints, not chars), the object wrappers for each Character...

Upvotes: 1

Related Questions