Reputation: 161
I am trying to implement a trie structure in Java with 203675 words for a text editor.
Before, I was using ArrayList to store the words and that was taking 90 megabytes of space. So I want to use a trie to minimize space consumption.
Here is what I have so far, but now space consumption is 250 megabytes. What is the reason for this increase?
package TextEditor;
import java.io.*;
import java.util.*;
import javax.swing.JOptionPane;
class Vertex {
int words;
Map<Character, Vertex> child;
public Vertex() {
words = 0;
child = new HashMap<>();
}
}
class Trie {
private Vertex root;
private InputStream openFile;
private OutputStream openWriteFile;
private BufferedReader readFile;
private BufferedWriter writeFile;
public Trie() {
root = new Vertex();
}
public Trie(String path) {
try {
root = new Vertex();
openFile = getClass().getResourceAsStream(path);
readFile = new BufferedReader( new InputStreamReader(openFile));
String in = readFile.readLine();
while(readFile.ready()) {
this.insert(in);
try {
in = readFile.readLine();
} catch (IOException ex) {
JOptionPane.showMessageDialog(null,
"TRIE CONSTRUCTION ERROR!!!!");
}
}
} catch (IOException ex) {
JOptionPane.showMessageDialog(null,
"TRIE CONSTRUCTION ERROR!!!!");
}
}
private void addWord(Vertex vertex, String s, int i) {
try {
if(i>=s.length()) {
vertex.words += 1;
return;
}
char ind = s.charAt(i);
if(!vertex.child.containsKey(ind)) {
vertex.child.put(ind, new Vertex());
}
addWord(vertex.child.get(ind), s, i+1);
} catch(Exception e) {
e.printStackTrace();
System.exit(1);
}
}
final void insert(String s) {
addWord(root, s.toLowerCase(), 0);
}
private void DFS(Vertex v, String s, ArrayList list,
boolean store, String startsWith, int ind) {
if(v != null && v.words != 0) {
if(!store) {
System.out.println(s);
}
else {
if(s.length() >= startsWith.length()) {
list.add(s);
}
}
}
for (Map.Entry<Character, Vertex> entry : v.child.entrySet()) {
Character c = entry.getKey();
if((startsWith == null) || (ind>=startsWith.length()) ||
(startsWith.charAt(ind) == c)) {
DFS(v.child.get(c), s + c, list, store, startsWith, ind+1);
}
}
}
public void Print() {
DFS(root, new String(""), null, false, null, 0);
}
ArrayList<String> getAsList(String startsWith) {
ArrayList ret = new ArrayList();
DFS(root, new String(""), ret, true, startsWith, 0);
return ret;
}
int count(Vertex vertex, String s, int i) {
if(i >= s.length()) {
return vertex.words;
}
if(!vertex.child.containsKey(s.charAt(i))) {
return 0;
}
return count(vertex.child.get(s.charAt(i)), s, i+1);
}
int count(String s) {
return count(root, s, 0);
}
}
Is there a working example of a trie structure I can use?
Upvotes: 2
Views: 1820
Reputation: 1586
First of all, separate the datastructure (your trie) from any code filling it. It just needs to hold the data in a structured form, and provide some basic functionality, that's it. Filling it should happen outside that datastructure itself so you can properly handle the streams. There is not a single good reason to have your trie fill itself by giving a path as a param. To clarify my first point - pulling the filling out of the trie: currently the streams gobble up a lot of memory inside the trie because they are held in private variables and the streams are never closed or destroyed. which means you keep the the file loaded in memory on top of the filled datastructure. Otherwise garbage collection can clean up those items just like using the arraylist.
Please don't reinvent the wheel and use a basic implementation such as the following. Get it working with this basic setup, and worry about improving it later.
public class Trie {
private Map<String, Node> roots = new HashMap<>();
public Trie() {}
public Trie(List<String> argInitialWords) {
for (String word:argInitialWords) {
addWord(word);
}
}
public void addWord(String argWord) {
addWord(argWord.toCharArray());
}
public void addWord(char[] argWord) {
Node currentNode = null;
if (!roots.containsKey(Character.toString(argWord[0]))) {
roots.put(Character.toString(argWord[0]), new Node(argWord[0], "" + argWord[0]));
}
currentNode = roots.get(Character.toString(argWord[0]));
for (int i = 1; i < argWord.length; i++) {
if (currentNode.getChild(argWord[i]) == null) {
currentNode.addChild(new Node(argWord[i], currentNode.getValue() + argWord[i]));
}
currentNode = currentNode.getChild(argWord[i]);
}
currentNode.setIsWord(true);
}
public boolean containsPrefix(String argPrefix) {
return contains(argPrefix.toCharArray(), false);
}
public boolean containsWord(String argWord) {
return contains(argWord.toCharArray(), true);
}
public Node getWord(String argString) {
Node node = getNode(argString.toCharArray());
return node != null && node.isWord() ? node : null;
}
public Node getPrefix(String argString) {
return getNode(argString.toCharArray());
}
@Override
public String toString() {
return roots.toString();
}
private boolean contains(char[] argString, boolean argIsWord) {
Node node = getNode(argString);
return (node != null && node.isWord() && argIsWord) || (!argIsWord && node != null);
}
private Node getNode(char[] argString) {
Node currentNode = roots.get(Character.toString(argString[0]));
for (int i = 1; i < argString.length && currentNode != null; i++) {
currentNode = currentNode.getChild(argString[i]);
if (currentNode == null) {
return null;
}
}
return currentNode;
}
}
public class Node {
private final Character ch;
private final String value;
private Map<String, Node> children = new HashMap<>();
private boolean isValidWord;
public Node(char argChar, String argValue) {
ch = argChar;
value = argValue;
}
public boolean addChild(Node argChild) {
if (children.containsKey(Character.toString(argChild.getChar()))) {
return false;
}
children.put(Character.toString(argChild.getChar()), argChild);
return true;
}
public boolean containsChildValue(char c) {
return children.containsKey(Character.toString(c));
}
public String getValue() {
return value.toString();
}
public char getChar() {
return ch;
}
public Node getChild(char c) {
return children.get(Character.toString(c));
}
public boolean isWord() {
return isValidWord;
}
public void setIsWord(boolean argIsWord) {
isValidWord = argIsWord;
}
public String toString() {
return value;
}
}
If you are considering memory usage improvements (at the cost of performance) you can do it in the following ways (seperate or combined)
In general a well implemented trie, and I stress the well implemented should be about equal to the memory consumption of the 90Mb for the same dataset you are entering in it although it will depend entirely on the actual dataset.
If you managed to put together a list of words where most words aren't prefixes of any other word. Your memory usage will be far greater than with an ArrayList because you need way more nodes to represent the same thing.
If you really want to save some memory for a true random dataset, you should have a look at Burst tries, another viable alternative could be the patricia trie.
Upvotes: 0
Reputation: 77167
Your use of the word "space" is ambiguous. Based on your description, it sounds like you're talking about the heap. If so, the reason for the increased memory usage is that a data structure like a trie actually does take up extra memory to store its references between nodes. An ArrayList
just packs everything in, one String
reference after another, and it doesn't have any additional information beyond how long the array is. A trie has lots more bookkeeping to specify the relationships between all of the nodes.
In particular, the HashMap
in each vertex is going to be extremely expensive; the Sun implementation allocates enough space for a 16-entry map by default, and that requires storage for the map's own memory-allocation record, hashCodes
(32-bit int
s, not char
s), the object wrappers for each Character
...
Upvotes: 1