Jimmy Huch

Reputation: 4540

How to treat file contents as String

I am creating a Scrabble game that uses a dictionary. For efficiency, instead of loading the entire dictionary (a txt file) into a data structure (Set, List, etc.), is there any built-in Java class that lets me treat the contents of the file as a String?

Specifically, I want to check whether a word made in the game is a valid dictionary word by doing something simple like fileName.contains(word), instead of holding a huge, memory-inefficient list and calling list.contains(word).

Do you have any ideas on what I might be able to do? If the dictionary has to be in something other than a txt file (e.g. an xml file), I am open to trying that as well.

NOTE: I am not looking for http://commons.apache.org/io/api-1.4/org/apache/commons/io/FileUtils.html#readFileToString%28java.io.File%29

That method is not part of the standard Java API.

HashSet didn't come to mind; I was stuck on the idea that all contains() methods take O(n) time. Thanks to Bozho for clearing that up for me. It looks like I will be using a HashSet.

Upvotes: 1

Views: 651

Answers (4)

user166390

Reputation:

While a HashSet is likely a perfectly acceptable solution (see Bozho's answer), there are other data structures that can be used, including a Trie or a Heap.

The advantage a Trie has is that, depending upon implementation details, the starting prefix letters can be shared (a trie is also called a "prefix tree", after all). Depending upon implementation structure and data, this may or may not actually be an improvement.
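To make the prefix-sharing idea concrete, here is a minimal in-memory Trie sketch for lowercase words (the class and method names are illustrative, not a standard API, and it assumes a Java 8+ JDK):

    import java.util.HashMap;
    import java.util.Map;

    // Minimal in-memory trie for lowercase words; names here are illustrative only.
    class Trie {
        private static final class Node {
            final Map<Character, Node> children = new HashMap<>();
            boolean endOfWord;
        }

        private final Node root = new Node();

        void add(String word) {
            Node node = root;
            for (char c : word.toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new Node());
            }
            node.endOfWord = true;
        }

        boolean contains(String word) {
            Node node = root;
            for (char c : word.toCharArray()) {
                node = node.children.get(c);
                if (node == null) {
                    return false; // no branch for this letter, so the word is absent
                }
            }
            return node.endOfWord; // "the" counts only if a word actually ends here
        }
    }

add() would be called once per dictionary word at load time; contains() then walks at most word.length() nodes.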

Another option, especially if file-based access is desired, is to use a Heap -- Java's PriorityQueue is actually a heap, but it is not file-based, so this would require finding/making an implementation.

All of these data structures (and more) can be implemented to be file-based (using more I/O per lookup -- which could actually be less overall -- but saving memory) or implemented directly (e.g. use SQLite and let it do its B-Tree thing). SQLite excels in that it can be a "common tool" (once used commonly ;-) in a toolbox; data importing, inspection, and modification are easy, and "it just works". SQLite is even used on less powerful systems such as Android.
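As a rough sketch of the SQLite route, assuming the third-party sqlite-jdbc driver is on the classpath and the words have already been imported into a table created with CREATE TABLE words (word TEXT PRIMARY KEY) in a file named dictionary.db (the file and table names are assumptions):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    // Assumes the sqlite-jdbc driver is on the classpath and a database file
    // "dictionary.db" containing: CREATE TABLE words (word TEXT PRIMARY KEY)
    class SqliteDictionary {
        static boolean isWord(String word) throws SQLException {
            try (Connection conn = DriverManager.getConnection("jdbc:sqlite:dictionary.db");
                 PreparedStatement ps = conn.prepareStatement("SELECT 1 FROM words WHERE word = ?")) {
                ps.setString(1, word.toLowerCase());
                try (ResultSet rs = ps.executeQuery()) {
                    return rs.next(); // a row comes back only if the word exists
                }
            }
        }
    }

In a real game you would keep one Connection open rather than reopening it per lookup; opening per call here just keeps the sketch self-contained.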

HashSet comes "for free" with Java, but there is no standard Trie or file-based Heap implementation. I would start with a HashSet - Reasoning:

  1. Dictionary = 5MB.
  2. Loaded into HashSet (assuming lots of overhead) = 20MB.
  3. Memory usage in relation to other things = Minimal (assumes laptop/desktop)
  4. Time to implement with HashSet = 2 Minutes.
  5. I will have only "lost" 2 Minutes if I decide a HashSet wasn't good enough :-)

Happy coding.


Links to random data-structure implementations (may or may not be suitable):

  • TernarySearchTrie Reads in a flat file (must be specially constructed?)
  • TrieTree Has support for creating the Trie file from a flat file. Not sure if this Trie works from disk.
  • FileHash Hash which uses a file backing.
  • HashStore Another disk-based hash
  • WB B-Tree Simple B-tree implementation / "database"
  • SQLite Small embedded RDBMS.
  • UTF8String Can be used to significantly reduce the memory requirements of using HashSet<String> when using a Latin dictionary. (String in Java uses UTF-16 encoding, which is a minimum of two bytes per character.) A rough sketch of this idea follows the list.
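Sketching that last point without any particular library: wrap each word's UTF-8 bytes in a small key class (the name CompactWord is made up here) and store those in the HashSet instead of Strings:

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    // Stores a word as raw UTF-8 bytes (~1 byte/char for Latin text) instead of a
    // UTF-16 String (2 bytes/char). The class name is illustrative, not a library type.
    final class CompactWord {
        private final byte[] bytes;

        CompactWord(String word) {
            this.bytes = word.getBytes(StandardCharsets.UTF_8);
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof CompactWord && Arrays.equals(bytes, ((CompactWord) o).bytes);
        }

        @Override
        public int hashCode() {
            return Arrays.hashCode(bytes);
        }
    }

Usage mirrors Set<String>: dictionary.add(new CompactWord(line)) at load time and dictionary.contains(new CompactWord(word)) at lookup time.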

Upvotes: 3

Bozho

Reputation: 597124

I think your best option is to load all the words into memory, in a HashSet. There, contains(word) is O(1).
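A minimal sketch of that approach, assuming a plain-text dictionary with one word per line (the file name and class name are just placeholders):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;

    class Dictionary {
        // Loads one word per line into a HashSet; lookups afterwards are O(1).
        static Set<String> load(String path) throws IOException {
            Set<String> words = new HashSet<>();
            try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    words.add(line.trim().toLowerCase()); // normalize so lookups are case-insensitive
                }
            }
            return words;
        }
    }

Then Set<String> dictionary = Dictionary.load("dictionary.txt") once at startup, and dictionary.contains(word.toLowerCase()) is the O(1) check.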

If you are fine with having it in memory, keeping it as one big String and calling contains(..) on it is much less efficient than a HashSet.

And I have to mention another option - there's a data structure for representing dictionaries called a Trie. You won't find an implementation in the JDK, though.

A very rough calculation says that with all English words (1 million) you will need ~12 megabytes of RAM, which is a few times less than the default memory settings of the JVM (1 million words * 6 letters on average * 2 bytes per letter = 12 million bytes, which is ~12 megabytes). (Well, perhaps a bit more to store the hashes.)

If you really insist on not reading it into memory and you want to scan the file for a given word, you can use a java.util.Scanner and its scanner.findWithinHorizon(..). But that would be inefficient - I assume O(n), plus the I/O overhead.
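For completeness, a rough sketch of that file-scanning approach (the file name is a placeholder; the \b word boundaries keep "cat" from matching inside "catalog"):

    import java.io.File;
    import java.io.FileNotFoundException;
    import java.util.Scanner;
    import java.util.regex.Pattern;

    class FileScanCheck {
        // Re-reads the file on every call - the O(n) + I/O cost mentioned above.
        static boolean isWord(String word) throws FileNotFoundException {
            try (Scanner scanner = new Scanner(new File("dictionary.txt"))) {
                // Horizon 0 means "search to the end of the input".
                String pattern = "\\b" + Pattern.quote(word) + "\\b";
                return scanner.findWithinHorizon(pattern, 0) != null;
            }
        }
    }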

Upvotes: 7

mut1na

Reputation: 795

You need to compress your data to avoid storing every word in full. The way to do so would be a tree in which the nodes are letters and the leaves mark the end of a word. This way you're not storing repetitive data: "the", "there" and "these" all share the same prefix.

There is a way to make this solution even more memory efficient. (Hint: letter order)

Upvotes: 1

Hyperboreus

Reputation: 32429

Use the readLine() method of java.io.BufferedReader. That returns a String.

String line = new BufferedReader(new FileReader(file)).readLine();

Upvotes: -1
