Reputation: 5126

Tokenizing source code in Java

For a systems software development course, I'm working on a complete assembler for an instructor-invented assembly language. Currently I'm working on the tokenizer. While doing some searching, I've come across the Java StringTokenizer class...but I see that it has been essentially deprecated. It seems far easier to use, however, than the String.split method with regular expressions.

Is there some reason that I should avoid using it? Is there perhaps something else within the typical Java libraries that would suit this task well that I am not aware of?

EDIT: Giving more detail.

The reason I am considering String.split complicated is that my knowledge of regular expressions is roughly that I know of them. While it would be helpful for my general knowledge as a software developer to know them, I'm not sure that I want to invest the time right now, especially if there is an easier alternative present.

In terms of my usage of the tokenizer: it will go through a text file containing assembly code and break it into tokens, passing the text and token type to a parser. Delimiters include white space (spaces, tabs, newlines), the comment-start character '|' (which can occur on its own line, or after other text), and the comma to separate operands in an instruction.

I would write that more mathematically, but my knowledge of formal languages is a bit rusty.

EDIT 2: Asking question more clearly

I have seen the documentation on the StringTokenizer class. It would have suited my purposes well, but its use is discouraged. Other than String.split, is there something within the standard Java libraries that would be helpful?

Upvotes: 4

Answers (5)

crowne

Reputation: 8534

Don't fear the regex. Get yourself a regex editor, such as the Eclipse plugin at http://brosinski.com/regex/update, and you'll be able to test your expressions without compiling, or even before writing your program.

If you need more reference material, here are some very useful sites:

That said, I think the suggestion in another answer of using JavaCC sounds like the right approach. Another option would be ANTLR.

Here's a post comparing the experience of ANTLR vs JavaCC.

Upvotes: 1

Kdeveloper

Reputation: 13819

If what you're building is an assembler, I would use JavaCC for building the parser/compiler.

Upvotes: 3

Klark

Reputation: 8280

Something is deprecated when there is a better alternative, or when it is dangerous in some situations. So the answer is: yes, you can use it, but there is a better way to achieve what you need.

By the way, what is complicated about split?
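
If you do stay with StringTokenizer, here is a rough sketch of how it could handle the delimiters from the question (whitespace, comma, and the '|' comment character). The class name LegacyTokenizerDemo, the tokenize helper, and the sample line are made up for illustration, since the real assembly syntax isn't shown. It assumes a comment runs to the end of the line and that the parser wants to see commas as tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class; not part of any library.
class LegacyTokenizerDemo {
    // Splits one source line into tokens, keeping "," and dropping whitespace and comments.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        // Third argument true: each delimiter character is returned as its own token.
        StringTokenizer st = new StringTokenizer(line, " \t\r\n,|", true);
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (token.equals("|")) {
                break;      // assumption: a comment runs to the end of the line
            }
            if (token.trim().isEmpty()) {
                continue;   // a whitespace delimiter, nothing to report
            }
            tokens.add(token);  // either "," or a run of non-delimiter characters
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Sample line is invented for the demo.
        System.out.println(tokenize("ADD r1, r2 | sum the registers"));
    }
}
```

Passing true for returnDelims is what lets the code tell a comma from a comment start; without it, all delimiters are silently discarded.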

Upvotes: 0

Tim Frey

Reputation: 9941

I believe that the java.util.Scanner class has replaced StringTokenizer. Scanner lets you handle tokens one at a time, whereas String.split() will split the entire string (which could be large, if you're parsing a source code file). Using Scanner, you can examine each token, decide what action to take, then discard that token.
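
A minimal sketch of that idea, assuming the delimiters described in the question. The class name ScannerTokenDemo, the tokenize helper, and the sample line are hypothetical. It collects tokens into a list only to make the result easy to inspect; in a real tokenizer you would probably hand each token to the parser inside the loop instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical demo class; not part of any library.
class ScannerTokenDemo {
    // Reads tokens one at a time from a single source line.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        // Assumption: '|' starts a comment that runs to the end of the line.
        int comment = line.indexOf('|');
        String code = (comment >= 0) ? line.substring(0, comment) : line;
        try (Scanner scanner = new Scanner(code)) {
            // Treat any run of whitespace and commas as a single delimiter.
            scanner.useDelimiter("[\\s,]+");
            while (scanner.hasNext()) {
                tokens.add(scanner.next());  // examine the token here, then move on
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Sample line is invented for the demo.
        System.out.println(tokenize("loop: ADD r1, r2 | add registers"));
    }
}
```

Scanner also has constructors that take a File or a Readable, so the same loop can walk a whole source file without loading it into one string first.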

Upvotes: 3

Zak

Reputation: 25205

From the documentation:

StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.

The following example illustrates how the String.split method can be used to break up a string into its basic tokens:

     String[] result = "this is a test".split("\\s");
     for (int x=0; x<result.length; x++)
         System.out.println(result[x]);

prints the following output:

     this
     is
     a
     test
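
For the delimiters in the question, the regular expression does not need to be much more complicated than the one in the documentation. The following is only a sketch: SplitTokenDemo and tokenize are made-up names, the sample line is invented, and it assumes that commas can be discarded along with whitespace and that a '|' comment runs to the end of the line.

```java
import java.util.Arrays;

// Hypothetical demo class; not part of any library.
class SplitTokenDemo {
    // Splits one source line on runs of whitespace and commas, ignoring any comment.
    static String[] tokenize(String line) {
        // Assumption: '|' starts a comment that runs to the end of the line.
        int comment = line.indexOf('|');
        // trim() avoids a leading empty string when the line starts with whitespace.
        String code = (comment >= 0 ? line.substring(0, comment) : line).trim();
        if (code.isEmpty()) {
            return new String[0];  // "".split(...) would return one empty string
        }
        // [\s,]+ means "one or more whitespace characters or commas".
        return code.split("[\\s,]+");
    }

    public static void main(String[] args) {
        // Sample line is invented for the demo.
        System.out.println(Arrays.toString(tokenize("  ADD r1, r2 | sum the registers")));
    }
}
```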

Upvotes: 2
