user3250889

Reputation: 191

Split up Java code into Tokens

I need to split Java code into individual tokens, where a token is a piece of code whose meaning does not depend on the spaces around it.

For instance, the following Java code:

if (method(a, b).equals("C, C++, Java"))

would be split into:

['if', '(', 'method', '(', 'a', ',', 'b', ')', '.', 'equals', '(', '"C, C++, Java"', ')', ')'] 

Basically, I need a list of tokens that can be (un)padded with spaces without changing the execution of the code. If I take the previous code example, I can add and remove spaces around the tokens to form something like:

if   (method    ( a,b)   . equals   ( "C, C++, Java")       )

and I would still get the same result.

I'm guessing this is only possible through some external library, but I'm not aware of any.

Upvotes: 2

Views: 428

Answers (2)

GhostCat

Reputation: 140427

The thing is: in the end, any external library is itself built on the standard Java libraries. So of course you can sit down and write your own Java parser from the ground up.

But the real answer is: unless this is for a school project, don't reinvent the wheel. Building a parser and tokenizer is a very valuable lesson for programmers, but it is also quite a lot of work. Chances are it will cost you days (probably weeks), even when following the approach given in the other answer (relying on parts of existing technology).

So if you are asking how to do this efficiently, look for existing Java parsers, for example JavaParser. In the real world, requirements change and evolve quickly. Today you are asked to solve the simple problem outlined in the question, but it is likely that more and more ideas about what the tool should do will come up. Sooner or later, nothing but a full-fledged parser will do. So why not start with one in the first place?

Upvotes: 0

wumpz

Reputation: 9131

Parser generators like ANTLR or JavaCC have complete Java grammars as examples. You could reuse the tokenizer (lexer) generated from one of those grammars to achieve your goal.

You could also achieve some kind of tokenizing using regular expressions, but the result would not be 100 percent correct Java tokens.
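To illustrate the regular-expression route, here is a rough sketch using only java.util.regex. The class name RegexTokenizer and the exact pattern are made up for this example, and it is deliberately incomplete: it does not handle comments, text blocks, Unicode escapes, numeric literals with signed exponents (1e-5), or nested generics such as List<List<String>> (where >> comes out as a single token).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library: an approximate Java tokenizer.
class RegexTokenizer {

    // Alternatives are tried left to right, so longer operators come first.
    private static final Pattern TOKEN = Pattern.compile(
          "\"(?:\\\\.|[^\"\\\\])*\""        // string literal, honoring backslash escapes
        + "|'(?:\\\\.|[^'\\\\])*'"           // char literal
        + "|[A-Za-z_$][A-Za-z0-9_$]*"        // identifier or keyword
        + "|\\d[\\w.]*"                      // rough numeric literal (1, 0xFF, 1.5f, 1_000)
        + "|>>>=|<<=|>>=|>>>"                // 3- and 4-character operators
        + "|\\+\\+|--|&&|\\|\\||==|!=|<=|>=|->|::|<<|>>"  // 2-character operators
        + "|[-+*/%&|^]="                     // compound assignments such as += and ^=
        + "|\\S"                             // any other single non-whitespace character
    );

    // Returns the tokens in source order; whitespace between tokens is skipped.
    static List<String> tokenize(String code) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(code);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints the token list for the example from the question.
        System.out.println(tokenize("if (method(a, b).equals(\"C, C++, Java\"))"));
    }
}
```

For the line from the question this gives the 14 tokens listed there, and the padded variant tokenizes to the same list. For anything beyond simple snippets, a lexer generated from a real Java grammar is the safer choice.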

Upvotes: 1
