Reputation: 35

Extract variables from code statement using regex

I'm trying to extract variables from code statements and "if" conditions. I have a regex to do that, but mymatcher.find() doesn't return any matches. I don't know what is wrong.

Here is my code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class test {
    public static void main(String[] args) {
        String test="x=y+z/n-10+my5th_integer+201";
        Pattern mypattern = Pattern.compile("^[a-zA-Z_$][a-zA-Z_$0-9]*$");
        Matcher mymatcher = mypattern.matcher(test);    
        while (mymatcher.find()) {
            String find = mymatcher.group(1) ;
            System.out.println("variable:" + find);
        }
    }
}

Upvotes: 2

Answers (2)

Ira Baxter

Reputation: 95392

Usually processing source code with just a regex simply fails.

If all you want to do is pick out identifiers (we discuss variables further below) you have some chance with regular expressions (after all, this is how lexers are built).

But you probably need a much more sophisticated version than what you have, even with corrections as suggested by other authors.

A first problem is that if you allow arbitrary statements, they often contain keywords that look like identifiers. In your specific example, "if" looks like an identifier. So either your matcher has to recognize identifier-like substrings and then subtract known keywords, or the regex itself must express the idea that an identifier has a basic shape but cannot look like one of a specific list of keywords. (The latter is called a subtractive regex, and isn't found in most regex engines. It looks something like:

 [a-zA-Z_$][a-zA-Z_$0-9]* - (if | else | class | ... )

Our DMS lexer generator [see my bio] has subtractive regex because this is extremely useful in language-lexing).
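Without a subtractive regex, the "match, then subtract" option can be sketched in plain Java as follows. This is only a minimal illustration: the class name KeywordFilter and the keyword list are assumptions made for the example, and the list is deliberately incomplete.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeywordFilter {
    // Deliberately incomplete list of keywords and literals, for illustration only.
    private static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
            "if", "else", "while", "for", "return", "class", "new", "true", "false", "null"));

    // The lookbehind stops a match from starting in the middle of a token such as "5th".
    private static final Pattern IDENTIFIER_LIKE =
            Pattern.compile("(?<![a-zA-Z_$0-9])[a-zA-Z_$][a-zA-Z_$0-9]*");

    // Returns every identifier-shaped substring that is not in KEYWORDS.
    static List<String> identifiers(String statement) {
        List<String> found = new ArrayList<>();
        Matcher m = IDENTIFIER_LIKE.matcher(statement);
        while (m.find()) {
            String word = m.group();
            if (!KEYWORDS.contains(word)) {
                found.add(word);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // prints [x, y, z, x, z, y]
        System.out.println(identifiers("if (x > y) z = x; else z = y;"));
    }
}
```

This still treats every keyword as a keyword everywhere, so it does nothing for the context-dependent case.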

This gets more complex if the "keywords" are not always keywords, that is, they are keywords only in certain contexts. Java's var is one such word: where a local variable's type is expected it acts as a keyword, yet it can still be used as an ordinary identifier, for example as a variable name; C# is similar. Now the only way to know whether a purported identifier is a keyword is to actually parse the code (which is how you detect the context that controls its keyword-ness).

Next, identifiers in Java allow a variety of Unicode characters (Latin-1, Cyrillic, Chinese, ...). A regex that recognizes them, accounting for all the characters, is a lot bigger than the simple "A-Z" style you propose.
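In Java specifically, you do not have to spell the character ranges out by hand: java.util.regex.Pattern accepts property classes that delegate to Character.isJavaIdentifierStart and Character.isJavaIdentifierPart. A minimal sketch, assuming a hypothetical class name UnicodeIdentifiers (keywords, strings and comments are still not handled):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeIdentifiers {
    // javaJavaIdentifierStart / javaJavaIdentifierPart map to the
    // Character.isJavaIdentifierStart / isJavaIdentifierPart methods.
    private static final Pattern IDENTIFIER = Pattern.compile(
            "(?<!\\p{javaJavaIdentifierPart})"
            + "\\p{javaJavaIdentifierStart}\\p{javaJavaIdentifierPart}*");

    // Returns every substring shaped like a Java identifier, in any script.
    static List<String> identifiers(String statement) {
        List<String> found = new ArrayList<>();
        Matcher m = IDENTIFIER.matcher(statement);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // The German and Russian names are spelled with escapes so that the
        // file compiles regardless of the source encoding.
        String statement = "gr\u00f6\u00dfe = \u0447\u0438\u0441\u043b\u043e + x1 * 2";
        System.out.println(identifiers(statement));
    }
}
```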

For Java, you need to defend against string literals containing what appear to be variable names. Consider the (funny-looking but valid) statement:

a =  "x=y+z/n-10+my5th_integer+201";

There is only one identifier here. A similar problem occurs with comments whose content looks like statements:

/* Tricky:
   a =  "x=y+z/n-10+my5th_integer+201";
*/
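One common way to defend against both cases with a regex is to match string literals and comments as whole tokens first and only collect what the identifier alternative captures. The sketch below assumes a hypothetical class name LiteralAwareScanner; it ignores text blocks, Unicode escapes and keywords, so it is an illustration rather than a real lexer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LiteralAwareScanner {
    // Alternatives are tried in order at each position, so a quote or a
    // comment opener swallows its whole token before identifiers are tried.
    private static final Pattern TOKEN = Pattern.compile(
            "\"(?:\\\\.|[^\"\\\\])*\""        // string literal
            + "|'(?:\\\\.|[^'\\\\])*'"        // character literal
            + "|/\\*.*?\\*/"                  // block comment
            + "|//[^\\n]*"                    // line comment
            + "|(?<![A-Za-z_$0-9])([A-Za-z_$][A-Za-z_$0-9]*)", // identifier: group 1
            Pattern.DOTALL);

    // Returns identifiers that appear outside string literals and comments.
    static List<String> identifiers(String source) {
        List<String> found = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            if (m.group(1) != null) {   // null when a literal or comment matched
                found.add(m.group(1));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // prints [a]
        System.out.println(identifiers("a =  \"x=y+z/n-10+my5th_integer+201\";"));
    }
}
```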

For Java, you need to worry about Unicode escapes, too. Consider this valid Java statement:

\u0061 = \u0062; //  means  "a=b;"

or nastier:

a\u006bc = 1; //  means "akc=1;" not "abc=1;"!

Pushing this, without Unicode character decoding, you might not even notice a string. The following is a variant of the above:

a =  \u0022x=y+z/n-10+my5th_integer+201";

To extract identifiers correctly, you need to build (or use) the equivalent of a full Java lexer, not just a simple regex match.
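As an illustration of the Unicode-escape step alone, here is a minimal sketch of decoding \uXXXX sequences before any identifier matching is attempted. The class name UnicodeEscapes is hypothetical, and the sketch assumes that every backslash followed by "u" starts an escape; the Java language rules additionally require the backslash to be preceded by an even number of backslashes, which is ignored here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeEscapes {
    // A backslash, one or more 'u' characters, then exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u+([0-9a-fA-F]{4})");

    // Replaces each escape with the UTF-16 code unit it denotes.
    static String decode(String source) {
        Matcher m = ESCAPE.matcher(source);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement keeps a decoded '$' or backslash literal.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The doubled backslashes put a literal backslash into each string.
        System.out.println(decode("\\u0061 = \\u0062;"));   // prints a = b;
        System.out.println(decode("a\\u006bc = 1;"));       // prints akc = 1;
    }
}
```

Only after this decoding does it make sense to look for strings, comments and identifiers in the text.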

If you only need to be right most of the time, you can try your regex. Usually regex-applied-to-source-code-parsing ends badly, partly because of the above problems (e.g., oversimplification).

You are lucky in that you are trying to do this for Java. If you had to do this for C#, a very similar language, you'd have to handle interpolated strings, which allow expressions inside strings. The expressions themselves can contain strings... it's turtles all the way down. Consider the C# (version 6) statement:

a  = $"x+{y*$"z=${c /* p=q */}"[2]}*q" + b;

This contains the identifiers a, b, c and y. Every other "identifier" is actually just a string or comment character. PHP has similar interpolated strings.

To extract identifiers from this, you need something that understands the nesting of string elements. Lexers usually don't do recursion (our DMS lexers handle this, for precisely this reason), so to process this correctly you usually need a parser, or at least something that tracks nesting.
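To make "something that tracks nesting" concrete, here is a rough Java sketch of a recursive scanner for the C# statement above. It is written under strong assumptions: the class name NestedInterpolationScanner is hypothetical, and it ignores verbatim strings, character literals, format specifiers inside holes, and keywords. It only shows that code and string text have to call each other recursively.

```java
import java.util.ArrayList;
import java.util.List;

class NestedInterpolationScanner {
    private final String src;
    private final List<String> identifiers = new ArrayList<>();
    private int pos = 0;

    private NestedInterpolationScanner(String src) { this.src = src; }

    static List<String> identifiers(String csharpStatement) {
        NestedInterpolationScanner s = new NestedInterpolationScanner(csharpStatement);
        s.scanCode(false);
        return s.identifiers;
    }

    private boolean at(String text) { return src.startsWith(text, pos); }

    // Scans code. Inside a hole, returns at the '}' that closes the hole.
    private void scanCode(boolean insideHole) {
        int braceDepth = 0;
        while (pos < src.length()) {
            char c = src.charAt(pos);
            if (at("$\"")) { pos += 2; scanInterpolatedText(); }
            else if (c == '"') { pos++; skipPlainString(); }
            else if (at("/*")) { int e = src.indexOf("*/", pos + 2); pos = e < 0 ? src.length() : e + 2; }
            else if (at("//")) { int e = src.indexOf('\n', pos); pos = e < 0 ? src.length() : e; }
            else if (Character.isLetter(c) || c == '_') {
                int start = pos;
                while (pos < src.length()
                        && (Character.isLetterOrDigit(src.charAt(pos)) || src.charAt(pos) == '_')) pos++;
                identifiers.add(src.substring(start, pos));
            }
            else if (Character.isDigit(c)) {   // number literal, including suffixes
                while (pos < src.length() && Character.isLetterOrDigit(src.charAt(pos))) pos++;
            }
            else if (c == '{') { braceDepth++; pos++; }
            else if (c == '}') {
                if (insideHole && braceDepth == 0) return;
                braceDepth--; pos++;
            }
            else pos++;
        }
    }

    // Scans the text of an interpolated string; recurses into each {hole}.
    private void scanInterpolatedText() {
        while (pos < src.length()) {
            char c = src.charAt(pos);
            if (at("{{") || at("}}") || c == '\\') pos += 2;    // escaped brace or character
            else if (c == '{') { pos++; scanCode(true); pos++; } // hole, then its closing '}'
            else if (c == '"') { pos++; return; }                // end of this string
            else pos++;
        }
    }

    // Skips an ordinary string literal whose opening quote was already consumed.
    private void skipPlainString() {
        while (pos < src.length()) {
            char c = src.charAt(pos);
            if (c == '\\') pos += 2;
            else { pos++; if (c == '"') return; }
        }
    }

    public static void main(String[] args) {
        // prints [a, y, c, b]
        System.out.println(identifiers("a  = $\"x+{y*$\"z=${c /* p=q */}\"[2]}*q\" + b;"));
    }
}
```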

You have one other issue: do you want to extract just variable names? What if the identifier represents a method, type, class or package? You can't figure this out without having a full parser and full Java name and type resolution, and you have to do this in the context in which the statement is found. You'd be amazed how much code it takes to do this right.

So, if your goals are simpleminded and you don't care if it handles these complications, you can get by with a simple regex to pick out things that look like identifiers.

If you want to do it well (e.g., use this in some production code), the single regex will be a total disaster. You'll spend your life explaining to users what they cannot type, and that never works.

Summary: because of all the complications, usually processing source code with just a regex simply fails. People keep re-learning this lesson. It is one of the key reasons that lexer generators are widely used in language processing tools.

Upvotes: 3

Wiktor Stribiżew

Reputation: 627082

You need to remove the ^ and $ anchors, which assert positions at the start and end of the string respectively, and use mymatcher.group(0) instead of mymatcher.group(1), because you do not have any capturing groups in your regex:

String test="x=y+z/n-10+my5th_integer+201";
Pattern mypattern = Pattern.compile("[a-zA-Z_$][a-zA-Z_$0-9]*");
Matcher mymatcher = mypattern.matcher(test);    
while (mymatcher.find()) {
    String find = mymatcher.group(0) ;
    System.out.println("variable:" + find);
}

See the IDEONE demo; the results are:

variable:x
variable:y
variable:z
variable:n
variable:my5th_integer

Upvotes: 3
