Andy M

Reputation: 119

RegEx with Curly Brackets with an extra twist

I am relatively new to regular expressions, and I am not yet sure they are the right way to solve my problem, but here goes.

I have a text file that might include something like this:

program A {
   int x = 10;
   tuple date {
            int day;
            int month;
            int year;
   }
}

function B {
    int y = 20;
    ...
}

process C {
    more code;
}

I need to extract the text that belongs to each program, function, or process block. These are the only 3 types of header.

So I decided to use a regular expression to get the text between the curly brackets. I started with this expression, assuming I know the list of identifiers beforehand:

(program|function|process)+ A[\s\S]*(?=function)

This works perfectly to capture the text in program A. But sometimes program A might not be followed by a function; it could be followed by a process or another program. Once I add an OR to the lookahead at the end, it no longer works correctly:

(program|function|process)+ A[\s\S]*(?=function|process|program)

The way I see it, I have 3 options:

  1. Use regular expressions, but is the above doable?
  2. Keep track of the curly brackets, but what if the input is missing one? It might be difficult to raise an error if the matching bracket is found in another block of code.
  3. Use a context-free grammar, but I am leaving this option for last.

Thanks in advance!

PS: I used this to help with RegExpr: http://gskinner.com/RegExr/?33i30

Upvotes: 1

Views: 275

Answers (3)

Terry Li

Reputation: 17268

If you prefer a regex solution, try this:

/(program|function|process).*?{(.*?)}\n+(program|function|process)/m

You may want to test it here.

The regex solution is not robust for your problem, though. We have to make some assumptions before using it; for example, the code needs to be well formatted. Play with it in case it gives you a workaround.

Update: here's the tested Java code:

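import java.io.File;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.commons.io.FileUtils;

public class Test {
    public static void main(String[] args) throws IOException {
        String input = FileUtils.readFileToString(new File("input.txt"));
        // DOTALL lets (.*?) span lines; a block ends at the "}" followed by the next header or end of input
        Pattern p = Pattern.compile("(?<=program|function|process)[^{]*\\{(.*?)\\}\\s*(?=program|function|process|$)", Pattern.DOTALL);
        Matcher m = p.matcher(input);
        while (m.find()) {
            // group(1) is the body of one block, assuming well-formatted input
            System.out.println(m.group(1));
        }
    }
}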
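
As for why the pattern in the question breaks once the lookahead gets an OR: `[\s\S]*` is greedy, so it runs to the last place where the lookahead can still match. With only `function` in the lookahead there happens to be a single candidate after program A; with all three keywords the last candidate is `process C`, so block B is swallowed as well. A lazy `[\s\S]*?` stops at the first candidate instead. The sketch below is only an illustration: the class name `GreedyVsLazy`, the helper `firstMatch`, and the shortened input are made up for the demo.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsLazy {
    // Shortened version of the sample input from the question (assumption for the demo).
    static final String INPUT =
            "program A {\n   int x = 10;\n}\n\n"
          + "function B {\n    int y = 20;\n}\n\n"
          + "process C {\n    more code;\n}\n";

    // Hypothetical helper: returns the first match of the regex in INPUT, or null if there is none.
    static String firstMatch(String regex) {
        Matcher m = Pattern.compile(regex).matcher(INPUT);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Greedy: the match extends up to the LAST keyword, so block B is included.
        String greedy = firstMatch("(program|function|process) A[\\s\\S]*(?=function|process|program)");
        // Lazy: the match stops before the FIRST keyword that follows block A.
        String lazy = firstMatch("(program|function|process) A[\\s\\S]*?(?=function|process|program)");

        System.out.println(greedy.contains("function B")); // prints true
        System.out.println(lazy.contains("function B"));   // prints false
    }
}
```

The lazy version still has limits: it stops early if one of the keywords appears inside a block body, and it cannot match the last block in the file unless `$` is added to the lookahead.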

Upvotes: 1

Michał Ziober

Reputation: 38645

If you really don't want to use a grammar, you could implement a simple parser that analyzes the file line by line. Please see my example:

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.regex.Pattern;

import org.apache.commons.io.IOUtils;

public class SourceCodeProgram {

    public static void main(String[] args) throws Exception {
        File source = new File("C:\\test.txt");
        SourceCodeScanner scanner = new SourceCodeScanner(source);
        for (Code code : scanner.readAll()) {
            System.out.println(code);
            System.out.println("-----------");
        }
    }
}

class SourceCodeScanner {

    private File source;

    // \b keeps body lines such as "processing..." from being mistaken for a header
    private Pattern startCodePattern = Pattern.compile(
            "^\\s*(program|function|process)\\b", Pattern.CASE_INSENSITIVE);

    public SourceCodeScanner(File source) {
        this.source = source;
    }

    public Collection<Code> readAll() throws Exception {
        List<String> lines = readFileLineByLine();
        List<Code> codes = new ArrayList<Code>();
        StringBuilder builder = new StringBuilder(512);

        for (String line : lines) {
            if (containsSourceCodeHeader(line)) {
                int length = builder.length();
                if (length != 0) {
                    codes.add(new Code(builder.toString().trim()));
                    builder.delete(0, length);
                }
            }
            addNextLineOfSourceCode(builder, line);
        }
        String lastCode = builder.toString();
        if (containsSourceCodeHeader(lastCode)) {
            codes.add(new Code(builder.toString().trim()));
        }
        return codes;
    }

    private boolean containsSourceCodeHeader(String line) {
        return startCodePattern.matcher(line).find();
    }

    private void addNextLineOfSourceCode(StringBuilder builder, String line) {
        builder.append(line);
        builder.append(IOUtils.LINE_SEPARATOR);
    }

    private List<String> readFileLineByLine() throws Exception {
        // try-with-resources closes the stream even if reading fails
        try (FileInputStream fileInputStream = new FileInputStream(source)) {
            return IOUtils.readLines(new BufferedInputStream(fileInputStream));
        }
    }
}

class Code {
    private String value;

    public Code(String value) {
        this.value = value;
    }

    public String getValue() {
        return value;
    }

    @Override
    public String toString() {
        return value;
    }
}

Upvotes: 1

Has QUIT--Anony-Mousse

Reputation: 77454

You should consider using an LL parser instead of a regexp for this. Regular expressions are NOT the proper answer to every parsing need; they only fit regular languages, and arbitrarily nested curly brackets are not a regular language. If you have a context-free grammar, use an LL parser.

https://en.wikipedia.org/wiki/LL_parser
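
To make that concrete, here is a minimal sketch of a hand-written recursive-descent parser, which is the usual way to write an LL(1) parser by hand. The grammar is my assumption based on the sample in the question: a file is a sequence of blocks, a block is a keyword, a name, and a brace-delimited body, and a body may contain nested brace pairs. The class name `BlockParser` and all of its methods are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class BlockParser {
    private final String src;
    private int pos = 0;

    private BlockParser(String src) { this.src = src; }

    // file := block*
    static List<String> parse(String src) {
        BlockParser p = new BlockParser(src);
        List<String> bodies = new ArrayList<>();
        p.skipWhitespace();
        while (p.pos < src.length()) {
            bodies.add(p.block());
            p.skipWhitespace();
        }
        return bodies;
    }

    // block := ("program" | "function" | "process") name "{" body "}"
    private String block() {
        int headerStart = pos;
        String keyword = word();
        if (!keyword.equals("program") && !keyword.equals("function") && !keyword.equals("process")) {
            throw new IllegalStateException("Expected a block header at offset " + headerStart);
        }
        skipWhitespace();
        word(); // block name, e.g. A
        skipWhitespace();
        expect('{');
        int start = pos;
        body();
        int end = pos;
        expect('}');
        return src.substring(start, end).trim();
    }

    // body := (any non-brace character | "{" body "}")*
    private void body() {
        while (pos < src.length() && src.charAt(pos) != '}') {
            if (src.charAt(pos) == '{') {
                pos++;       // consume the nested "{"
                body();      // recurse into the nested block
                expect('}'); // the nested block must be closed
            } else {
                pos++;
            }
        }
    }

    // Reads a run of letters and digits; returns "" if there is none.
    private String word() {
        int start = pos;
        while (pos < src.length() && Character.isLetterOrDigit(src.charAt(pos))) pos++;
        return src.substring(start, pos);
    }

    private void skipWhitespace() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++;
    }

    private void expect(char c) {
        if (pos >= src.length() || src.charAt(pos) != c) {
            throw new IllegalStateException("Expected '" + c + "' at offset " + pos);
        }
        pos++;
    }

    public static void main(String[] args) {
        String input = "program A {\n   int x = 10;\n   tuple date {\n      int day;\n   }\n}\n\n"
                     + "process C {\n    more code;\n}\n";
        for (String body : parse(input)) {
            System.out.println(body);
            System.out.println("-----------");
        }
    }
}
```

Because nesting is handled by recursion, the inner `}` of `tuple date` does not end program A. A missing `}` is reported as an error, but in this sketch only at the end of the input, since the following blocks get read as nested ones; a stricter grammar could reject a header keyword inside a body to report it earlier.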

Upvotes: 4
