Reputation: 119
I am relatively new to regular expressions, and I am not sure they are the right way to solve my problem, but here goes.
I have a text file that might contain something like this:
program A {
int x = 10;
tuple date {
int day;
int month;
int year;
}
}
function B {
int y = 20;
...
}
process C {
more code;
}
I need to extract the text inside each program, function, or process block. These are the only 3 types of header.
So I decided to use a regular expression to get the text between the curly brackets. I started with this expression, assuming I know the list of identifiers beforehand:
(program|function|process)+ A[\s\S]*(?=function)
This works perfectly for capturing the text in program A. But program A is not always followed by a function; it could be followed by a process or another program. Once I add an OR to my last group, it no longer works correctly:
(program|function|process)+ A[\s\S]*(?=function|process|program)
The way I see it is through 3 options:
Thanks in advance!
PS: I used this to help with RegExpr: http://gskinner.com/RegExr/?33i30
Upvotes: 1
Views: 275
Reputation: 17268
If you prefer a regex solution, try this:
/(program|function|process).*?{(.*?)}\n+(program|function|process)/m
You may want to test it here.
The regex solution is not robust for your problem, though. It only works under some assumptions; for example, the code needs to be well formatted. Try it anyway, in case it gives you a usable workaround.
Update: here's the tested Java code:
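import java.io.File;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.commons.io.FileUtils;

public class Test {
    public static void main(String[] args) throws IOException {
        String input = FileUtils.readFileToString(new File("input.txt"));
        // DOTALL lets '.' match newlines; group 1 is the text between the outer braces
        Pattern p = Pattern.compile("(?<=program|function|process)[^{]*\\{(.*?)\\}\\s*(?=program|function|process|$)", Pattern.DOTALL);
        Matcher m = p.matcher(input);
        while (m.find()) {
            System.out.println(m.group(1));
        }
    }
}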
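The pattern from the question fails for a different reason than the missing alternatives: `[\s\S]*` is greedy, so it runs to the last keyword it can find rather than the first. The following is a minimal sketch of that difference, not part of the tested code above. The class name `GreedyVsLazy`, the helper `firstMatch`, and the shortened sample input are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsLazy {
    // Pattern from the question: the greedy [\s\S]* backtracks from the end of
    // the input, so it stops before the LAST keyword it can find.
    static final Pattern GREEDY = Pattern.compile(
            "(program|function|process)+ A[\\s\\S]*(?=function|process|program)");

    // Same pattern with a lazy quantifier: it grows one character at a time,
    // so it stops before the FIRST keyword after the header.
    static final Pattern LAZY = Pattern.compile(
            "(program|function|process)+ A[\\s\\S]*?(?=function|process|program)");

    // Hypothetical helper: returns the first match of p in input, or null.
    static String firstMatch(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Shortened version of the sample file from the question (assumed).
        String input = "program A {\n  int x = 10;\n}\n"
                + "function B {\n  int y = 20;\n}\n"
                + "process C {\n  more code;\n}\n";

        System.out.println("Greedy match:\n" + firstMatch(GREEDY, input));
        System.out.println("Lazy match:\n" + firstMatch(LAZY, input));
    }
}
```

With this input the greedy match swallows function B as well, while the lazy match ends right before it. The lazy version still breaks if one of the keywords appears inside a block body, for example in an identifier such as `processId`.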
Upvotes: 1
Reputation: 38645
If you really don't want to use a grammar, you could implement a simple parser that analyzes the file line by line.
Please see my example:
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.regex.Pattern;
import org.apache.commons.io.IOUtils;
public class SourceCodeProgram {
    public static void main(String[] args) throws Exception {
        File source = new File("C:\\test.txt");
        SourceCodeScanner scanner = new SourceCodeScanner(source);
        for (Code code : scanner.readAll()) {
            System.out.println(code);
            System.out.println("-----------");
        }
    }
}
class SourceCodeScanner {
    private File source;
    // A header line starts with optional whitespace followed by one of the keywords
    private Pattern startCodePattern = Pattern.compile(
            "^(\\s)*(program|function|process)", Pattern.CASE_INSENSITIVE);

    public SourceCodeScanner(File source) {
        this.source = source;
    }

    public Collection<Code> readAll() throws Exception {
        List<String> lines = readFileLineByLine();
        List<Code> codes = new ArrayList<Code>();
        StringBuilder builder = new StringBuilder(512);
        for (String line : lines) {
            // A new header closes the block collected so far
            if (containsSourceCodeHeader(line)) {
                int length = builder.length();
                if (length != 0) {
                    codes.add(new Code(builder.toString().trim()));
                    builder.delete(0, length);
                }
            }
            addNextLineOfSourceCode(builder, line);
        }
        // Flush the last block, which has no following header
        String lastCode = builder.toString();
        if (containsSourceCodeHeader(lastCode)) {
            codes.add(new Code(builder.toString().trim()));
        }
        return codes;
    }

    private boolean containsSourceCodeHeader(String line) {
        return startCodePattern.matcher(line).find();
    }

    private void addNextLineOfSourceCode(StringBuilder builder, String line) {
        builder.append(line);
        builder.append(IOUtils.LINE_SEPARATOR);
    }

    private List<String> readFileLineByLine() throws Exception {
        // try-with-resources closes the stream even if reading fails
        try (FileInputStream fileInputStream = new FileInputStream(source)) {
            return IOUtils.readLines(new BufferedInputStream(fileInputStream));
        }
    }
}
class Code {
    private String value;

    public Code(String value) {
        this.value = value;
    }

    public String getValue() {
        return value;
    }

    @Override
    public String toString() {
        return value;
    }
}
Upvotes: 1
Reputation: 77454
You should consider using an LL parser instead of a regexp for this. Regular expressions are NOT the proper answer to every parsing need; they only fit regular languages. Nested braces like yours need a context-free grammar, and for that you should use a parser such as an LL parser.
https://en.wikipedia.org/wiki/LL_parser
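As a rough illustration of why nesting needs more than a regex, here is a minimal sketch of a hand-written scanner that tracks brace depth. It is not a full LL parser, only the depth counting that a regular expression cannot express. The class name `BlockExtractor` and the method `extractBodies` are hypothetical, and the sketch assumes that braces never appear inside string literals or comments and that header keywords do not need to be validated.

```java
import java.util.ArrayList;
import java.util.List;

public class BlockExtractor {
    // Returns the text between each top-level '{' and its matching '}'.
    // Nested blocks stay inside the body of their enclosing block.
    static List<String> extractBodies(String input) {
        List<String> bodies = new ArrayList<>();
        int depth = 0;       // current brace nesting level
        int bodyStart = -1;  // index just after the top-level '{'
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '{') {
                if (depth == 0) {
                    bodyStart = i + 1;
                }
                depth++;
            } else if (c == '}') {
                if (depth == 0) {
                    throw new IllegalArgumentException("Unmatched '}' at index " + i);
                }
                depth--;
                if (depth == 0) {
                    bodies.add(input.substring(bodyStart, i).trim());
                }
            }
        }
        if (depth != 0) {
            throw new IllegalArgumentException("Unclosed '{'");
        }
        return bodies;
    }

    public static void main(String[] args) {
        // Assumed sample input, modelled on the file in the question.
        String input = "program A {\nint x = 10;\ntuple date {\nint day;\n}\n}\n"
                + "function B {\nint y = 20;\n}\n";
        for (String body : extractBodies(input)) {
            System.out.println(body);
            System.out.println("-----------");
        }
    }
}
```

Because the counter only closes a block when the depth returns to zero, the nested `tuple date { ... }` stays inside the body of program A instead of ending it early.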
Upvotes: 4