Reputation: 16995
I'm trying to write a program that will return all the text between \begin{theorem} and \end{theorem} and between \begin{proof} and \end{proof}.
It seems natural to use regexes, but because the delimiters contain several potential metacharacters, they will need to be escaped.
Here's the code I have written:
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LatexTheoremProofExtractor {
    // This is the LaTeX source that will be processed
    private String source = null;

    // These are the list of theorems and proofs that are extracted, respectively
    private ArrayList<String> theorems = null;
    private ArrayList<String> proofs = null;

    // These are the patterns to match theorems and proofs, respectively
    private static final Pattern THEOREM_REGEX = Pattern.compile("\\begin\\{theorem\\}(.+?)\\end\\{theorem\\}");
    private static final Pattern PROOF_REGEX = Pattern.compile("\\begin\\{proof\\}(.+?)\\end\\{proof\\}");

    LatexTheoremProofExtractor(String source) {
        this.source = source;
    }

    public void parse() {
        extractEntity("theorem");
        extractEntity("proof");
    }

    private void extractTheorems() {
        if (theorems != null) {
            return;
        }
        theorems = new ArrayList<String>();
        final Matcher matcher = THEOREM_REGEX.matcher(source);
        while (matcher.find()) {
            theorems.add(new String(matcher.group(1)));
        }
    }

    private void extractProofs() {
        if (proofs != null) {
            return;
        }
        proofs = new ArrayList<String>();
        final Matcher matcher = PROOF_REGEX.matcher(source);
        while (matcher.find()) {
            proofs.add(new String(matcher.group(1)));
        }
    }

    private void extractEntity(final String entity) {
        if (entity.equals("theorem")) {
            extractTheorems();
        } else if (entity.equals("proof")) {
            extractProofs();
        } else {
            // TODO: Throw an exception or something
        }
    }

    public ArrayList<String> getTheorems() {
        return theorems;
    }
}
Below is my test, which fails:
@Test
public void testTheoremExtractor() {
    String source = "\\begin\\{theorem\\} Hello, World! \\end\\{theorem\\}";
    LatexTheoremProofExtractor extractor = new LatexTheoremProofExtractor(source);
    extractor.parse();
    ArrayList<String> theorems = extractor.getTheorems();
    assertEquals(theorems.get(0).trim(), "Hello, World!");
}
The test expects exactly one match, and it should be "Hello, World!" (after trimming).
Currently theorems is an empty, non-null list, so my Matchers aren't matching the pattern. Can anyone help me understand why?
Thanks, erip
Upvotes: 1
Views: 281
Reputation: 75272
There seems to be an error in your test code that the other answers don't address. You create the test string like this:
String source = "\\begin\\{theorem\\} Hello, World! \\end\\{theorem\\}";
...but in the text you say the source string is supposed to be:
\begin{theorem} Hello, World! \end{theorem}
If that's true, the string literal should be:
"\\begin{theorem} Hello, World! \\end{theorem}"
To create the regex, you would use:
Pattern.quote("\\begin{theorem}") + "(.*?)" + Pattern.quote("\\end{theorem}")
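As a rough, self-contained sketch of that Pattern.quote approach (assuming the input really is \begin{theorem} ... \end{theorem}): the class name TheoremQuoteDemo and the helper extract are made up for illustration, and the Pattern.DOTALL flag is my own addition, on the assumption that a theorem body may span several lines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TheoremQuoteDemo {
    // Pattern.quote makes the delimiters literal, so only the Java-level
    // escape ("\\" for one backslash) is needed inside the string literals.
    // DOTALL lets '.' match line breaks too (assumption: bodies may be multi-line).
    private static final Pattern THEOREM = Pattern.compile(
            Pattern.quote("\\begin{theorem}") + "(.*?)" + Pattern.quote("\\end{theorem}"),
            Pattern.DOTALL);

    // Hypothetical helper: returns the trimmed body of every theorem environment.
    static List<String> extract(String source) {
        List<String> bodies = new ArrayList<>();
        Matcher m = THEOREM.matcher(source);
        while (m.find()) {
            bodies.add(m.group(1).trim());
        }
        return bodies;
    }

    public static void main(String[] args) {
        String source = "\\begin{theorem} Hello, World! \\end{theorem}";
        System.out.println(extract(source)); // prints [Hello, World!]
    }
}
```

The reluctant (.*?) keeps each match inside a single environment when the source contains more than one theorem.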
Alternatively, you can escape it manually:
"\\\\begin\\{theorem\\}(.*?)\\\\end\\{theorem\\}"
Upvotes: 0
Reputation: 627507
Here is the update you need to make to your code: the two regexes in the extractor class should be changed to
private static final Pattern THEOREM_REGEX = Pattern.compile(Pattern.quote("\\begin\\{theorem\\}") + "(.+?)" + Pattern.quote("\\end\\{theorem\\}"));
private static final Pattern PROOF_REGEX = Pattern.compile(Pattern.quote("\\begin\\{proof\\}") + "(.+?)" + Pattern.quote("\\end\\{proof\\}"));
The result will be "Hello, World!". See IDEONE demo.
The string you have is actually \begin\{theorem\} Hello, World! \end\{theorem\}. Literal backslashes in Java string literals are doubled, and to match a literal backslash with a regex in Java you need to write \\\\. To avoid this backslash hell, Pattern.quote can help: it tells the regex engine to treat everything inside it as a literal.
More details about Pattern.quote can be found in the documentation:
Returns a literal pattern String for the specified String.
This method produces a String that can be used to create a Pattern that would match the string s as if it were a literal pattern. Metacharacters or escape sequences in the input sequence will be given no special meaning.
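To see the difference between the string as written in the test and the string that was probably intended, here is a small sketch. The class name QuoteInspection and the helper matchesLiterally are hypothetical, introduced only for this illustration.

```java
import java.util.regex.Pattern;

class QuoteInspection {
    // Hypothetical helper: true if `text` is exactly `literal`,
    // with `literal` treated as plain text via Pattern.quote.
    static boolean matchesLiterally(String literal, String text) {
        return Pattern.compile(Pattern.quote(literal)).matcher(text).matches();
    }

    public static void main(String[] args) {
        String asWritten = "\\begin\\{theorem\\}"; // what the test currently builds
        String intended = "\\begin{theorem}";      // what LaTeX source would contain

        System.out.println(asWritten); // prints \begin\{theorem\}
        System.out.println(intended);  // prints \begin{theorem}

        // A quoted pattern only matches the exact characters it was built from.
        System.out.println(matchesLiterally(asWritten, asWritten)); // prints true
        System.out.println(matchesLiterally(asWritten, intended));  // prints false
    }
}
```

So a pattern quoted from the test's literal matches the test's string, but it would not match real LaTeX input unless the extra backslashes before the braces are dropped from both.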
Upvotes: 1
Reputation: 786146
Your first regex needs to be:
Pattern THEOREM_REGEX = Pattern.compile("\\\\begin\\\\\\{theorem\\\\\\}(.+?)\\\\end\\\\\\{theorem\\\\\\}");
because matching a single literal backslash requires \\\\ in the Java string literal that holds the regex.
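It may also help to see why the original pattern compiles without complaint yet never matches: with only two backslashes, the regex engine receives \begin and \end, where \b is a word boundary and \e is the escape character (U+001B). The sketch below illustrates this against the \begin{theorem} form of the input; the class name EscapeLayers and the helper found are hypothetical.

```java
import java.util.regex.Pattern;

class EscapeLayers {
    // Hypothetical helper: true if the regex matches anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Runtime value: \begin{theorem} Hello \end{theorem}
        String input = "\\begin{theorem} Hello \\end{theorem}";

        // Java literal "\\begin" is the regex \begin: a word boundary, then "egin".
        System.out.println(found("\\begin", input));    // prints false
        System.out.println(found("\\begin", "egin"));   // prints true

        // Java literal "\\end" is the regex \end: the ESC character, then "nd".
        System.out.println(found("\\end", "\u001Bnd")); // prints true

        // Java literal "\\\\begin" is the regex \\begin: one literal backslash, then "begin".
        System.out.println(found("\\\\begin", input));  // prints true
    }
}
```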
Upvotes: 0