neptune

Reputation: 1420

Getting dialogue snippets from text using regular expressions

I'm trying to extract snippets of dialogue from a book text. For example, if I have the string

"What's the matter with the flag?" inquired Captain MacWhirr. "Seems all right to me."

Then I want to extract "What's the matter with the flag?" and "Seems all right to me."

I found a regular expression to use here, which is "[^"\\]*(\\.[^"\\]*)*". This works great in Eclipse when I'm doing a Ctrl+F find regex on my book .txt file, but when I run the following code:

String regex = "\"[^\"\\\\]*(\\\\.[^\"\\\\]*)*\"";
String bookText = "\"What's the matter with the flag?\" inquired Captain MacWhirr. \"Seems all right to me.\"";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(bookText);

if(m.find())
 System.out.println(m.group(1));

The only thing that prints is null. So am I not converting the regex into a Java string properly? Do I need to take into account the fact that Java Strings have a \" for the double quotes?

Upvotes: 1

Views: 781

Answers (1)

polygenelubricants

Reputation: 383866

In natural language text, it's unlikely that " is escaped by a preceding backslash, so you should be able to use just the pattern "([^"]*)".

As a Java string literal, this is "\"([^\"]*)\"".

Here it is in Java:

String regex = "\"([^\"]*)\"";
String bookText = "\"What's the matter with the flag?\" inquired Captain MacWhirr. \"Seems all right to me.\"";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(bookText);

while (m.find()) {
    System.out.println(m.group(1));
}

The above prints:

What's the matter with the flag?
Seems all right to me.
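
If you'd rather collect the snippets than print them, something like the following should work. This is only a sketch: the class name DialogueExtractor and the method extract are names I made up for illustration, and it assumes the simple pattern above is good enough for your text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is illustrative only.
class DialogueExtractor {
    // The simple pattern from above: a quote, any run of non-quote characters, a quote.
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]*)\"");

    // Returns the text between each pair of double quotes, in order of appearance.
    static List<String> extract(String text) {
        List<String> snippets = new ArrayList<>();
        Matcher m = QUOTED.matcher(text);
        while (m.find()) {
            snippets.add(m.group(1)); // group 1 is the text without the quotes
        }
        return snippets;
    }

    public static void main(String[] args) {
        String bookText = "\"What's the matter with the flag?\" inquired Captain MacWhirr. \"Seems all right to me.\"";
        for (String snippet : extract(bookText)) {
            System.out.println(snippet);
        }
    }
}
```

Compiling the pattern once in a static field avoids recompiling it for every call.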

On escape sequences

Given this declaration:

String s = "\"";
System.out.println(s.length()); // prints "1"

The string s has only one character, ". The \" is an escape sequence that exists only at the Java source code level; the string itself contains no backslash.
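
One way to see what the regex engine actually receives is to print the string. A small sketch (the class name EscapeDemo and the constant names are just for illustration):

```java
// Hypothetical demo class; the names are illustrative only.
class EscapeDemo {
    // Java source literal for the regex "([^"]*)"
    static final String SIMPLE = "\"([^\"]*)\"";

    // Four backslashes in source become two in the string,
    // which the regex engine reads as one literal backslash.
    static final String BACKSLASH = "\\\\";

    public static void main(String[] args) {
        System.out.println(SIMPLE);             // prints "([^"]*)"
        System.out.println(SIMPLE.length());    // prints 9
        System.out.println(BACKSLASH.length()); // prints 2
    }
}
```

So there are two layers of escaping: the Java compiler consumes one, and the regex engine consumes the other.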


The problem with the original code

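There's actually nothing wrong with the pattern per se, but you're not capturing the right portion. Group 1 wraps only the repeated escape-sequence part, (\\.[^"\\]*)*, which matches zero times when the text contains no backslashes, so m.group(1) returns null. Here's the pattern with the capturing group around the quoted text instead: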

String regex = "\"([^\"\\\\]*(?:\\\\.[^\"\\\\]*)*)\"";
String bookText = "\"What's the matter?\" inquired Captain MacWhirr. \"Seems all right to me.\"";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(bookText);

while (m.find()) {
    System.out.println(m.group(1));
}
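
To see the difference for yourself, you can run the original pattern and compare group() with group(1). This is a minimal sketch; the class name GroupDemo and the sample sentence are mine:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name is illustrative only.
class GroupDemo {
    // The original pattern from the question, unchanged.
    static final Pattern ORIGINAL =
            Pattern.compile("\"[^\"\\\\]*(\\\\.[^\"\\\\]*)*\"");

    public static void main(String[] args) {
        Matcher m = ORIGINAL.matcher("\"Seems all right to me.\" said the captain.");
        if (m.find()) {
            System.out.println(m.group());  // whole match, quotes included
            System.out.println(m.group(1)); // null: group 1 matched zero times
        }
    }
}
```

The match itself succeeds; it's only group 1 that never takes part in it.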

For visual comparison, here's the original pattern, as a Java string literal:

String regex = "\"[^\"\\\\]*(\\\\.[^\"\\\\]*)*\""
                            ^^^^^^^^^^^^^^^^^
                           why capture this part?

And here's the modified pattern:

String regex = "\"([^\"\\\\]*(?:\\\\.[^\"\\\\]*)*)\""
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                    we want to capture this part!

As mentioned before, though: this complicated pattern isn't necessary for natural language text, which isn't likely to contain escaped quotes.
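
If your text does contain backslash-escaped quotes (source code, say, rather than a novel), the two patterns behave differently. Here's a rough comparison; the class name PatternComparison, the helper extract, and the sample input are all made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class PatternComparison {
    // Simple pattern: stops at the first double quote it sees.
    static final Pattern SIMPLE = Pattern.compile("\"([^\"]*)\"");

    // Escape-aware pattern: treats backslash plus any character as one unit.
    static final Pattern ESCAPE_AWARE =
            Pattern.compile("\"([^\"\\\\]*(?:\\\\.[^\"\\\\]*)*)\"");

    // Collects group 1 of every match of the given pattern.
    static List<String> extract(Pattern p, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Actual characters: He typed "say \"hi\" now" and left.
        String text = "He typed \"say \\\"hi\\\" now\" and left.";
        System.out.println(extract(SIMPLE, text));       // splits at the escaped quotes
        System.out.println(extract(ESCAPE_AWARE, text)); // keeps the quoted text whole
    }
}
```

On this input the simple pattern breaks the quotation apart at each escaped quote, while the escape-aware one returns it as a single snippet.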

Upvotes: 5
