Dove
Dove

Reputation: 1267

Regex Question - One or more spaces outside of a quote enclosed block of text

I want to be replace any occurrence of more than one space with a single space, but take no action in text between quotes.

Is there any way of doing this with a Java regex? If so, can you please attempt it or give me a hint?

Upvotes: 2

Views: 3595

Answers (7)

Jeff Hillman
Jeff Hillman

Reputation: 7598

When trying to match something that can be contained within something else, it can be helpful to construct a regular expression that matches both, like this:

("[^"\\]*(?:\\.[^"\\]*)*")|(  +)

This will match a quoted string or two or more spaces. Because the two expressions are combined, it will match a quoted string OR two or more spaces, but not spaces within quotes. Using this expression, you will need to examine each match to determine if it is a quoted string or two or more spaces and act accordingly:

Pattern spaceOrStringRegex = Pattern.compile( "(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\")|(  +)" );

StringBuffer replacementBuffer = new StringBuffer();

Matcher spaceOrStringMatcher = spaceOrStringRegex.matcher( text );

while ( spaceOrStringMatcher.find() ) 
{
    // if the space group is the match
    if ( spaceOrStringMatcher.group( 2 ) != null ) 
    {
        // replace with a single space
        spaceOrStringMatcher.appendReplacement( replacementBuffer, " " );
    }
}

spaceOrStringMatcher.appendTail( replacementBuffer );

Upvotes: 2

Alan Moore
Alan Moore

Reputation: 75222

Here's another approach, that uses a lookahead to determine that all quotation marks after the current position come in matched pairs.

text = text.replaceAll("  ++(?=(?:[^\"]*+\"[^\"]*+\")*+[^\"]*+$)", " ");

If needed, the lookahead can be adapted to handle escaped quotation marks inside the quoted sections.

Upvotes: 4

Alan Moore
Alan Moore

Reputation: 75222

Jeff, you're on the right track, but there are a few errors in your code, to wit: (1) You forgot to escape the quotation marks inside the negated character classes; (2) The parens inside the first capturing group should have been of the non-capturing variety; (3) If the second set of capturing parens doesn't participate in a match, group(2) returns null, and you're not testing for that; and (4) If you test for two or more spaces in the regex instead of one or more, you don't need to check the length of the match later on. Here's the revised code:

import java.util.regex.*;

public class Test
{
  public static void main(String[] args) throws Exception
  {
    String text = "blah    blah  \"boo   boo boo\"  blah  blah";
    Pattern p = Pattern.compile( "(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\")|(  +)" );
    StringBuffer sb = new StringBuffer();
    Matcher m = p.matcher( text );
    while ( m.find() ) 
    {
      if ( m.group( 2 ) != null ) 
      {
        m.appendReplacement( sb, " " );
      }
    }
    m.appendTail( sb );
    System.out.println( sb.toString() );
  }
}

Upvotes: 0

Dov Wasserman
Dov Wasserman

Reputation: 2682

After you parse out the quoted content, run this on the rest, in bulk or piece by piece as necessary:

String text = "ABC   DEF GHI   JKL";
text = text.replaceAll("( )+", " ");
// text: "ABC DEF GHI JKL"

Upvotes: 0

PabloG
PabloG

Reputation: 26715

Personally, I don't use Java, but this RegExp could do the trick:

([^\" ])*(\\\".*?\\\")*

Trying the expression with RegExBuddy, it generates this code, looks fine to me:

try {
    Pattern regex = Pattern.compile("([^\" ])*(\\\".*?\\\")*", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    Matcher regexMatcher = regex.matcher(subjectString);
    while (regexMatcher.find()) {
        for (int i = 1; i <= regexMatcher.groupCount(); i++) {
            // matched text: regexMatcher.group(i)
            // match start: regexMatcher.start(i)
            // match end: regexMatcher.end(i)

            // I suppose here you must use something like
            // sstr += regexMatcher.group(i) + " "
        }
    }
} catch (PatternSyntaxException ex) {
    // Syntax error in the regular expression
}

At least, it seems to work fine in Python:

import re

text = """
este  es   un texto de   prueba "para ver  como se comporta  " la funcion   sobre esto
"para ver  como se comporta  " la funcion   sobre esto  "o sobre otro" lo q sea
"""

ret = ""
print text  

reobj = re.compile(r'([^\" ])*(\".*?\")*', re.IGNORECASE)

for match in reobj.finditer(text):
    if match.group() <> "":
        ret = ret + match.group() + "|"

print ret

Upvotes: 0

Niniki
Niniki

Reputation: 810

Tokenize it and emit a single space between tokens. A quick google for "java tokenizer that handles quotes" turned up: this link

YMMV

edit: SO didn't like that link. Here's the google search link: google. It was the first result.

Upvotes: 0

anjanb
anjanb

Reputation: 13857

text between quotes : Are the quotes within the same line or multiple lines ?

Upvotes: 0

Related Questions