itsadok

Reputation: 29342

How to properly trim whitespace from a string in Java?

The JDK's String.trim() method is pretty naive: it only removes characters at or below U+0020, that is, the ASCII space and the ASCII control characters.

Apache Commons' StringUtils.strip() is slightly better, but uses the JDK's Character.isWhitespace(), which doesn't recognize non-breaking space as whitespace.
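A quick way to see both gaps, using only JDK calls. The class and method names are made up for this illustration:

```java
class JdkWhitespaceDemo {
    static final char NBSP = '\u00A0';

    // True when String.trim() leaves the given character in place
    // at both ends of a three-character string.
    static boolean survivesTrim(char c) {
        String s = c + "x" + c;
        return s.trim().length() == 3;
    }

    public static void main(String[] args) {
        System.out.println(survivesTrim(' '));            // false: U+0020 is trimmed
        System.out.println(survivesTrim(NBSP));           // true: U+00A0 is above U+0020
        System.out.println(Character.isWhitespace(NBSP)); // false: non-breaking spaces are excluded
        System.out.println(Character.isSpaceChar(NBSP));  // true: U+00A0 is a Unicode space separator
    }
}
```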

So what would be the most complete, Unicode-compatible, safe and proper way to trim a string in Java?

And incidentally, is there a better library than commons-lang that I should be using for this sort of stuff?

Upvotes: 46

Views: 25761

Answers (6)

Aleksandr Dubinsky

Reputation: 23515

This handles Unicode characters and doesn't require extra libraries:

String trimmed = original.replaceAll ("^\\p{IsWhite_Space}+|\\p{IsWhite_Space}+$", "");

A slight snag is that some related whitespace characters, which are listed in Wikipedia, lack the Unicode character property "WSpace=Y". These probably won't cause a problem, but you can easily add them to the character class too.
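As a sketch of that last point, assume you also want to trim U+200B (zero width space), U+2060 (word joiner) and U+FEFF (zero width no-break space). The class name and the choice of extra characters are assumptions made for this example; adjust the list to your needs. The pattern is compiled once instead of on every call:

```java
import java.util.regex.Pattern;

class RegexTrim {
    // Unicode White_Space plus a few related characters (an assumed selection).
    private static final String WS = "[\\p{IsWhite_Space}\\u200B\\u2060\\uFEFF]";

    // Matches a run of such characters at the start or at the end of the input.
    private static final Pattern EDGES = Pattern.compile("^" + WS + "+|" + WS + "+$");

    static String trim(String s) {
        return EDGES.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println("[" + trim("\uFEFF\u00A0 testing\u200B\n") + "]"); // [testing]
    }
}
```

Whitespace in the middle of the string is left alone, since both alternatives are anchored.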

Using almson-regex the regex will look like:

String trimmed = original.replaceAll (either (START_BOUNDARY + oneOrMore (WHITESPACE), oneOrMore (WHITESPACE) + END_BOUNDARY), "");

and will also cover the more relevant of the whitespace characters that lack the Unicode property.

Upvotes: -1

Ertuğrul Çetin

Reputation: 5231

I made a few small changes to Java's trim() method so that it also trims the non-breaking space (U+00A0), which is not an ASCII character. This method runs faster than most of the other implementations.

// Requires: import java.util.Objects;
public static String trimAdvanced(String value) {
    Objects.requireNonNull(value);

    int st = 0;
    int len = value.length();

    // Skip leading characters that trim() drops (<= U+0020) plus NBSP (U+00A0).
    // The bounds check has to guard both comparisons, hence the inner parentheses.
    while ((st < len) && ((value.charAt(st) <= ' ') || (value.charAt(st) == '\u00A0'))) {
        st++;
    }
    // Skip trailing characters the same way.
    while ((st < len) && ((value.charAt(len - 1) <= ' ') || (value.charAt(len - 1) == '\u00A0'))) {
        len--;
    }

    // Return the original instance when nothing was trimmed, as trim() does.
    return ((st > 0) || (len < value.length())) ? value.substring(st, len) : value;
}

Upvotes: 0

ZZ Coder

Reputation: 75496

It's really hard to define what constitutes whitespace. Sometimes I use non-breaking spaces just to make sure they don't get stripped. So it will be hard to find a library that does exactly what you want.

I use my own trim() if I want to trim every whitespace character. Here is the function I use to check for whitespace:

  public static boolean isWhitespace (int ch)
  {
    if (ch == ' ' || (ch >= 0x9 && ch <= 0xD))
      return true;
    if (ch < 0x85) // short-circuit optimization.
      return false;
    if (ch == 0x85 || ch == 0xA0 || ch == 0x1680 || ch == 0x180E)
      return true;
    if (ch < 0x2000 || ch > 0x3000)
      return false;
    return ch <= 0x200A || ch == 0x2028 || ch == 0x2029
      || ch == 0x202F || ch == 0x205F || ch == 0x3000;
  }
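The trim() built on this check isn't shown. A minimal sketch of what it might look like, written against any code-point predicate so that the isWhitespace(int) above could be passed in as a method reference. The class name and signature are assumptions, and the demo uses JDK checks as a stand-in predicate:

```java
import java.util.function.IntPredicate;

class PredicateTrim {
    // Strips leading and trailing code points accepted by the predicate.
    static String trim(String s, IntPredicate isWhitespace) {
        int start = 0;
        int end = s.length();
        // Advance past leading matches, one whole code point at a time.
        while (start < end) {
            int cp = s.codePointAt(start);
            if (!isWhitespace.test(cp)) break;
            start += Character.charCount(cp);
        }
        // Walk back over trailing matches the same way.
        while (start < end) {
            int cp = s.codePointBefore(end);
            if (!isWhitespace.test(cp)) break;
            end -= Character.charCount(cp);
        }
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        // Stand-in predicate for the demo; swap in your own isWhitespace(int).
        IntPredicate ws = cp -> Character.isWhitespace(cp) || Character.isSpaceChar(cp);
        System.out.println("[" + trim("\u3000\u00A0 testing \u2003", ws) + "]"); // [testing]
    }
}
```

Working on code points rather than chars only matters if the predicate ever accepts characters outside the Basic Multilingual Plane, but it costs little.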

Upvotes: 4

CrazyCoder

Reputation: 402265

Google has made guava-libraries available recently. It may have what you are looking for:

CharMatcher.inRange('\0', ' ').trimFrom(str)

is equivalent to String.trim(), but you can customize what to trim; refer to the JavaDoc.

For instance, it has its own definition of WHITESPACE, which differs from the JDK's and follows the latest Unicode standard (newer Guava releases expose it as the CharMatcher.whitespace() factory method rather than as a constant), so what you need can be written as:

CharMatcher.WHITESPACE.trimFrom(str)

Upvotes: 61

itsadok

Reputation: 29342

I swear I only saw this after I posted the question: Google just released Guava, a library of core Java utilities.

I haven't tried this yet, but from what I can tell, this is fully Unicode compliant:

String s = "  \t testing \u00a0";
s = CharMatcher.WHITESPACE.trimFrom(s);

Upvotes: 8

João Silva

Reputation: 91349

I've always found trim to work pretty well for almost every scenario.

However, if you really want to include more characters, you can edit the strip method from commons-lang so that it tests not only Character.isWhitespace but also Character.isSpaceChar, which seems to be what's missing. Namely, the following lines in stripStart and stripEnd, respectively:

  • while ((start != strLen) && Character.isWhitespace(str.charAt(start)))
  • while ((end != 0) && Character.isWhitespace(str.charAt(end - 1)))
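Put together, a standalone version of that idea might look like the sketch below. It is not the commons-lang source, just the same loop shape with the extra Character.isSpaceChar test; the class and helper names are made up for the example:

```java
class SpaceCharStrip {
    // Whitespace per the JDK, or any Unicode space separator such as U+00A0.
    static boolean isBlankChar(char c) {
        return Character.isWhitespace(c) || Character.isSpaceChar(c);
    }

    // Null-safe strip of both ends, in the spirit of StringUtils.strip().
    static String strip(String str) {
        if (str == null || str.isEmpty()) {
            return str;
        }
        int start = 0;
        int end = str.length();
        while ((start != end) && isBlankChar(str.charAt(start))) {
            start++;
        }
        while ((end != start) && isBlankChar(str.charAt(end - 1))) {
            end--;
        }
        return str.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println("[" + strip("\u00A0\t testing \u2007") + "]"); // [testing]
    }
}
```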

Upvotes: 2
