CMZS

Reputation: 601

Java regular expression: conditionally split a string by capital letters

I am not familiar with regular expressions, so maybe this is a simple problem. Given a string

XYZHelloWorldT

I need to return a string array:

{XYZ Hello World T}

That is, extract every word that starts with exactly one capital letter followed by one or more lowercase letters. A run of several capital letters ends just before the capital letter that starts such a word. Whatever remains between those words forms the other elements of the string array.

I can work on the characters directly, but I wonder whether I could do it with a regular expression passed directly to String's split method. I found "Java: Split string when an uppercase letter is found", but I am not sure how to use it to solve my problem. Thanks.

Upvotes: 1

Views: 279

Answers (2)

Marcelo Ferreira

Reputation: 466

This is an algorithm in Java that finds these words. It is not recommended for large texts, and it does not handle digits or whitespace.

public class TestString
{
	static int	i	= 0, lenght;
	static char array[];
	
	public static void main(String[] args){
		String result = "XYZHelloWorldTRTTTePoPoIiiiiiooY";
		array = result.toCharArray();
		lenght=array.length;

		StringBuffer words = new StringBuffer();
		for(; i< lenght; i++){
			words.append(makeArray());
		}
		String resultOut[]= words.toString().split(",");
		for(String key: resultOut){
			System.out.println(key);
		}
		System.exit(0);
	}

	private static String makeArray()
	{
		StringBuffer word = new StringBuffer();
		String upper, normal;
		boolean lower=false;
		for(; i< lenght; ++i){
			word.append(array[i]);
			if(i<lenght-2){
				upper=String.valueOf(array[i+1]).toUpperCase();
				normal=String.valueOf(array[i+1]);
				if(upper.equals(normal)){
					upper=String.valueOf(array[i+2]).toUpperCase();
					normal=String.valueOf(array[i+2]);
					if(upper.equals(normal)){
						if(lower){
							break;
						}
						continue;
						
					}else{
						break;
					}
				}else{
					lower=true;
					continue;
				}
			}else{
				if(lower && i<lenght-1){
					String lowerStr=String.valueOf(array[i+1]).toLowerCase();
					normal=String.valueOf(array[i+1]);
					if(lowerStr.equals(normal)){
						continue;
					}else{
						break;
						
					}
				}
				break;
			}
		}
		word.append(",");
		return word.toString();
	}
}

what's your plan to use this regex in my algorithm?

Upvotes: 0

ndnenkov

Reputation: 36100

Since you can have multiple consecutive uppercase letters, you need a lookbehind for a lowercase letter as well as a lookahead for an uppercase letter, plus a second case for the end of an uppercase run:

(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])

If you want support for other languages, use Unicode categories instead. Java's POSIX classes \p{Lower} and \p{Upper} match only ASCII letters unless the UNICODE_CHARACTER_CLASS flag is set:

(?<=\\p{Ll})(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}\\p{Ll})

The first alternative matches at a position between a lowercase and an uppercase letter. The second matches at a position between an uppercase letter and another uppercase letter that is itself followed by a lowercase letter.
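
To show how these zero-width patterns plug into a split call, here is a minimal sketch. The class name CamelSplit and the methods splitAscii and splitUnicode are made up for illustration. The Unicode pattern uses the general categories \p{Lu} and \p{Ll}, which is an assumption about what "other languages" should cover.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class CamelSplit {
    // ASCII-only boundary pattern from the answer above.
    // Both alternatives are zero-width, so no characters are consumed by the split.
    private static final Pattern ASCII_BOUNDARY =
        Pattern.compile("(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])");

    // Unicode variant (assumption): general categories Lu (uppercase) and Ll (lowercase).
    private static final Pattern UNICODE_BOUNDARY =
        Pattern.compile("(?<=\\p{Ll})(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}\\p{Ll})");

    // Hypothetical helper: split on ASCII case boundaries.
    public static String[] splitAscii(String s) {
        return ASCII_BOUNDARY.split(s);
    }

    // Hypothetical helper: split on Unicode case boundaries.
    public static String[] splitUnicode(String s) {
        return UNICODE_BOUNDARY.split(s);
    }

    public static void main(String[] args) {
        // prints [XYZ, Hello, World, T]
        System.out.println(Arrays.toString(splitAscii("XYZHelloWorldT")));
        // The input below starts with an accented capital E, written as an escape
        // to avoid depending on the source file encoding.
        System.out.println(Arrays.toString(splitUnicode("\u00C9coleParis")));
    }
}
```

Precompiling the Pattern avoids recompiling the regex on every call. For a one-off, "XYZHelloWorldT".split(regex) with the same regex string should behave the same way.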

Upvotes: 3
