Reputation: 3238
question related to this
I have a string
a\;b\\;c;d
which in Java looks like
String s = "a\\;b\\\\;c;d";
I need to split it by semicolons with the following rules:
If a semicolon is preceded by a backslash, it should not be treated as a separator (between a and b).
If the backslash itself is escaped and therefore does not escape the semicolon, that semicolon should be a separator (between b and c).
So a semicolon should be treated as a separator if it is preceded by either zero or an even number of backslashes.
For the example above, I want to get the following strings (double the backslashes for the Java compiler):
a\;b\\
c
d
Upvotes: 8
Views: 15999
Reputation: 12930
I do not trust regular expressions of any kind to detect those cases. I usually write a simple loop for such things. I'll sketch it in C, since it's ages since I last touched Java ;-)
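/* myString is a NUL-terminated char*; needs <stdio.h> and <string.h> */
int i, len, state;
char c;
for (len = strlen(myString), state = 0, i = 0; i < len; i++) {
    c = myString[i];
    if (state == 0) {
        if (c == '\\') {
            state++;    /* the next character is escaped */
        } else if (c == ';') {
            printf("; at offset %d\n", i);    /* unescaped separator */
        }
    } else {
        state--;        /* skip the escaped character */
    }
}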
The advantages are:
EDIT: I have added a complete C++ example for clarification.
#include <iostream>
#include <sstream>
#include <string>
#include <vector>
std::vector<std::string> unescapeString(const char* s)
{
    std::vector<std::string> result;
    std::stringstream ss;
    bool has_chars;   // true once the current token has received a character
    int state;        // 0 = normal, 1 = previous character was a backslash
    for (has_chars = false, state = 0;;) {
        auto c = *s++;
        if (state == 0) {
            if (!c) {
                // end of input: flush the last token, if any
                if (has_chars) result.push_back(ss.str());
                break;
            } else if (c == '\\') {
                ++state;
            } else if (c == ';') {
                // unescaped separator: emit the token (empty tokens are skipped)
                if (has_chars) {
                    result.push_back(ss.str());
                    has_chars = false;
                    ss.str("");
                }
            } else {
                ss << c;
                has_chars = true;
            }
        } else /* if (state == 1) */ {
            if (!c) {
                // trailing lone backslash: keep it literally
                ss << '\\';
                result.push_back(ss.str());
                break;
            }
            // escaped character: copy it without the backslash
            ss << c;
            has_chars = true;
            --state;
        }
    }
    return result;
}

int main(int argc, char* argv[])
{
    for (int i = 1; i < argc; ++i) {
        for (const auto& s: unescapeString(argv[i])) {
            std::cout << s << std::endl;
        }
    }
}
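Since the question is about Java, the same loop can be sketched there as well. This is only a minimal sketch, not the code from this answer: the class name EscapedSplit and the method split are made up for illustration. Unlike the C++ example, it keeps the backslashes and the empty fields, because that is what the expected output in the question shows.

```java
import java.util.ArrayList;
import java.util.List;

public class EscapedSplit {
    // Splits s at every ';' that is not escaped by a backslash.
    // Escape sequences are copied verbatim (assumption: the caller wants
    // the pieces exactly as in the question's expected output).
    static List<String> split(String s) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean escaped = false; // true when the previous char was an unescaped backslash
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (escaped) {
                current.append(c);   // escaped character: copy it, never split here
                escaped = false;
            } else if (c == '\\') {
                current.append(c);   // keep the backslash itself
                escaped = true;
            } else if (c == ';') {
                parts.add(current.toString());   // unescaped separator
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        parts.add(current.toString()); // last field (may be empty)
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("a\\;b\\\\;c;d")); // prints [a\;b\\, c, d]
    }
}
```

The loop never looks back, so any number of backslashes before a semicolon is handled without a length limit.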
Upvotes: 1
Reputation: 53
I think this is the real answer.
In my case I am trying to split on | and the escape character is &.
final String regx = "(?<!((?:[^&]|^)(&&){0,10000}&))\\|";
String[] res = "&|aa|aa|&|&&&|&&|s||||e|".split(regx);
System.out.println(Arrays.toString(res));
In this code I am using a lookbehind to handle the & escape character. Note that the lookbehind must have a maximum length.
(?<!((?:[^&]|^)(&&){0,10000}&))\\|
This means any | except those that follow ((?:[^&]|^)(&&){0,10000}&), and that part means any odd number of &s.
The part (?:[^&]|^) is important to make sure that you count all of the &s before the |, back to the beginning of the string or to some other character.
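For the backslash and semicolon case from the question, the same idea might look like the sketch below. The class name LookbehindSplit and the bound of 50 pairs are assumptions made here for illustration; the bound only needs to exceed the longest run of backslashes you expect.

```java
import java.util.Arrays;

public class LookbehindSplit {
    // Regex (before Java string escaping): (?<!(?:[^\\]|^)(?:\\\\){0,50}\\);
    // A ';' is a separator unless an odd number of backslashes precedes it.
    // The bound 50 is an arbitrary assumption for this sketch.
    static final String SEPARATOR = "(?<!(?:[^\\\\]|^)(?:\\\\\\\\){0,50}\\\\);";

    static String[] split(String s) {
        return s.split(SEPARATOR);
    }

    public static void main(String[] args) {
        // The question's example a\;b\\;c;d
        System.out.println(Arrays.toString(split("a\\;b\\\\;c;d")));
        // prints [a\;b\\, c, d]
    }
}
```

Note that String.split drops trailing empty strings by default; pass a negative limit if those matter.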
Upvotes: 0
Reputation: 5728
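This approach assumes that your string does not contain the char '\0'. If it does, you can use some other placeholder char. As written, it does not handle runs of three or more backslashes, or an escaped semicolon at the very start of the string.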
public static String[] split(String s) {
    String[] result = s.replaceAll("([^\\\\])\\\\;", "$1\0").split(";");
    for (int i = 0; i < result.length; i++) {
        result[i] = result[i].replaceAll("\0", "\\\\;");
    }
    return result;
}
Upvotes: 0
Reputation: 26930
String[] splitArray = subjectString.split("(?<!(?<!\\\\)\\\\);");
This should work.
Explanation :
// (?<!(?<!\\)\\);
//
// Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!(?<!\\)\\)»
// Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!\\)»
// Match the character “\” literally «\\»
// Match the character “\” literally «\\»
// Match the character “;” literally «;»
So you just match the semicolons not preceded by exactly one \.
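A small check of that first pattern, as a sketch (the class name SimpleLookbehindDemo is made up here). It gives the result the question asks for on the example string, but a run of three backslashes shows the limit of looking back only two characters:

```java
import java.util.Arrays;

public class SimpleLookbehindDemo {
    // Regex (before Java string escaping): (?<!(?<!\\)\\);
    static final String SEPARATOR = "(?<!(?<!\\\\)\\\\);";

    public static void main(String[] args) {
        // The question's example a\;b\\;c;d
        System.out.println(Arrays.toString("a\\;b\\\\;c;d".split(SEPARATOR)));
        // prints [a\;b\\, c, d]

        // Three backslashes before the semicolon: x\\\;y
        System.out.println(Arrays.toString("x\\\\\\;y".split(SEPARATOR)));
        // prints [x\\\, y]  (split here, although this semicolon is escaped)
    }
}
```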
EDIT :
String[] splitArray = subjectString.split("(?<!(?<!\\\\(\\\\\\\\){0,2000000})\\\\);");
This is intended to take care of any odd number of \. It will of course fail if you have more than about 4000000 \ characters in a row. Explanation of the edited answer:
// (?<!(?<!\\(\\\\){0,2000000})\\);
//
// Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!(?<!\\(\\\\){0,2000000})\\)»
// Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!\\(\\\\){0,2000000})»
// Match the character “\” literally «\\»
// Match the regular expression below and capture its match into backreference number 1 «(\\\\){0,2000000}»
// Between zero and 2000000 times, as many times as possible, giving back as needed (greedy) «{0,2000000}»
// Note: You repeated the capturing group itself. The group will capture only the last iteration. Put a capturing group around the repeated group to capture all iterations. «{0,2000000}»
// Match the character “\” literally «\\»
// Match the character “\” literally «\\»
// Match the character “\” literally «\\»
// Match the character “;” literally «;»
Upvotes: 0
Reputation: 336188
You can use the regex
(?:\\.|[^;\\]++)*
to match all text between unescaped semicolons:
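// needs java.util.* and java.util.regex.*
List<String> matchList = new ArrayList<String>();
try {
    Pattern regex = Pattern.compile("(?:\\\\.|[^;\\\\]++)*");
    Matcher regexMatcher = regex.matcher(subjectString);
    while (regexMatcher.find()) {
        matchList.add(regexMatcher.group());
    }
} catch (PatternSyntaxException ex) {
    // Syntax error in the regular expression
}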
Explanation:
(?:          # Match either...
  \\.        # any escaped character
|            # or...
  [^;\\]++   # any character(s) except semicolon or backslash; possessive match
)*           # Repeat any number of times.
The possessive match (++) is important to avoid catastrophic backtracking because of the nested quantifiers.
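Here is a self-contained sketch of the matching approach run on the string from the question. The class name MatchFields is made up, and the outer quantifier is changed from * to +, which is an assumption made here: with *, find() also returns an empty match at each separator, while with + only non-empty fields are returned (so genuinely empty fields are skipped).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchFields {
    // Same alternation as in the answer, repeated with + instead of *
    // (assumption for this sketch) so that find() never yields empty matches.
    private static final Pattern FIELD = Pattern.compile("(?:\\\\.|[^;\\\\]++)+");

    // Returns the non-empty fields found between unescaped semicolons.
    static List<String> fields(String s) {
        List<String> result = new ArrayList<>();
        Matcher m = FIELD.matcher(s);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(fields("a\\;b\\\\;c;d")); // prints [a\;b\\, c, d]
    }
}
```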
Upvotes: 9