Reputation: 5092
Say I have a String like this:
String s="social network such as '''[http://www.facebook.com Facebook]''' , "+
"'''[http://www.twitter.com Twitter]''' and '''[http://www.tumblr.com tumblr]'''";
and I need to retrieve only the Strings within '''[ and ]'''.
example output:
http://www.facebook.com Facebook, http://www.twitter.com Twitter, http://www.tumblr.com tumblr
I'm having difficulty doing this with regex, so I came up with this idea using recursion:
System.out.println(filter(s, "'''[", "]'''"));
....
public static String filter(String s, String open, String close){
int start = s.indexOf(open);
int end = s.indexOf(close);
filtered = filtered + s.substring(start + open.length(), end) + ", ";
s = s.substring(end + close.length(), s.length());
if(s.indexOf(open) >= 0 && s.indexOf(close) >= 0)
return filter(s, open, close);
else
return filtered.substring(0, filtered.length() - 2);
}
But in some cases, where the opening and closing delimiters are the same, such as retrieving words within ''' and ''', it throws a String index out of range error because start and end hold the same value.
How can I overcome this? Is regex the only solution?
Upvotes: 1
Views: 102
Reputation: 11733
You can use the string tokenizer for this very easily. Simply hand the whole string to the tokenizer then ask for each token and check if it begins with your delimiter. If it does, extract the contents into your results collection.
The string tokenizer version will be less upped and not as ugly as the regex solution.
Here is the tokenizer version:
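import java.util.StringTokenizer;
import org.junit.Test; // assuming JUnit 4; with JUnit 5 use org.junit.jupiter.api.Test
public class TokenizerTest {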
@Test
public void canExtractNamesFromTokens(){
String openDelimiter = "'''[";
String closeDelimiter = "]'''";
String s="social network such as '''[http://www.facebook.com Facebook]''' , "+
"'''[http://www.twitter.com Twitter]''' and '''[http://www.tumblr.com tumblr]'''";
StringTokenizer t = new StringTokenizer(s);
while (t.hasMoreElements()){
String token = t.nextToken();
if (token.startsWith(openDelimiter)){
String url = token.substring(openDelimiter.length());
token = t.nextToken();
String siteName = token.substring(0, token.length()-closeDelimiter.length());
System.out.println(url + " " + siteName);
}
}
}
}
Not sure how this could get any simpler or cleaner. Absolutely clear what the code is doing.
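The description above mentions extracting the contents into a results collection, while the test only prints them. A minimal sketch of the collecting variant follows; the class name TokenizerCollect and the method extract are hypothetical, and it assumes the delimiters are attached to whitespace-separated tokens, so runs of whitespace inside a match collapse to a single space.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TokenizerCollect {
    // Hypothetical helper: gathers the text between open and close into a list.
    static List<String> extract(String s, String open, String close) {
        List<String> results = new ArrayList<>();
        StringTokenizer t = new StringTokenizer(s); // default delimiters: whitespace
        while (t.hasMoreTokens()) {
            String token = t.nextToken();
            if (!token.startsWith(open)) {
                continue; // not the start of a delimited item
            }
            StringBuilder item = new StringBuilder(token.substring(open.length()));
            // Keep consuming tokens until one ends with the closing delimiter.
            while (!item.toString().endsWith(close) && t.hasMoreTokens()) {
                item.append(' ').append(t.nextToken());
            }
            String text = item.toString();
            if (text.endsWith(close)) {
                results.add(text.substring(0, text.length() - close.length()));
            }
        }
        return results;
    }
}
```

Unlike the test above, this does not assume each item is exactly two tokens long, and it silently skips an opening delimiter that is never closed.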
Upvotes: 0
Reputation: 425033
Never mind all that code in other answers... You can do it in one line:
String[] urls = str.replaceAll("^.*?'''\\[|\\]'''(?!.*\\]''').*", "").split("\\]'''.*?'''\\[");
This first strips off the leading and trailing jetsam and then splits on a delimiter that matches everything between the targets.
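To make the two stages easier to follow, here is the same one-liner split into separate statements. The class name OneLinerSteps and the method extractUrls are hypothetical names used only for illustration; the regexes are the ones from the line above.

```java
class OneLinerSteps {
    // Hypothetical wrapper around the one-liner, with each stage on its own line.
    static String[] extractUrls(String str) {
        // Stage 1: remove everything up to and including the first '''[
        // and everything from the last ]''' to the end of the string.
        String trimmed = str.replaceAll("^.*?'''\\[|\\]'''(?!.*\\]''').*", "");
        // Stage 2: split on each "]''' ... '''[" run that separates two targets.
        return trimmed.split("\\]'''.*?'''\\[");
    }
}
```

Because `.` does not match line terminators by default, this sketch assumes the input is a single line.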
This can be adapted to a flexible solution that has variable delimiters:
public static String[] extract(String str, String open, String close) {
return str.replaceAll("^.*?(\\Q" + open + "\\E|$)|\\Q" + close + "\\E(?!.*\\Q" + close + "\\E).*", "").split("\\Q" + close + "\\E.*?\\Q" + open + "\\E");
}
This regex also caters for there being no targets by returning an array with a single blank element.
P.S. this is the first time I can recall using the quote syntax \Q...\E to treat characters in the regex as literals, so I'm chuffed about that.
I would also like to claim some bragging rights for typing the whole thing on my iPhone (note that means there could be a character or two out of place, but it should be pretty close).
Upvotes: 2
Reputation: 25297
Regex is the right tool for this. Use Pattern and Matcher.
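import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static String filter(String s, String open, String close){
    // Quote the delimiters so they match literally; (.*?) lazily captures the text between them
    Pattern p = Pattern.compile(Pattern.quote(open) + "(.*?)" + Pattern.quote(close));
    Matcher m = p.matcher(s);
    StringBuilder filtered = new StringBuilder();
    while (m.find()){
        filtered.append(m.group(1)).append(", ");
    }
    if (filtered.length() == 0) return ""; // no matches, so there is no trailing ", " to trim
    return filtered.substring(0, filtered.length() - 2); //-2 because trailing ", "
}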
Pattern.quote ensures that any special regex characters in open and close are treated as literal ones.
m.group(1) returns the text captured by the first group in the most recent match found by m.find().
m.find() finds the next substring that matches the regex; calling it in a loop visits every match.
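As a sketch of how these three pieces behave when the opening and closing delimiters are identical (the case from the question), the variant below returns a list instead of a joined String. The class name RegexExtract and the method extract are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexExtract {
    // Hypothetical helper: returns every piece of text found between open and close.
    static List<String> extract(String s, String open, String close) {
        // Pattern.quote turns the delimiters into literals, so '[' and ']' are safe.
        Pattern p = Pattern.compile(Pattern.quote(open) + "(.*?)" + Pattern.quote(close));
        Matcher m = p.matcher(s);
        List<String> out = new ArrayList<>();
        // Each find() resumes after the previous match, so a closing delimiter
        // is never reused as the next opening delimiter.
        while (m.find()) {
            out.add(m.group(1)); // group 1 is the lazily captured text
        }
        return out;
    }
}
```

Returning a list avoids the trailing ", " bookkeeping; String.join(", ", list) produces the same joined output when needed.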
Note: in both of the index-based versions below (recursive and iterative), end is assigned s.indexOf(close, start + 1), using String#indexOf(String, int) and StringBuilder#indexOf(String, int), so that even if the open and close values are the same, no error occurs.
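A small sketch of why the second argument matters when open and close are the same String. The class IndexOfDemo and its two methods are hypothetical and exist only to contrast the two searches.

```java
class IndexOfDemo {
    // Searching from the beginning: with identical delimiters this finds
    // the opening delimiter again, so start and end come out equal.
    static int naiveEnd(String s, String close) {
        return s.indexOf(close);
    }

    // Searching from start + 1: the opening delimiter is skipped, so the
    // result is the next occurrence (or -1 if there is none).
    static int offsetEnd(String s, String open, String close) {
        int start = s.indexOf(open);
        return start == -1 ? -1 : s.indexOf(close, start + 1);
    }
}
```

For the input '''abc''' with ''' as both delimiters, the first method yields 0 and the second yields 6, so substring(start + open.length(), end) has a valid range only in the second case.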
Recursion:
public static String filter(String s, String open, String close){
int start = s.indexOf(open);
int end = s.indexOf(close, start + 1);
//I took the liberty of adding "String" and renaming your variable
String get = s.substring(start + open.length(), end);
s = s.substring(end + close.length());
if (s.indexOf(open) == -1){
return get;
}
return get + ", " + filter(s, open, close);
}
Rather than adding the ", " right off the bat, it is a little easier to deal with it later. Also, note that s.substring(end + close.length(), s.length()) is the same as s.substring(end + close.length()). Also, I feel that it is neater to see if s.indexOf(...) == -1 rather than checking for >= 0.
The real problem lies in the way you treat filtered. First of all, you need to declare filtered as type String. Next, since you are doing recursion, you shouldn't concatenate to filtered; assign it instead. That makes the line where we first see filtered:
String filtered = s.substring(start + open.length(), end) + ", ";
With that line fixed, and the result of the recursive call appended to filtered rather than returned on its own, your solution works.
Iterative:
public static String filter(String str, String open, String close){
int open_length = open.length();
int close_length = close.length();
StringBuilder s = new StringBuilder(str);
StringBuilder filtered = new StringBuilder();
for (int start = s.indexOf(open), end = s.indexOf(close, start + 1); start != -1;
start = s.indexOf(open), end = s.indexOf(close, start + 1)){
filtered.append(s.substring(start + open_length, end)).append(", ");
s.delete(0, end + close_length);
}
return filtered.substring(0, filtered.length() - 2); //trailing ", "
}
This iterative method makes use of StringBuilder, but the same can be done without it. It makes two StringBuilders: an empty one, and one that holds the value of the original String. In the for loop:
int start = s.indexOf(open), end = s.indexOf(close, start + 1) finds the initial indices.
start != -1 ends the loop once s no longer contains open.
start = s.indexOf(open), end = s.indexOf(close, start + 1) finds the indices again after each iteration.
The inside of the loop appends the correct substring to filtered and removes the consumed part from the other StringBuilder.
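Since the same can be done without StringBuilder, here is one possible sketch that scans the original String with a moving offset instead of deleting the consumed prefix. The class name IndexScan is hypothetical, and searching for close from start + open.length() (rather than start + 1) is a choice made for this sketch, not something taken from the code above.

```java
import java.util.ArrayList;
import java.util.List;

class IndexScan {
    // Hypothetical variant of filter that never copies or mutates the input.
    static String filter(String s, String open, String close) {
        List<String> found = new ArrayList<>();
        int from = 0; // where the next search for open begins
        while (true) {
            int start = s.indexOf(open, from);
            if (start == -1) {
                break; // no further opening delimiter
            }
            int contentStart = start + open.length();
            int end = s.indexOf(close, contentStart);
            if (end == -1) {
                break; // opening delimiter without a matching close
            }
            found.add(s.substring(contentStart, end));
            from = end + close.length(); // continue after the closing delimiter
        }
        // String.join handles the separators, so there is no trailing ", " to trim.
        return String.join(", ", found);
    }
}
```

With no matches this returns an empty String instead of throwing, and it works when open and close are the same.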
Upvotes: 2