Baby

Reputation: 5092

Finding the part of a String that is wrapped in delimiters

Say I have a String like this:

String s="social network such as '''[http://www.facebook.com Facebook]''' , "+
"'''[http://www.twitter.com Twitter]''' and '''[http://www.tumblr.com tumblr]'''";

and I need to retrieve only the substrings between '''[ and ]'''.

Example output:

http://www.facebook.com Facebook, http://www.twitter.com Twitter, http://www.tumblr.com tumblr

I'm having difficulty doing this with regex, so I came up with this idea using recursion:

System.out.println(filter(s, "'''[",  "]'''"));
....

public static String filter(String s, String open, String close){   
  int start = s.indexOf(open);
  int end = s.indexOf(close);

  filtered = filtered + s.substring(start + open.length(), end) + ", ";
  s = s.substring(end + close.length(), s.length());

  if(s.indexOf(open) >= 0 && s.indexOf(close) >= 0)
     return filter(s, open, close);

  else
     return filtered.substring(0, filtered.length() - 2);
}

But in some cases, where the opening and closing delimiters are identical (for example, retrieving the words between ''' and '''), it throws a String index out of range exception because start and end hold the same value.
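
To make the failure concrete, here is a minimal reproduction. The class name, the helper firstBetween, and the short sample strings are made up for illustration; the index arithmetic is the same as in filter above.

```java
class SameDelimiterRepro {
    // Mirrors the index arithmetic in filter: both searches start from index 0.
    static String firstBetween(String s, String open, String close) {
        int start = s.indexOf(open);
        int end = s.indexOf(close);
        return s.substring(start + open.length(), end);
    }

    public static void main(String[] args) {
        // Distinct delimiters: the text between them is returned.
        System.out.println(firstBetween("x '''[a]''' y", "'''[", "]'''"));
        try {
            // Identical delimiters: start == end, so the begin index passed
            // to substring is larger than the end index.
            firstBetween("x '''a''' y", "'''", "'''");
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("out of range: " + e.getMessage());
        }
    }
}
```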

How can I overcome this? Is regex the only solution?

Upvotes: 1

Views: 102

Answers (3)

Rob

Reputation: 11733

You can use the string tokenizer for this very easily. Simply hand the whole string to the tokenizer then ask for each token and check if it begins with your delimiter. If it does, extract the contents into your results collection.

The string tokenizer version will be less upvoted and not as ugly as the regex solution.

Here is the tokenizer version:

import java.util.StringTokenizer;

import org.junit.Test;

public class TokenizerTest {

    @Test
    public void canExtractNamesFromTokens(){
        String openDelimiter = "'''[";
        String closeDelimiter = "]'''";
        String s="social network such as '''[http://www.facebook.com Facebook]''' , "+
            "'''[http://www.twitter.com Twitter]''' and '''[http://www.tumblr.com tumblr]'''";

        // The default tokenizer splits on whitespace
        StringTokenizer t = new StringTokenizer(s);

        while (t.hasMoreElements()){
            String token = t.nextToken();
            if (token.startsWith(openDelimiter)){
                // First token of a target: the URL, minus the opening delimiter
                String url = token.substring(openDelimiter.length());
                // Second token: the site name, minus the closing delimiter.
                // This assumes each target is exactly two whitespace-separated tokens.
                token = t.nextToken();
                String siteName = token.substring(0, token.length()-closeDelimiter.length());
                System.out.println(url + " " + siteName);
            }
        }
    }
}

Not sure how this could get any simpler or cleaner. Absolutely clear what the code is doing.

Upvotes: 0

Bohemian

Reputation: 425033

Never mind all that code in other answers... You can do it in one line:

String[] urls = str.replaceAll("^.*?'''\\[|\\]'''(?!.*\\]''').*", "").split("\\]'''.*?'''\\[");

This first strips off the leading and trailing jetsam and then splits on a delimiter that matches everything between the targets.
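
To see the two steps separately, here is a small sketch that runs them one at a time on the string from the question. The class and method names are only for illustration, and it assumes the input is a single line, since . does not match line terminators by default.

```java
import java.util.Arrays;

class StripThenSplit {
    // Step 1: remove everything up to and including the first '''[ and
    // everything from the last ]''' onwards.
    static String strip(String str) {
        return str.replaceAll("^.*?'''\\[|\\]'''(?!.*\\]''').*", "");
    }

    // Step 2: split on each "]''' ... '''[" run that separates two targets.
    static String[] split(String stripped) {
        return stripped.split("\\]'''.*?'''\\[");
    }

    public static void main(String[] args) {
        String str = "social network such as '''[http://www.facebook.com Facebook]''' , "
            + "'''[http://www.twitter.com Twitter]''' and '''[http://www.tumblr.com tumblr]'''";
        String stripped = strip(str);
        // The inner delimiters are still present after step 1.
        System.out.println(stripped);
        // Step 2 leaves only the three targets.
        System.out.println(Arrays.toString(split(stripped)));
    }
}
```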


This can be adapted to a flexible solution that has variable delimiters:

public static String[] extract(String str, String open, String close) {
    return str
        .replaceAll("^.*?(\\Q" + open + "\\E|$)|\\Q" + close + "\\E(?!.*\\Q" + close + "\\E).*", "")
        .split("\\Q" + close + "\\E.*?\\Q" + open + "\\E");
}

This regex also caters for there being no targets by returning an array with a single blank element.
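
As a quick usage sketch, the following calls extract with a different pair of delimiters and with an input that has no targets. The << and >> delimiters, the sample strings, and the class name are assumptions picked for the example; extract itself is copied from above so the snippet stands alone.

```java
import java.util.Arrays;

class ExtractDemo {
    // Copied from the answer above so the example is self-contained.
    static String[] extract(String str, String open, String close) {
        return str
            .replaceAll("^.*?(\\Q" + open + "\\E|$)|\\Q" + close + "\\E(?!.*\\Q" + close + "\\E).*", "")
            .split("\\Q" + close + "\\E.*?\\Q" + open + "\\E");
    }

    public static void main(String[] args) {
        // Two targets wrapped in << and >>.
        System.out.println(Arrays.toString(extract("a <<x>> b <<y>> c", "<<", ">>")));

        // No targets: the whole string is stripped, leaving one blank element.
        String[] none = extract("no targets here", "<<", ">>");
        System.out.println(none.length + " element, blank: " + none[0].isEmpty());
    }
}
```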

P.S. this is the first time I can recall using the quote syntax \Q...\E to treat characters in the regex as literals, so I'm chuffed about that.

I would also like to claim some bragging rights for typing the whole thing on my iPhone (note that means there could be a character or two out of place, but it should be pretty close).

Upvotes: 2

Justin

Reputation: 25297

Regex is the right tool for this. Use Pattern and Matcher.

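import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static String filter(String s, String open, String close){
    //quote the delimiters so characters like '[' are matched literally
    Pattern p = Pattern.compile(Pattern.quote(open) + "(.*?)" + Pattern.quote(close));
    Matcher m = p.matcher(s);

    StringBuilder filtered = new StringBuilder();

    while (m.find()){
        filtered.append(m.group(1)).append(", ");
    }
    if (filtered.length() == 0){
        return ""; //no matches, so there is no trailing ", " to strip
    }
    return filtered.substring(0, filtered.length() - 2); //-2 because trailing ", "
}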

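Pattern.quote ensures that any special regex characters in open and close are treated as literals.

m.group(1) returns the text captured by the first group, (.*?), in the match most recently found by m.find().

m.find() looks for the next substring that matches the regex and returns false once there are no more, which is what ends the while loop.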
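
If you would rather have the pieces themselves than one joined String, a variation along these lines should work. This is only a sketch: the class name RegexExtract and the method name between are made up, and the sample strings are shortened for readability.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexExtract {
    // Returns every substring found between open and close, in order.
    static List<String> between(String s, String open, String close) {
        Pattern p = Pattern.compile(Pattern.quote(open) + "(.*?)" + Pattern.quote(close));
        Matcher m = p.matcher(s);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group(1)); // text captured by (.*?)
        }
        return found;
    }

    public static void main(String[] args) {
        // Distinct delimiters.
        System.out.println(String.join(", ", between("x '''[a]''' y '''[b]'''", "'''[", "]'''")));
        // Identical delimiters pair up left to right.
        System.out.println(String.join(", ", between("'''a''' and '''b'''", "'''", "'''")));
        // No matches gives an empty list, so the join is an empty String.
        System.out.println(between("nothing here", "'''[", "]'''").isEmpty());
    }
}
```

Joining with String.join also sidesteps the trailing ", " bookkeeping.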


Non-regex Solutions:

Note: in both of these, end is assigned s.indexOf(close, start + 1), using String#indexOf(String, int) and StringBuilder#indexOf(String, int) so that even if the open and close values are the same, no error occurs.
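
As a quick illustration of that overload, here is a sketch with identical delimiters. The class name, the indices helper, and the sample string are just for this example.

```java
class FromIndexDemo {
    // Returns {start, end}, where end is searched for strictly after start.
    static int[] indices(String s, String open, String close) {
        int start = s.indexOf(open);
        int end = s.indexOf(close, start + 1);
        return new int[] {start, end};
    }

    public static void main(String[] args) {
        String s = "'''a''' and '''b'''";
        int[] idx = indices(s, "'''", "'''");
        // start is the first ''' and end is the next one after it,
        // so the substring between them is well defined.
        System.out.println(idx[0] + " " + idx[1]);
        System.out.println(s.substring(idx[0] + 3, idx[1]));
    }
}
```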

Recursion:

public static String filter(String s, String open, String close){
    int start = s.indexOf(open);
    int end = s.indexOf(close, start + 1);

    //I took the liberty of adding "String" and renaming your variable
    String get = s.substring(start + open.length(), end);
    s = s.substring(end + close.length());

    if (s.indexOf(open) == -1){
        return get;
    }
    return get + ", " + filter(s, open, close);
}

Rather than adding the ", " right off the bat, it is a little easier to deal with it later. Also, note that s.substring(end + close.length(), s.length()) is the same as s.substring(end + close.length()). I also find it neater to check whether s.indexOf(...) == -1 rather than checking for >= 0.

The real problem lies in the way you treat filtered. First of all, you need to declare filtered as type String. Next, since you are doing recursion, you shouldn't concatenate to filtered. That would make the line where we first see filtered: String filtered = s.substring(start + open.length(), end) + ", ";. If you fix that line and return it joined with the result of the recursive call, as in the version above, your solution works.

Iterative:

public static String filter(String str, String open, String close){
    int open_length = open.length();
    int close_length = close.length();

    StringBuilder s = new StringBuilder(str);
    StringBuilder filtered = new StringBuilder();

    for (int start = s.indexOf(open), end = s.indexOf(close, start + 1); start != -1; 
        start = s.indexOf(open), end = s.indexOf(close, start + 1)){
        filtered.append(s.substring(start + open_length, end)).append(", ");
        s.delete(0, end + close_length);
    }

    if (filtered.length() == 0){
        return ""; //no matches, nothing to trim
    }
    return filtered.substring(0, filtered.length() - 2); //trailing ", "
}

This iterative method makes use of StringBuilder, but the same can be done without it. It creates two StringBuilders: an empty one, and one that holds the value of the original String. In the for loop:

  • int start = s.indexOf(open), end = s.indexOf(close, start + 1) initializes the two indices
  • start != -1 ends the loop once s no longer contains open
  • start = s.indexOf(open), end = s.indexOf(close, start + 1) recomputes the indices after each iteration of the loop

The inside of the loop appends the matching substring to filtered and deletes everything up to and including the closing delimiter from the other StringBuilder.
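
For what it's worth, here is one way the version without StringBuilder might look. It is a sketch rather than a drop-in replacement: the class name IndexScan and the local variable names are made up. Instead of deleting from a copy, it keeps a from index into the original String and searches for close starting right after the opening delimiter.

```java
import java.util.ArrayList;
import java.util.List;

class IndexScan {
    static String filter(String s, String open, String close) {
        List<String> parts = new ArrayList<>();
        int from = 0; // where the next search for open begins
        while (true) {
            int start = s.indexOf(open, from);
            if (start == -1) {
                break; // no more opening delimiters
            }
            int contentStart = start + open.length();
            int end = s.indexOf(close, contentStart);
            if (end == -1) {
                break; // opening delimiter without a matching close
            }
            parts.add(s.substring(contentStart, end));
            from = end + close.length();
        }
        // Joining at the end means there is no trailing ", " to trim.
        return String.join(", ", parts);
    }

    public static void main(String[] args) {
        System.out.println(filter("x '''[a]''' y '''[b]'''", "'''[", "]'''"));
        System.out.println(filter("'''a''' and '''b'''", "'''", "'''"));
    }
}
```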

Upvotes: 2
