Viraj Udeg

Reputation: 63

multiple regex patterns to replace multiple occurrences of image tags

I am new to regex. Any help with the problem below would be much appreciated.

I have a Java string that contains XML content. I want to remove all the image tags that occur inside two types of parent tags (<Comment> and <Link>).

Example:

String input = "<Comment> 1 2<img>This should be removed</img> 4 </Comment><Link>5 <img>This should be removed</img> 6</Link> <Comment> 7 <img>This should be removed</img> 8 </Comment>";

Required output = "<Comment> 1 2 4 </Comment><Link>5 6</Link> <Comment> 7 8 </Comment>";

I have working code below that replaces all occurrences of image tags within every <Comment> tag. I am stuck on checking against both tags together, i.e. <Comment> and <Link>. Please ignore the replacement logic inside the while loop, as I have yet to update it. I'm stuck at the first line, i.e. passing multiple patterns and identifying groups.

Pattern pattern = Pattern.compile("<comments>(.*?)</comments>");
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
    String comment = matcher.group(1);
    String replace = "<comments>" + comment + "</comments>";
    Document document = Jsoup.parse(replace, "", Parser.xmlParser());
    String cleanPdfXml = Jsoup.clean(document.select("comments").text(), Whitelist.relaxed());
    String replacedTo = StringEscapeUtils.escapeXml(cleanPdfXml.replace("\n", ""));
    replacedTo = "<comments>" + replacedTo + "</comments>";

    input = input.replace(replace, replacedTo);
}

Upvotes: 0

Views: 173

Answers (2)

Sree Kumar

Reputation: 2245

I am not sure why you need Jsoup; maybe you have another purpose for it. This can, however, be solved purely with regex. Here's a way. I haven't tested it with nested tags, e.g. a Link tag within a Comment tag, so the logic may need adapting for that case.

A second pattern is used to select and remove the img tags inside each match.

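import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The backreference \\1 requires the closing tag to repeat the name captured in group 1.
Pattern pattern = Pattern.compile( "<(Comment|Link)>(.*?)</\\1>", Pattern.CASE_INSENSITIVE );
// Reluctant .*? stops at the first </img>; a greedy .* would also delete text between two images.
Pattern imgPattern = Pattern.compile( "<img>.*?</img>", Pattern.CASE_INSENSITIVE );
Matcher matcher = pattern.matcher( input );
while (matcher.find()) {
    String tag = matcher.group(1);   // "Comment" or "Link", as written in the opening tag
    String text = matcher.group(2);  // content between the opening and closing tags

    System.out.println( "Found: " + text );

    // Strip the images from this element's content, then rebuild the element.
    text = imgPattern.matcher( text ).replaceAll( "" );
    String newText = "<" + tag + ">" + text + "</" + tag + ">";

    System.out.println( newText );
}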
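
The loop above only prints each cleaned element. As a minimal sketch of how the cleaned elements could be written back into the full string, the version below uses Matcher.appendReplacement and appendTail. The class name ScopedImgRemover and the method name removeImages are made up for illustration, and adding DOTALL is my assumption so that content spanning several lines still matches.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative, not from the question.
class ScopedImgRemover {
    // A <Comment> or <Link> element; \1 requires the same tag name to close it.
    private static final Pattern PARENT = Pattern.compile(
            "<(Comment|Link)>(.*?)</\\1>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // Reluctant .*? stops at the first </img>, so each image is removed on its own.
    private static final Pattern IMG = Pattern.compile(
            "<img>.*?</img>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static String removeImages(String input) {
        Matcher matcher = PARENT.matcher(input);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            String tag = matcher.group(1);
            String body = IMG.matcher(matcher.group(2)).replaceAll("");
            String rebuilt = "<" + tag + ">" + body + "</" + tag + ">";
            // quoteReplacement keeps any '$' or '\' in the content literal.
            matcher.appendReplacement(result, Matcher.quoteReplacement(rebuilt));
        }
        // Copy whatever follows the last match unchanged.
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        String input = "<Comment> 1 2<img>This should be removed</img> 4 </Comment>"
                + "<Link>5 <img>This should be removed</img> 6</Link>"
                + " <Comment> 7 <img>This should be removed</img> 8 </Comment>";
        System.out.println(removeImages(input));
    }
}
```

Note that the spaces on both sides of a removed image are kept, so the sample input yields "5  6" with two spaces. If the single-spaced output from the question is required, the whitespace would have to be collapsed in a separate step.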

Upvotes: 1

Magdalena Fairfax

Reputation: 107

You can use the following approach. Note that it removes every <img> element in the string, whatever its parent tag is:

String inputString = "<Comment> 1 2<img>This should be removed</img> 4 </Comment><Link>5 <img>This should be removed</img> " +
        "6</Link> <Comment> 7 <img>This should be removed</img> 8 </Comment>";
// (?s) lets '.' match line breaks; the reluctant .*? stops at the first </img>.
String outputString = inputString.replaceAll("(?s)<img>.*?</img>", "");

System.out.println(outputString);
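
To see how the scope of this one-liner differs from a replacement restricted to <Comment> and <Link>, here is a small comparison sketch. The <Title> element, the class name GlobalVsScoped and both method names are hypothetical, and the scoped variant assumes Java 9 or later for the Matcher.replaceAll overload that takes a function.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical comparison class; the names are illustrative only.
class GlobalVsScoped {
    // A whole <Comment> or <Link> element, including its tags.
    private static final Pattern PARENT =
            Pattern.compile("(?s)<(Comment|Link)>.*?</\\1>");

    // Removes every <img> element, wherever it appears.
    static String removeEverywhere(String xml) {
        return xml.replaceAll("(?s)<img>.*?</img>", "");
    }

    // Removes <img> elements only inside <Comment> or <Link> (assumes Java 9+).
    static String removeInsideParents(String xml) {
        return PARENT.matcher(xml).replaceAll(
                // quoteReplacement keeps '$' and '\' in the element literal.
                m -> Matcher.quoteReplacement(removeEverywhere(m.group())));
    }

    public static void main(String[] args) {
        // <Title> is a made-up element standing in for any other parent tag.
        String xml = "<Title><img>logo</img></Title><Comment>a<img>x</img>b</Comment>";
        System.out.println(removeEverywhere(xml));
        System.out.println(removeInsideParents(xml));
    }
}
```

With this input, the first call also empties <Title>, while the second leaves the image inside <Title> untouched. If the string never contains images outside the two parent tags, the one-liner is enough.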

Upvotes: 1
