Reputation: 63
I am new to regular expressions, so any help with the following would be much appreciated.
I have a Java string that contains XML content. I want to remove all the image tags that occur inside two types of parent tags (<Comment> and <Link>).
Example:
String input = "<Comment> 1 2<img>This should be removed</img> 4 </Comment><Link>5 <img>This should be removed</img> 6</Link> <Comment> 7 <img>This should be removed</img> 8 </Comment>";
Required output = "<Comment> 1 2 4 </Comment><Link>5 6</Link> <Comment> 7 8 </Comment>";
The code below works and correctly strips image tags inside every <Comment> tag. I am stuck on checking against both tags together, i.e. <Comment> and <Link>. Please ignore the replacement logic inside the while loop, as I have yet to update it. I'm stuck at the first line, i.e. passing multiple patterns and identifying groups.
Pattern pattern = Pattern.compile("<comments>(.*?)</comments>");
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
    String comment = matcher.group(1);
    String replace = "<comments>" + comment + "</comments>";
    Document document = Jsoup.parse(replace, "", Parser.xmlParser());
    String cleanPdfXml = Jsoup.clean(document.select("comments").text(), Whitelist.relaxed());
    String replacedTo = StringEscapeUtils.escapeXml(cleanPdfXml.replace("\n", ""));
    replacedTo = "<comments>" + replacedTo + "</comments>";
    input = input.replace(replace, replacedTo);
}
Upvotes: 0
Views: 173
Reputation: 2245
I am not sure why you need Jsoup; maybe you have another purpose for it. This can, however, be solved purely with regex. Here's a way. I haven't tested it with nested tags, though, e.g. a Link tag within a Comment tag. We may need to adapt the logic for that.
A second pattern is used here to select and remove the img tag.
// Group 1 captures the parent tag name; \1 requires the same name in the closing tag.
Pattern pattern = Pattern.compile("<(Comment|Link)>(.*?)</\\1>", Pattern.CASE_INSENSITIVE);
// Lazy quantifier (.*?) so each <img>...</img> pair is removed separately;
// a greedy .* would also swallow any text between two images in the same parent.
Pattern imgPattern = Pattern.compile("<img>.*?</img>", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
    String tag = matcher.group(1);
    String text = matcher.group(2);
    System.out.println("Found: " + text);
    text = imgPattern.matcher(text).replaceAll("");
    String newText = "<" + tag + ">" + text + "</" + tag + ">";
    System.out.println(newText);
}
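The loop above only prints each cleaned element; it does not rebuild the full string. One possible way to do that, sketched here under a few assumptions, is Matcher.appendReplacement together with appendTail. The class name ImgStripper and the method name stripImages are hypothetical, and Pattern.DOTALL is added on the assumption that the XML may contain line breaks. Like the regex above, this does not handle a tag nested inside a tag of the same name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ImgStripper {
    // Matches a <Comment> or <Link> element; \1 ties the closing tag to the opening one.
    private static final Pattern PARENT = Pattern.compile(
            "<(Comment|Link)>(.*?)</\\1>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // Lazy match so every <img>...</img> pair is removed on its own.
    private static final Pattern IMG = Pattern.compile(
            "<img>.*?</img>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Hypothetical helper: removes <img> elements found inside <Comment> or <Link> only.
    static String stripImages(String input) {
        Matcher matcher = PARENT.matcher(input);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            String tag = matcher.group(1);
            String body = IMG.matcher(matcher.group(2)).replaceAll("");
            String replacement = "<" + tag + ">" + body + "</" + tag + ">";
            // quoteReplacement keeps any '$' or '\' in the body from being read as group references.
            matcher.appendReplacement(result, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(result); // copy whatever follows the last match
        return result.toString();
    }

    public static void main(String[] args) {
        String input = "<Comment> 1 2<img>This should be removed</img> 4 </Comment>"
                + "<Link>5 <img>This should be removed</img> 6</Link>";
        System.out.println(stripImages(input));
    }
}
```

Note that the spaces on both sides of a removed image are kept, so "5 <img>...</img> 6" becomes "5  6" with two spaces; collapsing that whitespace would need an extra step.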
Upvotes: 1
Reputation: 107
You can use the following approach, which removes every <img> element in the string regardless of its parent tag:
String inputString = "<Comment> 1 2<img>This should be removed</img> 4 </Comment><Link>5 <img>This should be removed</img> " +
        "6</Link> <Comment> 7 <img>This should be removed</img> 8 </Comment>";
String outputString = inputString.replaceAll("(?s)<img>.*?</img>", "");
System.out.println(outputString);
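A small sketch of what this one-liner does when an image sits outside <Comment> and <Link> may help when deciding whether it fits. The <Title> element, the class name GlobalReplaceDemo and the method name stripAllImages are made up for illustration and are not part of the question.

```java
final class GlobalReplaceDemo {
    // Hypothetical helper wrapping the one-liner: strips every <img>...</img> pair.
    // (?s) turns on DOTALL so an image body may span several lines.
    static String stripAllImages(String xml) {
        return xml.replaceAll("(?s)<img>.*?</img>", "");
    }

    public static void main(String[] args) {
        // <Title> is an invented element that is neither <Comment> nor <Link>.
        String input = "<Title><img>logo</img></Title><Comment>a<img>x</img>b</Comment>";
        // The image inside <Title> is removed as well.
        System.out.println(stripAllImages(input)); // prints <Title></Title><Comment>ab</Comment>
    }
}
```

If images only ever appear inside <Comment> and <Link>, this is the simplest option; otherwise the parent tag has to be matched first.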
Upvotes: 1