Reputation: 185
I'm currently facing an issue where
<a href="<a href="http://www.freeformatter.com/xml-formatter.html#ad-output" target="_blank">http://www.freeformatter.com/xml-formatter.html#ad-output</a>">Links</a>
is being returned from a service I am using. As you can see, this is NOT valid HTML. Does anyone know of any tools or regular expressions that can help me remove the inner tag, so that it becomes this:
<a href="http://www.freeformatter.com/xml-formatter.html#ad-output">Links</a>
EDIT: The service does not always return the freeformatter.com website. It could return ANY website.
Upvotes: 1
Views: 486
Reputation: 46841
Simply use the grouping feature of regex, which captures whatever is enclosed in parentheses (). Get the matched group using the Matcher.group() method.
Find all the occurrences between > and < and combine them as needed.
Here is the regex pattern: >([^\">].*?)<
Have a look at the demos on debuggex and regex101.
Pattern description:
. Any character (may or may not match line terminators)
[^abc] Any character except a, b, or c (negation)
X*? X, zero or more times (Reluctant quantifiers)
(X) X, as a capturing group
Sample code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String string = "<a href=\"<a href=\"http://www.freeformatter.com/xml-formatter.html#ad-output\" target=\"_blank\">http://www.freeformatter.com/xml-formatter.html#ad-output</a>\">Links</a>";
// Capture the text between '>' and '<' unless it starts with '"' or '>'.
Pattern p = Pattern.compile(">([^\">].*?)<");
Matcher m = p.matcher(string);
while (m.find()) {
    System.out.println(m.group(1));
}
output:
http://www.freeformatter.com/xml-formatter.html#ad-output
Links
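Try the String#replaceAll() method with the regex pattern (</a>)[^$]|([^^]<a(.*?)>)
The pattern replaces every </a> that is followed by another character (so not the one at the very end) and every <a ...> that is preceded by another character (so not the one at the very beginning), together with that neighbouring character, with a double quote. Note that inside a character class $ and ^ are literals rather than anchors, so [^$] and [^^] simply match any character other than $ or ^.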
Find demo on regex101 and debuggex
Sample code:
String string = "<a href=\"<a href=\"http://www.freeformatter.com/xml-formatter.html#ad-output\" target=\"_blank\">http://www.freeformatter.com/xml-formatter.html#ad-output</a>\">Links</a>";
System.out.println(string.replaceAll("(</a>)[^$]|([^^]<a(.*?)>)", "\""));
output:
<a href="http://www.freeformatter.com/xml-formatter.html#ad-output">Links</a>
Upvotes: 0
Reputation: 47169
If the URL or the content within the tags changes, you'll probably want a more generalized pattern:
(<a\\shref=\"\\w.+\")\\s.+>\"(.+</a>)
This captures the portions of the string you want in two groups, which can then be reassembled into one string.
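A minimal sketch of reassembling the two groups, assuming the input always has the nested shape from the question. The class name `NestedAnchorFix` and the helper `fix` are hypothetical names chosen for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NestedAnchorFix {
    // Pattern from this answer: group 1 is the opening tag up to the closing
    // quote of the URL, group 2 is everything from the outer '>' onwards.
    private static final Pattern NESTED =
            Pattern.compile("(<a\\shref=\"\\w.+\")\\s.+>\"(.+</a>)");

    // Hypothetical helper: returns the repaired link, or the input unchanged
    // when the pattern does not match.
    static String fix(String html) {
        Matcher m = NESTED.matcher(html);
        // find() rather than matches(): the match starts at the inner <a,
        // because the outer href=" is followed by '<', not a word character.
        return m.find() ? m.group(1) + m.group(2) : html;
    }

    public static void main(String[] args) {
        String broken = "<a href=\"<a href=\"http://www.freeformatter.com/xml-formatter.html#ad-output\""
                + " target=\"_blank\">http://www.freeformatter.com/xml-formatter.html#ad-output</a>\">Links</a>";
        System.out.println(fix(broken));
    }
}
```

Concatenating the groups is used here instead of replaceAll("$1$2") because the match begins at the inner tag, so a replacement would leave the outer `<a href="` in front of the result.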
Upvotes: 1
Reputation: 299
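Grab the leading <a href=" with .substring(0, 9), then use .split("\">", 2) and take the resulting array element at index 1 (a limit of 1 would return the whole string as a single element). The inner </a> that remains before the outer "> still has to be removed.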
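A rough sketch of this substring-and-split idea, assuming the string always starts with `<a href="` and always contains the nested tag. The class name `SubstringSplitFix` and the helper `fix` are hypothetical, and the extra replaceFirst step is an addition needed to drop the leftover inner `</a>`.

```java
public class SubstringSplitFix {
    // Hypothetical helper; throws if the input contains no "> at all.
    static String fix(String html) {
        // The leading <a href=" is nine characters long.
        String prefix = html.substring(0, 9);
        // A limit of 2 splits only at the first ">, which closes the inner tag.
        String rest = html.split("\">", 2)[1];
        // Drop the inner </a> left just before the outer ">.
        return prefix + rest.replaceFirst("</a>", "");
    }

    public static void main(String[] args) {
        String broken = "<a href=\"<a href=\"http://www.freeformatter.com/xml-formatter.html#ad-output\""
                + " target=\"_blank\">http://www.freeformatter.com/xml-formatter.html#ad-output</a>\">Links</a>";
        System.out.println(fix(broken));
    }
}
```

This avoids regex groups entirely, at the cost of depending on the exact position of the first `">` in the input.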
Upvotes: 0
Reputation: 5755
In Java:
String s = "<a href=\"<a href=\"http://www.freeformatter.com/xml-formatter.html#ad-output\" target=\"_blank\">http://www.freeformatter.com/xml-formatter.html#ad-output</a>\">Links</a>";
(You'll need to save it as a String somehow in your program)
Then:
s = s.replaceFirst("<a href=\"", ""); // remove only the outer opening; replace() would strip both occurrences
String[] pcs = s.split("http://www.freeformatter.com/xml-formatter.html#ad-output</a>\">");
s = pcs[0] + pcs[1];
s = s.replace(" target=\"_blank\"", "");
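After this processing, s holds the corrected link. Note that this approach hardcodes the URL, so it only works when the service returns this particular address.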
Upvotes: 0