user1103205

Reputation: 185

How can I parse this using StringUtils or a regular expression?

I'm currently facing an issue where

<a href="<a href="http://www.freeformatter.com/xml-formatter.html#ad-output" target="_blank">http://www.freeformatter.com/xml-formatter.html#ad-output</a>">Links</a>

is being returned from a service I am using. As you can see, this is NOT valid HTML. Does anyone know of any tools or regular expressions that can help me remove the inner tag, so that it becomes this:

<a href="http://www.freeformatter.com/xml-formatter.html#ad-output">Links</a>

EDIT: The service does not always return the freeformatter.com website. It could return ANY website.

Upvotes: 1

Views: 486

Answers (4)

Braj

Reputation: 46841

Solution 1

Simply use the regex grouping feature: parentheses () capture a group, which you can retrieve with the Matcher.group() method.

Find all the occurrences of text between > and < and combine them as needed.

Here is the regex pattern: >([^\">].*?)< (the \" is only the Java string escape for a double quote). Have a look at the demo on debuggex and regex101.

Pattern description:

.       Any character (may or may not match line terminators)
[^abc]  Any character except a, b, or c (negation)
X*?     X, zero or more times (Reluctant quantifiers)
(X)     X, as a capturing group

Sample code:

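import java.util.regex.Matcher;
import java.util.regex.Pattern;

String string = "<a href=\"<a href=\"http://www.freeformatter.com/xml-formatter.html#ad-output\" target=\"_blank\">http://www.freeformatter.com/xml-formatter.html#ad-output</a>\">Links</a>";

// Capture the text between '>' and '<' that does not start with '"' or '>'
Pattern p = Pattern.compile(">([^\">].*?)<");
Matcher m = p.matcher(string);

// Print group 1 of every match
while (m.find()) {
    System.out.println(m.group(1));
}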

output:

http://www.freeformatter.com/xml-formatter.html#ad-output
Links
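
The "combine it as per your need" step is not shown above, so here is a minimal sketch of one way to do it. It assumes the input always yields exactly two matches (the URL first, then the link text); the class name CombineGroups and the method rebuild are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CombineGroups {
    // Same pattern as in the sample code: text between '>' and '<'
    private static final Pattern BETWEEN_TAGS = Pattern.compile(">([^\">].*?)<");

    // Hypothetical helper: rebuilds a single anchor from the two captured pieces
    static String rebuild(String broken) {
        List<String> parts = new ArrayList<>();
        Matcher m = BETWEEN_TAGS.matcher(broken);
        while (m.find()) {
            parts.add(m.group(1));
        }
        // Assumption: exactly two matches, URL first and link text second.
        // Anything else is returned unchanged rather than guessed at.
        if (parts.size() != 2) {
            return broken;
        }
        return "<a href=\"" + parts.get(0) + "\">" + parts.get(1) + "</a>";
    }

    public static void main(String[] args) {
        String broken = "<a href=\"<a href=\"http://www.freeformatter.com/xml-formatter.html#ad-output\" target=\"_blank\">http://www.freeformatter.com/xml-formatter.html#ad-output</a>\">Links</a>";
        System.out.println(rebuild(broken));
    }
}
```

For the string from the question this prints <a href="http://www.freeformatter.com/xml-formatter.html#ad-output">Links</a>. An anchor that is already valid produces only one match and is returned as it came in.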

Solution 2

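Try the String#replaceAll() method with the regex pattern (</a>)[^$]|([^^]<a(.*?)>).

The effect is to replace every </a> that is not at the end, and every <a ...> that is not at the beginning, with a double quote. Note that inside a character class $ and a non-leading ^ are literal characters, so [^$] and [^^] simply require one more character after </a> or before <a. That extra character (here the stray double quote) is consumed by the match, which is why the replacement puts a double quote back.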

Find demo on regex101 and debuggex

Sample code:

String string = "<a href=\"<a href=\"http://www.freeformatter.com/xml-formatter.html#ad-output\" target=\"_blank\">http://www.freeformatter.com/xml-formatter.html#ad-output</a>\">Links</a>";

System.out.println(string.replaceAll("(</a>)[^$]|([^^]<a(.*?)>)", "\""));

output:

<a href="http://www.freeformatter.com/xml-formatter.html#ad-output">Links</a>

Upvotes: 0

l'L'l

Reputation: 47169

If the URL or the content within the tags changes, you'll probably want a more general pattern:

(<a\\shref=\"\\w.+\")\\s.+>\"(.+</a>)

This essentially captures the portions of the string you want into two groups, which can then be reassembled into one string. Here's a working example:

http://ideone.com/TbOvVa
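
In case the link goes stale, here is a rough sketch of the reassembly using the pattern quoted above. It is not a copy of the linked code; the class name TwoGroupFix and the method fix are invented for illustration, and it assumes the whole input sits on one line (the . in the pattern does not match line breaks by default).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoGroupFix {
    // Pattern from this answer, written as a Java string literal.
    // Group 1: <a href="URL"   Group 2: >link text</a>
    private static final Pattern NESTED =
            Pattern.compile("(<a\\shref=\"\\w.+\")\\s.+>\"(.+</a>)");

    // Hypothetical helper: joins the two groups, or returns the input if nothing matches
    static String fix(String broken) {
        Matcher m = NESTED.matcher(broken);
        if (m.find()) {
            return m.group(1) + m.group(2);
        }
        return broken;
    }

    public static void main(String[] args) {
        String broken = "<a href=\"<a href=\"http://www.freeformatter.com/xml-formatter.html#ad-output\" target=\"_blank\">http://www.freeformatter.com/xml-formatter.html#ad-output</a>\">Links</a>";
        System.out.println(fix(broken));
    }
}
```

Because \w must follow the opening quote, the match starts at the inner <a href=", so group 1 already holds the clean opening tag. For the string from the question the result is <a href="http://www.freeformatter.com/xml-formatter.html#ad-output">Links</a>.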

Upvotes: 1

a_river_in_canada

Reputation: 299

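Grab the leading <a href=" with .substring(0, 9) (the prefix is nine characters long, including the quote), then call .split("\">") and use the resulting array at index 1, which holds the URL followed by </a>. Don't pass a limit of 1 to split: that returns the whole string as a single element, so there would be no index 1.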
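
A possible way to flesh out this substring/split idea is sketched below. It assumes the broken markup always has the shape shown in the question (outer opening tag, inner anchor, link text), and the names SplitFix and fix are hypothetical.

```java
class SplitFix {
    private static final String PREFIX = "<a href=\""; // nine characters
    private static final String INNER_CLOSE = "</a>";

    // Hypothetical helper built on substring and split only
    static String fix(String broken) {
        if (!broken.startsWith(PREFIX)) {
            return broken;
        }
        String prefix = broken.substring(0, 9);
        // Splitting on "> gives three pieces for the broken input:
        // [0] everything up to the inner tag's closing quote
        // [1] the URL followed by </a>
        // [2] the link text followed by </a>
        String[] parts = broken.split("\">");
        if (parts.length != 3 || !parts[1].endsWith(INNER_CLOSE)) {
            return broken; // not the shape assumed here, leave it alone
        }
        String url = parts[1].substring(0, parts[1].length() - INNER_CLOSE.length());
        return prefix + url + "\">" + parts[2];
    }

    public static void main(String[] args) {
        String broken = "<a href=\"<a href=\"http://www.freeformatter.com/xml-formatter.html#ad-output\" target=\"_blank\">http://www.freeformatter.com/xml-formatter.html#ad-output</a>\">Links</a>";
        System.out.println(fix(broken));
    }
}
```

This avoids hard-coding the URL, since the URL is taken from whatever sits between the two "> separators. Input that does not split into exactly three pieces is returned untouched.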

Upvotes: 0

La-comadreja

Reputation: 5755

In Java:

String s = "<a href=\"<a href=\"http://www.freeformatter.com/xml-formatter.html#ad-output\" target=\"_blank\">http://www.freeformatter.com/xml-formatter.html#ad-output</a>\">Links</a>";

(You'll need to save it as a String somehow in your program)

Then:

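// Remove both opening <a href=" fragments
s = s.replace("<a href=\"", "");
// Cut out the duplicated URL together with the inner closing tag
String[] pcs = s.split("http://www.freeformatter.com/xml-formatter.html#ad-output</a>\">");
s = pcs[0] + pcs[1];
// Drop the leftover target attribute
s = s.replace(" target=\"_blank\"", "");
// Put the outer opening tag back in front of the URL
s = "<a href=\"" + s;

You would have the right href after all this processing. Note that the split string hard-codes the URL, so it has to be adapted if the service returns a different link.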

Upvotes: 0
