Reputation: 1311
Hi, I'm trying to find a way to remove the tags from the results returned by the Google Feed API. A typical result looks like this:
Breaking \u003cb\u003eNews\u003c/b\u003e Updates
How can we remove these characters? I'm not sure whether a regex is the right tool for this. Does anyone have an idea how to remove them? Google does not supply an option to strip tags from the results in Java.
Upvotes: 2
Views: 2389
Reputation: 40683
This is HTML. \u003cb\u003e translates to <b>.
You'll want to use an HTML parser as HTML is not fully parse-able by a regular expression.
With a library like Jsoup you could do this as follows:
String data = Jsoup.parse(html).body().text();
This will get you "Breaking News Updates".
Upvotes: 0
Reputation: 213213
You can use the regex below:
String str = "Breaking \u003cb\u003eNews\u003c/b\u003e Updates";
str = str.replaceAll("\\<(.*)?\\>(.*)\\</\\1\\>", "$2");
System.out.println(str);
Output:
Breaking News Updates
\\<(.*)?\\> matches the opening tag, <b>.
\\</\\1\\> matches the corresponding closing tag, </b>.
\\1 is a backreference to whatever the first group captured, so only a correctly matched pair of tags is removed. For example, in <b>news <update></b> the unpaired <update> will not be removed.
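Because the pattern above is greedy and removes only one matched pair, it can leave tags behind when the text contains several elements. As a rough alternative, here is a minimal sketch that simply drops anything that looks like an opening or closing tag. The class name TagStripper and the method stripTags are made up for illustration, and this is still a regex heuristic rather than real HTML parsing, so unpaired tags such as <update> are removed too.

```java
import java.util.regex.Pattern;

class TagStripper {
    // Anything shaped like <name ...> or </name ...>. The letter required
    // after the bracket keeps plain text such as "1 < 2" untouched.
    private static final Pattern TAG = Pattern.compile("</?[a-zA-Z][^>]*>");

    // Removes every tag-like substring and leaves the text between tags alone.
    static String stripTags(String input) {
        return TAG.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // The Java compiler turns these Unicode escapes into < and >,
        // so at runtime the string already contains <b>News</b>.
        String str = "Breaking \u003cb\u003eNews\u003c/b\u003e Updates";
        System.out.println(stripTags(str)); // Breaking News Updates
    }
}
```

Unlike the backreference version, this does not check that tags are paired, which is usually what you want when the goal is plain text.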
Upvotes: 0
Reputation: 31
I pull those routinely with
String.replaceAll("\\p{Cntrl}","")
Upvotes: 1
Reputation: 466
The best solution would be to use JSON to convert the data.
JSON.parse(JSON.stringify({a : '<put your string here>'}));
This works because the data you get from the Google API is JSON, and a JSON parser decodes escapes such as \u003c back into the characters they stand for.
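Since the question is about Java, here is a minimal sketch of the decoding step a JSON parser performs on these escapes, assuming the raw response still contains the literal six-character sequences. The names UnicodeEscapes and decode are hypothetical, and the sketch ignores other JSON escapes (including escaped backslashes), so a real JSON library is the safer choice in practice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeEscapes {
    // A backslash, the letter u, then exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Replaces each escape sequence with the character it encodes.
    static String decode(String raw) {
        Matcher m = ESCAPE.matcher(raw);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement guards against decoded '$' or backslash characters.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Doubled backslashes keep the escapes literal, as in a raw HTTP response.
        String raw = "Breaking \\u003cb\\u003eNews\\u003c/b\\u003e Updates";
        System.out.println(decode(raw)); // Breaking <b>News</b> Updates
    }
}
```

The decoded string still contains the <b> tags, so it needs a tag-stripping step afterwards if plain text is the goal.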
Upvotes: 0