Reputation: 13334
I am looking for a Java solution to replace line breaks with <br/> tags in all multi-line text in a given HTML string that is not enclosed in any tags (i.e. text that is a direct child of an imaginary root).
The source data is HTML-formatted text created via a front-end HTML editor (like TinyMCE), so it is an arbitrary HTML fragment: part of a non-existent <body>.
The following:
text11
text 21<p>tagged text1
tagged text2</p>
text 2
Should become:
text11<br/>text 21<p>tagged text1
tagged text2</p><br/>text 2
The following, however, should not be impacted at all:
<div>text11
text 21<p>tagged text1
tagged text2</p>
text 2</div>
I was thinking about something like this (not working):
private static String ReplaceLfWithBr(String source) {
    // text - combination of words and line breaks
    // should not be preceded by <tag> or followed by </tag>
    final String regex = "((?!<.+>)[\\w(\\r?\\n)]+(?!<\\s*/.+>))";
    Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
    Matcher matcher = pattern.matcher(source);
    StringBuffer sb = new StringBuffer(source.length());
    while (matcher.find()) {
        matcher.appendReplacement(sb, "<br/>");
    }
    matcher.appendTail(sb);
    return sb.toString();
}
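A quick check of why this attempt misbehaves, as far as I can tell: inside `[...]` the parentheses and `?` are literal characters rather than a group, and the match consumes the words themselves, so the text is swapped for `<br/>` along with the line breaks. The sketch below reuses the same expression; `RegexAttemptCheck` and `apply` are made-up names, and `replaceAll` stands in for the `appendReplacement` loop.

```java
import java.util.regex.Pattern;

class RegexAttemptCheck {
    // Same expression as in the attempt above, copied unchanged.
    static final Pattern ATTEMPT =
            Pattern.compile("((?!<.+>)[\\w(\\r?\\n)]+(?!<\\s*/.+>))", Pattern.MULTILINE);

    // Replaces every match with <br/>, like the find()/appendReplacement() loop does.
    static String apply(String source) {
        return ATTEMPT.matcher(source).replaceAll("<br/>");
    }

    public static void main(String[] args) {
        // Words and the line break form one match, so all of it is replaced.
        System.out.println(apply("ab\ncd"));
        // The space is not in the character class, so each word is replaced separately.
        System.out.println(apply("a b"));
        // Parentheses and '?' are plain members of the character class.
        System.out.println(apply("(?)"));
    }
}
```

So the pattern never isolates a line break: it matches whole runs of word characters, which is why a parser-based or depth-tracking approach looks like the safer route.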
Upvotes: 3
Views: 1772
Reputation: 13334
This is how I got it to work (very close to the accepted answer):
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.parser.Parser;
public class HtmlText {
    public static void main(String[] args) {
        String test = "text1\ntext2<tag>tagged text \n tagged continue</tag> \ntext3";
        System.out.println("-----=============----------");
        System.out.println(test);
        System.out.println("-----=============----------");
        System.out.println(ReplaceWithSoup(test));
    }

    private static String ReplaceWithSoup(String source) {
        StringBuilder sbResult = new StringBuilder();
        Document doc = Jsoup.parseBodyFragment(source);
        Element body = doc.body();
        for (Node node : body.childNodes()) {
            if (node instanceof TextNode) {
                TextNode tn = (TextNode) node;
                tn.text(tn.getWholeText().replace("\n", "<br/>"));
            }
            sbResult.append(Parser.unescapeEntities(node.toString(), true));
        }
        return sbResult.toString();
    }
}
Upvotes: 1
Reputation: 2309
So it's a little more complicated than what I said in my comment, but I think something like this might work:
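import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class TopLevelBreaks {
    public static void main(String[] args) {
        String text = "text11\n"
                + "text 21<p>tagged text1\n"
                + "tagged text2</p>\n"
                + "text 2";
        // parseBodyFragment() already places the fragment inside a <body>,
        // so there is no need to wrap it manually.
        Document doc = Jsoup.parseBodyFragment(text);
        // Keep whitespace inside nested elements as it was written.
        doc.outputSettings().prettyPrint(false);
        Element body = doc.body();
        StringBuilder sb = new StringBuilder();
        for (Node n : body.childNodes()) {
            if (n instanceof TextNode) {
                // Top-level text: emit it with <br/> in place of line breaks.
                // Setting it via text(...) instead would escape the tag as &lt;br/&gt;.
                // getWholeText() returns decoded text, so entities are not re-escaped here.
                sb.append(((TextNode) n).getWholeText().replace("\n", "<br/>"));
            } else {
                // Elements (and anything nested in them) are copied through untouched.
                sb.append(n.outerHtml());
            }
        }
        System.out.println(sb);
    }
}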
Basically, get all the Nodes, do a replace on the TextNodes, and put them back together. I'm not 100% sure this will work as-is, since I am not able to test it at the moment, but hopefully it gets the idea across.
What I said in my comment doesn't work because you have to be able to put the child elements back in place between the text. You can't do that if you just use ownText().
I haven't used Jsoup much myself, so improvements are welcome if anyone has any.
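If pulling in a parser is not an option, the same "top-level text only" rule can be roughed out in plain Java by tracking tag depth. This is only a sketch and not what the Jsoup code does: the class name `TopLevelLineBreaks`, its `replace` method, and the list of void elements are my own assumptions, and it presumes well-formed markup with no comments, scripts, or stray `<` characters in text or attribute values.

```java
import java.util.Locale;
import java.util.Set;

class TopLevelLineBreaks {
    // Void elements have no closing tag, so they must not change the depth.
    private static final Set<String> VOID = Set.of(
            "area", "base", "br", "col", "embed", "hr", "img",
            "input", "link", "meta", "source", "track", "wbr");

    // Replaces line breaks with <br/> only where no element is currently open.
    static String replace(String html) {
        StringBuilder out = new StringBuilder(html.length() + 16);
        int depth = 0;
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            int close = (c == '<') ? html.indexOf('>', i) : -1;
            if (close > i) {
                // Copy the whole tag through and update the nesting depth.
                String tag = html.substring(i, close + 1);
                out.append(tag);
                depth = Math.max(0, depth + depthChange(tag));
                i = close + 1;
            } else if (depth == 0 && c == '\r'
                    && i + 1 < html.length() && html.charAt(i + 1) == '\n') {
                out.append("<br/>"); // Windows-style break at top level
                i += 2;
            } else if (depth == 0 && c == '\n') {
                out.append("<br/>"); // Unix-style break at top level
                i++;
            } else {
                out.append(c); // nested text or any other character: unchanged
                i++;
            }
        }
        return out.toString();
    }

    // +1 for an opening tag, -1 for a closing tag, 0 for void/self-closing/declarations.
    private static int depthChange(String tag) {
        if (tag.startsWith("</")) return -1;
        if (tag.endsWith("/>") || tag.startsWith("<!")) return 0;
        String name = tag.substring(1, tag.length() - 1).trim()
                .split("[\\s/]", 2)[0].toLowerCase(Locale.ROOT);
        return VOID.contains(name) ? 0 : 1;
    }
}
```

Called on the sample from the question, `TopLevelLineBreaks.replace(...)` turns the two top-level line breaks into `<br/>` and leaves the one inside `<p>` alone, while the same text wrapped in a `<div>` comes back unchanged. For anything an HTML editor might emit beyond simple markup, a real parser is still the more robust choice.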
Upvotes: 1