Mason
Mason

Reputation: 8846

Remove HTML tags from a String

Is there a good way to remove HTML from a Java string? A simple regex like

replaceAll("\\<.*?>", "") 

will work, but some things like &amp; won't be converted correctly and non-HTML between the two angle brackets will be removed (i.e. the .*? in the regex will disappear).

Upvotes: 493

Views: 586056

Answers (30)

BalusC
BalusC

Reputation: 1109462

Use a HTML parser instead of regex. This is dead simple with Jsoup.

public static String html2text(String html) {
    return Jsoup.parse(html).text();
}

Jsoup also supports removing HTML tags against a customizable whitelist, which is very useful if you want to allow only e.g. <b>, <i> and <u>.

See also:

Upvotes: 656

RealHowTo
RealHowTo

Reputation: 35407

Another way is to use javax.swing.text.html.HTMLEditorKit to extract the text.

import java.io.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;

public class Html2Text extends HTMLEditorKit.ParserCallback {
    StringBuffer s;

    public Html2Text() {
    }

    public void parse(Reader in) throws IOException {
        s = new StringBuffer();
        ParserDelegator delegator = new ParserDelegator();
        // the third parameter is TRUE to ignore charset directive
        delegator.parse(in, this, Boolean.TRUE);
    }

    public void handleText(char[] text, int pos) {
        s.append(text);
    }

    public String getText() {
        return s.toString();
    }

    public static void main(String[] args) {
        try {
            // the HTML to convert
            FileReader in = new FileReader("java-new.html");
            Html2Text parser = new Html2Text();
            parser.parse(in);
            in.close();
            System.out.println(parser.getText());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Upvotes: 31

Ken Goodridge
Ken Goodridge

Reputation: 4011

If you're writing for Android you can do this...

androidx.core.text.HtmlCompat.fromHtml(instruction,HtmlCompat.FROM_HTML_MODE_LEGACY).toString()

Upvotes: 305

Ahmad Nabeel Butt
Ahmad Nabeel Butt

Reputation: 1

You can use this code to remove HTML tags including line breaks.

function remove_html_tags(html) {
    html = html.replace(/<div>/g, "").replace(/<\/div>/g, "<br>");
    html = html.replace(/<br>/g, "$br$");
    html = html.replace(/(?:\r\n|\r|\n)/g, '$br$');
    var tmp = document.createElement("DIV");
    tmp.innerHTML = html;
    html = tmp.textContent || tmp.innerText;
    html = html.replace(/\$br\$/g, "\n");
    return html;
}

Upvotes: -1

Arefe
Arefe

Reputation: 12451

You can use this method to remove the HTML tags from the String,

public static String stripHtmlTags(String html) {

    return html.replaceAll("<.*?>", "");

}

Upvotes: 1

Muneeb Ahmed
Muneeb Ahmed

Reputation: 29

Try this for javascript:

const strippedString = htmlString.replace(/(<([^>]+)>)/gi, "");
console.log(strippedString);

Upvotes: 1

jiamo
jiamo

Reputation: 1456

Sometimes the html string come from xml with such &lt. When using Jsoup we need parse it and then clean it.

Document doc = Jsoup.parse(htmlstrl);
Whitelist wl = Whitelist.none();
String plain = Jsoup.clean(doc.text(), wl);

While only using Jsoup.parse(htmlstrl).text() can't remove tags.

Upvotes: 1

Parker
Parker

Reputation: 7519

I often find that I only need to strip out comments and script elements. This has worked reliably for me for 15 years and can easily be extended to handle any element name in HTML or XML:

// delete all comments
response = response.replaceAll("<!--[^>]*-->", "");
// delete all script elements
response = response.replaceAll("<(script|SCRIPT)[^+]*?>[^>]*?<(/script|SCRIPT)>", "");

Upvotes: 0

Jared Beach
Jared Beach

Reputation: 3163

Worth noting that if you're trying to accomplish this in a Service Stack project, it's already a built-in string extension

using ServiceStack.Text;
// ...
"The <b>quick</b> brown <p> fox </p> jumps over the lazy dog".StripHtml();

Upvotes: 0

Itay Sasson
Itay Sasson

Reputation: 1

I know it is been a while since this question as been asked, but I found another solution, this is what worked for me:

Pattern REMOVE_TAGS = Pattern.compile("<.+?>");
    Source source= new Source(htmlAsString);
 Matcher m = REMOVE_TAGS.matcher(sourceStep.getTextExtractor().toString());
                        String clearedHtml= m.replaceAll("");

Upvotes: 0

Guilherme Oliveira
Guilherme Oliveira

Reputation: 121

classeString.replaceAll("\\<(/?[^\\>]+)\\>", "\\ ").replaceAll("\\s+", " ").trim() 

Upvotes: 1

Anurag SL
Anurag SL

Reputation: 131

You can simply use the Android's default HTML filter

    public String htmlToStringFilter(String textToFilter){

    return Html.fromHtml(textToFilter).toString();

    }

The above method will return the HTML filtered string for your input.

Upvotes: 9

silentsudo
silentsudo

Reputation: 6973

Here is one more variant of how to replace all(HTML Tags | HTML Entities | Empty Space in HTML content)

content.replaceAll("(<.*?>)|(&.*?;)|([ ]{2,})", ""); where content is a String.

Upvotes: 5

Sandeep1699
Sandeep1699

Reputation: 121

This should work -

use this

  text.replaceAll('<.*?>' , " ") -> This will replace all the html tags with a space.

and this

  text.replaceAll('&.*?;' , "")-> this will replace all the tags which starts with "&" and ends with ";" like &nbsp;, &amp;, &gt; etc.

Upvotes: 12

Chris Marasti-Georg
Chris Marasti-Georg

Reputation: 34680

If the user enters <b>hey!</b>, do you want to display <b>hey!</b> or hey!? If the first, escape less-thans, and html-encode ampersands (and optionally quotes) and you're fine. A modification to your code to implement the second option would be:

replaceAll("\\<[^>]*>","")

but you will run into issues if the user enters something malformed, like <bhey!</b>.

You can also check out JTidy which will parse "dirty" html input, and should give you a way to remove the tags, keeping the text.

The problem with trying to strip html is that browsers have very lenient parsers, more lenient than any library you can find will, so even if you do your best to strip all tags (using the replace method above, a DOM library, or JTidy), you will still need to make sure to encode any remaining HTML special characters to keep your output safe.

Upvotes: 100

Damien
Damien

Reputation: 2254

The accepted answer of doing simply Jsoup.parse(html).text() has 2 potential issues (with JSoup 1.7.3):

  • It removes line breaks from the text
  • It converts text &lt;script&gt; into <script>

If you use this to protect against XSS, this is a bit annoying. Here is my best shot at an improved solution, using both JSoup and Apache StringEscapeUtils:

// breaks multi-level of escaping, preventing &amp;lt;script&amp;gt; to be rendered as <script>
String replace = input.replace("&amp;", "");
// decode any encoded html, preventing &lt;script&gt; to be rendered as <script>
String html = StringEscapeUtils.unescapeHtml(replace);
// remove all html tags, but maintain line breaks
String clean = Jsoup.clean(html, "", Whitelist.none(), new Document.OutputSettings().prettyPrint(false));
// decode html again to convert character entities back into text
return StringEscapeUtils.unescapeHtml(clean);

Note that the last step is because I need to use the output as plain text. If you need only HTML output then you should be able to remove it.

And here is a bunch of test cases (input to output):

{"regular string", "regular string"},
{"<a href=\"link\">A link</a>", "A link"},
{"<script src=\"http://evil.url.com\"/>", ""},
{"&lt;script&gt;", ""},
{"&amp;lt;script&amp;gt;", "lt;scriptgt;"}, // best effort
{"\" ' > < \n \\ é å à ü and & preserved", "\" ' > < \n \\ é å à ü and & preserved"}

If you find a way to make it better, please let me know.

Upvotes: 18

IntelliJ Amiya
IntelliJ Amiya

Reputation: 75798

Use Html.fromHtml

HTML Tags are

<a href=”…”> <b>,  <big>, <blockquote>, <br>, <cite>, <dfn>
<div align=”…”>,  <em>, <font size=”…” color=”…” face=”…”>
<h1>,  <h2>, <h3>, <h4>,  <h5>, <h6>
<i>, <p>, <small>
<strike>,  <strong>, <sub>, <sup>, <tt>, <u>

As per Android’s official Documentations any tags in the HTML will display as a generic replacement String which your program can then go through and replace with real strings.

Html.formHtml method takes an Html.TagHandler and an Html.ImageGetter as arguments as well as the text to parse.

Example

String Str_Html=" <p>This is about me text that the user can put into their profile</p> ";

Then

Your_TextView_Obj.setText(Html.fromHtml(Str_Html).toString());

Output

This is about me text that the user can put into their profile

Upvotes: 5

RobMen
RobMen

Reputation: 11

One way to retain new-line info with JSoup is to precede all new line tags with some dummy string, execute JSoup and replace dummy string with "\n".

String html = "<p>Line one</p><p>Line two</p>Line three<br/>etc.";
String NEW_LINE_MARK = "NEWLINESTART1234567890NEWLINEEND";
for (String tag: new String[]{"</p>","<br/>","</h1>","</h2>","</h3>","</h4>","</h5>","</h6>","</li>"}) {
    html = html.replace(tag, NEW_LINE_MARK+tag);
}

String text = Jsoup.parse(html).text();

text = text.replace(NEW_LINE_MARK + " ", "\n\n");
text = text.replace(NEW_LINE_MARK, "\n\n");

Upvotes: 1

Ameen Maheen
Ameen Maheen

Reputation: 2747

On Android, try this:

String result = Html.fromHtml(html).toString();

Upvotes: 21

Satya Prakash
Satya Prakash

Reputation: 7

Remove HTML tags from string. Somewhere we need to parse some string which is received by some responses like Httpresponse from the server.

So we need to parse it.

Here I will show how to remove html tags from string.

    // sample text with tags

    string str = "<html><head>sdfkashf sdf</head><body>sdfasdf</body></html>";



    // regex which match tags

    System.Text.RegularExpressions.Regex rx = new System.Text.RegularExpressions.Regex("<[^>]*>");



    // replace all matches with empty strin

    str = rx.Replace(str, "");



    //now str contains string without html tags

Upvotes: -1

Josh
Josh

Reputation: 199

Also very simple using Jericho, and you can retain some of the formatting (line breaks and links, for example).

    Source htmlSource = new Source(htmlText);
    Segment htmlSeg = new Segment(htmlSource, 0, htmlSource.length());
    Renderer htmlRend = new Renderer(htmlSeg);
    System.out.println(htmlRend.toString());

Upvotes: 19

Stephan
Stephan

Reputation: 43053

Alternatively, one can use HtmlCleaner:

private CharSequence removeHtmlFrom(String html) {
    return new HtmlCleaner().clean(html).getText();
}

Upvotes: 5

Tim Howland
Tim Howland

Reputation: 7970

HTML Escaping is really hard to do right- I'd definitely suggest using library code to do this, as it's a lot more subtle than you'd think. Check out Apache's StringEscapeUtils for a pretty good library for handling this in Java.

Upvotes: 12

surfealokesea
surfealokesea

Reputation: 5126

To get formateed plain html text you can do that:

String BR_ESCAPED = "&lt;br/&gt;";
Element el=Jsoup.parse(html).select("body");
el.select("br").append(BR_ESCAPED);
el.select("p").append(BR_ESCAPED+BR_ESCAPED);
el.select("h1").append(BR_ESCAPED+BR_ESCAPED);
el.select("h2").append(BR_ESCAPED+BR_ESCAPED);
el.select("h3").append(BR_ESCAPED+BR_ESCAPED);
el.select("h4").append(BR_ESCAPED+BR_ESCAPED);
el.select("h5").append(BR_ESCAPED+BR_ESCAPED);
String nodeValue=el.text();
nodeValue=nodeValue.replaceAll(BR_ESCAPED, "<br/>");
nodeValue=nodeValue.replaceAll("(\\s*<br[^>]*>){3,}", "<br/><br/>");

To get formateed plain text change <br/> by \n and change last line by:

nodeValue=nodeValue.replaceAll("(\\s*\n){3,}", "<br/><br/>");

Upvotes: 0

Maksim Sorokin
Maksim Sorokin

Reputation: 2414

One could also use Apache Tika for this purpose. By default it preserves whitespaces from the stripped html, which may be desired in certain situations:

InputStream htmlInputStream = ..
HtmlParser htmlParser = new HtmlParser();
HtmlContentHandler htmlContentHandler = new HtmlContentHandler();
htmlParser.parse(htmlInputStream, htmlContentHandler, new Metadata())
System.out.println(htmlContentHandler.getBodyText().trim())

Upvotes: 2

blackStar
blackStar

Reputation: 339

Here is another way to do it:

public static String removeHTML(String input) {
    int i = 0;
    String[] str = input.split("");

    String s = "";
    boolean inTag = false;

    for (i = input.indexOf("<"); i < input.indexOf(">"); i++) {
        inTag = true;
    }
    if (!inTag) {
        for (i = 0; i < str.length; i++) {
            s = s + str[i];
        }
    }
    return s;
}

Upvotes: 2

Alexander
Alexander

Reputation: 73

My 5 cents:

String[] temp = yourString.split("&amp;");
String tmp = "";
if (temp.length > 1) {

    for (int i = 0; i < temp.length; i++) {
        tmp += temp[i] + "&";
    }
    yourString = tmp.substring(0, tmp.length() - 1);
}

Upvotes: 0

Mike
Mike

Reputation: 51

Here's a lightly more fleshed out update to try to handle some formatting for breaks and lists. I used Amaya's output as a guide.

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Stack;
import java.util.logging.Logger;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class HTML2Text extends HTMLEditorKit.ParserCallback {
    private static final Logger log = Logger
            .getLogger(Logger.GLOBAL_LOGGER_NAME);

    private StringBuffer stringBuffer;

    private Stack<IndexType> indentStack;

    public static class IndexType {
        public String type;
        public int counter; // used for ordered lists

        public IndexType(String type) {
            this.type = type;
            counter = 0;
        }
    }

    public HTML2Text() {
        stringBuffer = new StringBuffer();
        indentStack = new Stack<IndexType>();
    }

    public static String convert(String html) {
        HTML2Text parser = new HTML2Text();
        Reader in = new StringReader(html);
        try {
            // the HTML to convert
            parser.parse(in);
        } catch (Exception e) {
            log.severe(e.getMessage());
        } finally {
            try {
                in.close();
            } catch (IOException ioe) {
                // this should never happen
            }
        }
        return parser.getText();
    }

    public void parse(Reader in) throws IOException {
        ParserDelegator delegator = new ParserDelegator();
        // the third parameter is TRUE to ignore charset directive
        delegator.parse(in, this, Boolean.TRUE);
    }

    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        log.info("StartTag:" + t.toString());
        if (t.toString().equals("p")) {
            if (stringBuffer.length() > 0
                    && !stringBuffer.substring(stringBuffer.length() - 1)
                            .equals("\n")) {
                newLine();
            }
            newLine();
        } else if (t.toString().equals("ol")) {
            indentStack.push(new IndexType("ol"));
            newLine();
        } else if (t.toString().equals("ul")) {
            indentStack.push(new IndexType("ul"));
            newLine();
        } else if (t.toString().equals("li")) {
            IndexType parent = indentStack.peek();
            if (parent.type.equals("ol")) {
                String numberString = "" + (++parent.counter) + ".";
                stringBuffer.append(numberString);
                for (int i = 0; i < (4 - numberString.length()); i++) {
                    stringBuffer.append(" ");
                }
            } else {
                stringBuffer.append("*   ");
            }
            indentStack.push(new IndexType("li"));
        } else if (t.toString().equals("dl")) {
            newLine();
        } else if (t.toString().equals("dt")) {
            newLine();
        } else if (t.toString().equals("dd")) {
            indentStack.push(new IndexType("dd"));
            newLine();
        }
    }

    private void newLine() {
        stringBuffer.append("\n");
        for (int i = 0; i < indentStack.size(); i++) {
            stringBuffer.append("    ");
        }
    }

    public void handleEndTag(HTML.Tag t, int pos) {
        log.info("EndTag:" + t.toString());
        if (t.toString().equals("p")) {
            newLine();
        } else if (t.toString().equals("ol")) {
            indentStack.pop();
            ;
            newLine();
        } else if (t.toString().equals("ul")) {
            indentStack.pop();
            ;
            newLine();
        } else if (t.toString().equals("li")) {
            indentStack.pop();
            ;
            newLine();
        } else if (t.toString().equals("dd")) {
            indentStack.pop();
            ;
        }
    }

    public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        log.info("SimpleTag:" + t.toString());
        if (t.toString().equals("br")) {
            newLine();
        }
    }

    public void handleText(char[] text, int pos) {
        log.info("Text:" + new String(text));
        stringBuffer.append(text);
    }

    public String getText() {
        return stringBuffer.toString();
    }

    public static void main(String args[]) {
        String html = "<html><body><p>paragraph at start</p>hello<br />What is happening?<p>this is a<br />mutiline paragraph</p><ol>  <li>This</li>  <li>is</li>  <li>an</li>  <li>ordered</li>  <li>list    <p>with</p>    <ul>      <li>another</li>      <li>list        <dl>          <dt>This</dt>          <dt>is</dt>            <dd>sdasd</dd>            <dd>sdasda</dd>            <dd>asda              <p>aasdas</p>            </dd>            <dd>sdada</dd>          <dt>fsdfsdfsd</dt>        </dl>        <dl>          <dt>vbcvcvbcvb</dt>          <dt>cvbcvbc</dt>            <dd>vbcbcvbcvb</dd>          <dt>cvbcv</dt>          <dt></dt>        </dl>        <dl>          <dt></dt>        </dl></li>      <li>cool</li>    </ul>    <p>stuff</p>  </li>  <li>cool</li></ol><p></p></body></html>";
        System.out.println(convert(html));
    }
}

Upvotes: 5

Mark
Mark

Reputation:

It sounds like you want to go from HTML to plain text.
If that is the case look at www.htmlparser.org. Here is an example that strips all the tags out from the html file found at a URL.
It makes use of org.htmlparser.beans.StringBean.

static public String getUrlContentsAsText(String url) {
    String content = "";
    StringBean stringBean = new StringBean();
    stringBean.setURL(url);
    content = stringBean.getStrings();
    return content;
}

Upvotes: 4

rqualis
rqualis

Reputation: 41

I know this is old, but I was just working on a project that required me to filter HTML and this worked fine:

noHTMLString.replaceAll("\\&.*?\\;", "");

instead of this:

html = html.replaceAll("&nbsp;","");
html = html.replaceAll("&amp;"."");

Upvotes: 4

Related Questions