Removing all HTML markup

Question

I have a string that holds a complete XML GET request.

The request contains a lot of HTML and some custom commands that I would like to remove.

The only way I know of doing so is with jsoup.

For example like so.

However, because the website the request came from also uses custom commands, I was not able to remove all of the code.

For example here is a string I would like to 'clean':

\u0027s normal text here\u003c/b\u003e http://a_random_link_here.com

Some more text here

As you can see, the custom commands all have backslashes in front of them.

How would I go about removing these commands with Java?

If I use a regex, how can I write it so that it removes only the command and nothing after it? I don't know the length of each command beforehand, and I don't want to hardcode all the commands.

Floris · Accepted Answer

See http://regex101.com/r/gJ2yN2

The regex (\\.\d{3,}.*?\s|(\\r|\\n)+) works to remove the things you were pointing out.

Result (replacing the match with a single space):

normal text here http://a_random_link_here.com Some more text here

If this was not the result you were looking for, please edit your question with the expected result.

EDIT: the regex explained:

()   - match everything inside the parentheses (later, the "match" gets replaced with "space")
\\   - an 'escaped' backslash (i.e. an actual backslash; the first one "protects" the second
       so it is not interpreted as a special character)
.    - any character (I saw 'u', but there might be others)
\d   - a digit
{3,} - "at least three"
.*?  - any characters, "lazy" (stop as soon as possible)
\s   - until you hit a white space
|    - or
()+  - one or more of these things
\\r  - backslash - r (again, with escaped '\')
\\n  - backslash - n
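
Since the question asks about Java, here is a minimal sketch of how the pattern above might be applied there. The class name EscapeStripper and the method strip are made up for illustration, the literal \r\n\r\n between the two sample lines is an assumption about what the original input looked like, and the final trim() is an addition that drops the space left where the first command was removed. Note that every backslash in the regex has to be doubled again inside a Java string literal.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
final class EscapeStripper {

    // The regex from the answer, written as a Java string literal:
    // each regex backslash is doubled, so the escaped backslash becomes four.
    private static final Pattern COMMANDS =
            Pattern.compile("(\\\\.\\d{3,}.*?\\s|(\\\\r|\\\\n)+)");

    // Replaces every match with a single space, then trims the ends.
    static String strip(String raw) {
        return COMMANDS.matcher(raw).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // The sample from the question; the backslashes are literal characters
        // in the data. The line-break commands in the middle are an assumption.
        String raw = "\\u0027s normal text here\\u003c/b\\u003e http://a_random_link_here.com"
                + "\\r\\n\\r\\nSome more text here";

        System.out.println(strip(raw));
    }
}
```

Because .*?\s runs to the next white space, anything glued directly onto a command (like the "s" after the first one) is removed along with it, so this only fits input where that is acceptable.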
