Jean-Paul

Reputation: 21180

Removing all HTML markup

I have a string that holds a complete XML get request.

In the request, there is a lot of HTML and some custom commands which I would like to remove.

The only way I know of doing this is with jsoup.

For example like so.

Now, because the website the request came from also features custom commands, I was not able to completely remove all code.

For example here is a string I would like to 'clean':

\u0027s normal text here\u003c/b\u003e http://a_random_link_here.com\r\n\r\nSome more text here

As you can see, the custom commands all have backslashes in front of them.

How would I go about removing these commands with Java?

If I use a regex, how can I write it so that it removes only the command and nothing after it? I don't know the length of each command beforehand, and I don't want to hardcode every command.

Upvotes: 0

Views: 352

Answers (2)

Floris

Reputation: 46435

See http://regex101.com/r/gJ2yN2

The regex (\\.\d{3,}.*?\s|(\\r|\\n)+) works to remove the things you were pointing out.

Result (replacing the match with a single space):

normal text here http://a_random_link_here.com Some more text here

If this was not the result you were looking for, please edit your question with the expected result.

EDIT regex explained:

()   - match everything inside the parentheses (later, the "match" gets replaced with "space")
\\   - an 'escaped' backslash (i.e. an actual backslash; the first one "protects" the second
       so it is not interpreted as a special character)
.    - any character (I saw 'u', but there might be others)
\d   - a digit
{3,} - "at least three"
.*?  - any characters, "lazy" (stop as soon as possible)
\s   - until you hit a white space
|    - or
()+  - one or more of these things
\\r  - backslash - r (again, with escaped '\')
\\n  - backslash - n
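
For reference, here is a minimal sketch of how the regex above might be applied in Java. The class name EscapeStripper and the method strip are made up for illustration, and the trim() call is an addition that drops the space left where the first match was replaced.

```java
import java.util.regex.Pattern;

final class EscapeStripper {

    // The regex from this answer as a Java string literal (each regex backslash is doubled).
    private static final Pattern ESCAPES =
            Pattern.compile("(\\\\.\\d{3,}.*?\\s|(\\\\r|\\\\n)+)");

    // Replaces every match with a single space, then trims both ends (trim is an added assumption).
    static String strip(String input) {
        return ESCAPES.matcher(input).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // The sample from the question; the backslashes are literal characters in the data.
        String raw = "\\u0027s normal text here\\u003c/b\\u003e "
                + "http://a_random_link_here.com\\r\\n\\r\\nSome more text here";
        System.out.println(strip(raw));
    }
}
```

Note that the first alternative runs lazily up to the next white space, so it also removes whatever directly follows the escape: in the sample, the "s" after \u0027 and the "/b" between the two escapes are dropped along with them.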

Upvotes: 1

keshlam

Reputation: 8058

The "custom commands" you're showing us appear to be standard character escapes. \r is carriage return, ASCII 13 (decimal). \n is new line, ASCII 10 (decimal). \uxxxx is generally an escape for the Unicode character with that hex value -- for example, \u0027 is ASCII character 39, the apostrophe character ('). You don't want to discard these; they're part of the text content you're trying to retrieve.

So the best approach is to establish which escapes can occur in this dataset, and then either find or write code that does a quick linear scan through the text looking for \. When it finds one, it uses the next character to determine which kind of escape it is (and how many subsequent characters belong to that escape), replaces the escape sequence with the single character it represents, and continues until it reaches the end of the string, buffer, or file.
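
A minimal sketch of such a scan in Java follows. It assumes the dataset uses only the escapes \r, \n, \t, \\ and \uXXXX with exactly four hex digits; the class name EscapeDecoder and the method decode are hypothetical, and anything the scan does not recognize is copied through unchanged.

```java
final class EscapeDecoder {

    // Decodes an assumed, limited set of backslash escapes in one left-to-right pass.
    static String decode(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c != '\\' || i + 1 >= input.length()) {
                out.append(c);            // ordinary character, or a trailing backslash
                i++;
                continue;
            }
            char kind = input.charAt(i + 1);   // the character after the backslash picks the escape
            switch (kind) {
                case 'n':  out.append('\n'); i += 2; break;
                case 'r':  out.append('\r'); i += 2; break;
                case 't':  out.append('\t'); i += 2; break;
                case '\\': out.append('\\'); i += 2; break;
                case 'u':
                    // A Unicode escape: backslash, 'u', then four hex digits (six characters in total).
                    if (i + 6 <= input.length() && isHex(input, i + 2, i + 6)) {
                        out.append((char) Integer.parseInt(input.substring(i + 2, i + 6), 16));
                        i += 6;
                    } else {
                        out.append(c);    // malformed escape: keep the backslash as-is
                        i++;
                    }
                    break;
                default:
                    out.append(c);        // unknown escape: keep the backslash as-is
                    i++;
            }
        }
        return out.toString();
    }

    // True if every character in [from, to) is a hexadecimal digit.
    private static boolean isHex(String s, int from, int to) {
        for (int k = from; k < to; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The sample from the question; the backslashes are literal characters in the data.
        String raw = "\\u0027s normal text here\\u003c/b\\u003e "
                + "http://a_random_link_here.com\\r\\n\\r\\nSome more text here";
        System.out.println(decode(raw));
    }
}
```

After decoding, the sample contains a real apostrophe, a real </b> tag and real line breaks, so the remaining HTML can then be handed to an HTML parser such as jsoup.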

Upvotes: 0
