user3918038

Reputation: 3

JSOUP removing words which occur more than once from attributes

Assuming content to be the following:

<p><img src=\"https://abcd.com/pic.jpg\" alt=\"man with a umbrella on terrace with lots of xyz\" width=\"500\" height=\"400\" /></p>

If the following lines of code are used, JSOUP removes words that occur more than once in any attribute.

// ParseSettings(true, true) preserves the case of tag and attribute names
Parser parser = Parser.htmlParser();
parser.settings(new ParseSettings(true, true));
Document doc = Jsoup.parse(modifiedContent, "", parser);

<p><img src=\"https://abcd.com/pic.jpg\" alt=\"man with a umbrella on terrace lots of xyz\" width=\"500\" height=\"400\" /></p>

The second occurrence of the word "with" is removed. Any suggestions on how to deal with this issue?

Upvotes: 0

Views: 53

Answers (1)

Evan Knowles

Reputation: 7501

Your input HTML has its quotation marks escaped with backslashes. This means that, instead of your alt attribute being man with a umbrella on terrace with lots of xyz, its value is just the first token, \"man. Following the alt attribute you essentially have multiple boolean attributes, being with, a, umbrella, etc.

JSoup is then stripping out the duplicated boolean attribute (the second with), as it has no effect. You should change your HTML to the correct format, without the escaped quotation marks:

<p><img src="https://abcd.com/pic.jpg" alt="man with a umbrella on terrace with lots of xyz" width="500" height="400" /></p>

Running this locally and printing the doc with System.out.println produces the correct value of

<html>
 <head></head>
 <body>
  <p><img src="https://abcd.com/pic.jpg" alt="man with a umbrella on terrace with lots of xyz" width="500" height="400"></p>
 </body>
</html>
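
If the backslashes are literally present in the string at runtime (for example, because it came from a JSON payload that was never decoded), one way to get the HTML into the correct format is to strip them before parsing. The following is only a minimal sketch under that assumption: QuoteUnescaper and unescapeQuotes are hypothetical names, not part of jsoup, and the code uses only the standard library.

```java
// Hypothetical helper (not part of jsoup): turns \" back into " before parsing.
final class QuoteUnescaper {

    // Replaces every literal backslash-quote pair with a plain double quote.
    static String unescapeQuotes(String html) {
        return html.replace("\\\"", "\"");
    }

    public static void main(String[] args) {
        // Assumption: the runtime string really contains backslash characters,
        // so each \" from the question is written as \\\" in Java source.
        String modifiedContent = "<p><img src=\\\"https://abcd.com/pic.jpg\\\" "
                + "alt=\\\"man with a umbrella on terrace with lots of xyz\\\" "
                + "width=\\\"500\\\" height=\\\"400\\\" /></p>";

        String cleaned = unescapeQuotes(modifiedContent);
        System.out.println(cleaned);
        // cleaned can now be passed to Jsoup.parse(cleaned, "", parser).
    }
}
```

A blanket replace like this would also rewrite any backslash-quote pair that is meant to be there, so decoding the content properly at its source is the safer fix where that is possible.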

Upvotes: 2
