Reputation: 3
Assume the content is the following:
<p><img src=\"https://abcd.com/pic.jpg\" alt=\"man with a umbrella on terrace with lots of xyz\" width=\"500\" height=\"400\" /></p>
If the following lines of code are used, jsoup removes any word that occurs more than once in an attribute value, producing the output shown after the code:
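// Needs org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.parser.Parser and org.jsoup.parser.ParseSettings.
Parser parser = Parser.htmlParser();
// Preserve the case of both tag names and attribute names.
parser.settings(new ParseSettings(true, true));
Document doc = Jsoup.parse(modifiedContent, "", parser);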
<p><img src=\"https://abcd.com/pic.jpg\" alt=\"man with a umbrella on terrace lots of xyz\" width=\"500\" height=\"400\" /></p>
In the output above, the second occurrence of the word "with" has been removed. Any suggestions on how to deal with this issue?
Upvotes: 0
Views: 53
Reputation: 7501
Your input HTML has its quotation marks escaped (\"). This means that, instead of your alt attribute being man with a umbrella on terrace with lots of xyz, its value is only the first token, \"man. The words that follow it are each parsed as a separate boolean attribute: with, a, and so on.
jsoup then strips out the duplicated boolean attributes, since a repeated attribute has no effect. You should change your HTML to the correct format, without the escaped quotation marks:
<p><img src="https://abcd.com/pic.jpg" alt="man with a umbrella on terrace with lots of xyz" width="500" height="400" /></p>
Running this locally and printing the doc with System.out produces the correct output:
<html>
<head></head>
<body>
<p><img src="https://abcd.com/pic.jpg" alt="man with a umbrella on terrace with lots of xyz" width="500" height="400"></p>
</body>
</html>
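If you cannot change the HTML at its source, one option is to unescape the quotation marks before handing the string to jsoup. The following is a minimal sketch, assuming the backslashes are literally present in the string (for example, because it came from a JSON payload that was never decoded). QuoteUnescaper and unescapeQuotes are hypothetical names used for illustration and are not part of jsoup.

```java
// Hypothetical helper, not part of jsoup: turns every \" sequence back into a plain ".
class QuoteUnescaper {

    static String unescapeQuotes(String html) {
        // "\\\"" is the two-character sequence backslash + double quote.
        return html.replace("\\\"", "\"");
    }

    public static void main(String[] args) {
        // This Java literal contains literal backslashes, matching the input in the question.
        String content = "<p><img src=\\\"https://abcd.com/pic.jpg\\\" "
                + "alt=\\\"man with a umbrella on terrace with lots of xyz\\\" "
                + "width=\\\"500\\\" height=\\\"400\\\" /></p>";

        String cleaned = unescapeQuotes(content);
        System.out.println(cleaned);
        // cleaned can now be passed to Jsoup.parse(cleaned, "", parser) as in the question.
    }
}
```

Note that a blanket replace also rewrites any \" sequence that legitimately appears in text content. If the string really originates from JSON, decoding it with a proper JSON parser is probably the safer fix.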
Upvotes: 2