blindado

Reputation: 43

JSoup: strip HTML markup from XML

I've been searching Stack Overflow but couldn't find anyone with this kind of problem.

I want to do something like this:

Input String:

<?xml version="1.0" encoding="UTF-8" ?>
<List>
  <Object>
    <Section>Fruit</Section>
    <Category>Bananas</Category>
    <Brand>Chiquita</Brand>
    <Obs><p>
Vende-se a pe&ccedil;as ou o conjunto.</p><br>
    </Obs>
  </Object>
</List>

What I want is to strip the HTML tags, like <p>, <br>, etc., so it ends up like this:

<?xml version="1.0" encoding="UTF-8" ?>
<List>
  <Object>
    <Section>Fruit</Section>
    <Category>Bananas</Category>
    <Brand>Chiquita</Brand>
    <Obs>
Vende-se a pe&ccedil;as ou o conjunto.
    </Obs>
  </Object>
</List>

I have been playing around with JSoup, but I can't seem to make it work properly.

This is the code I have:

Whitelist whitelist = Whitelist.none();
String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\" ?><List><Object><Section>Fruit</Section><Category>Bananas</Category><Brand>Chiquita</Brand><Obs><p>Vende-se a pe&ccedil;as ou o conjunto.</p><br></Obs></Object></List>";

whitelist.addTags(new String[]{"?xml", "List", "Object", "Section", "Category", "Brand", "Obs"});
String safe = Jsoup.clean(xml, whitelist);

This is the result I am getting:

FruitBananasChiquitaVende-se a pe&ccedil;as ou o conjunto.

Thanks in advance

Upvotes: 2

Views: 464

Answers (2)

ollo

Reputation: 25380

You can use unwrap() to do so:

Example:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.parser.Parser;

    final String input = "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n"
            + "<List>\n"
            + "  <Object>\n"
            + "    <Section>Fruit</Section>\n"
            + "    <Category>Bananas</Category>\n"
            + "    <Brand>Chiquita</Brand>\n"
            + "    <Obs><p>\n"
            + "Vende-se a pe&ccedil;as ou o conjunto.</p><br>\n"
            + "    </Obs>\n"
            + "  </Object>\n"
            + "</List>";

    // Parse with the XML parser so the input is not treated as an HTML page
    Document doc = Jsoup.parse(input, "", Parser.xmlParser());

    doc.select("p").unwrap();  // removes every <p> tag but keeps its content
    doc.select("br").unwrap(); // removes every <br> tag but keeps any nested content

Also, it's better to use an XML parser instead of an HTML parser here.

Output:

<?xml version="1.0" encoding="UTF-8" ?> 
<list> 
 <object> 
  <section>
   Fruit
  </section> 
  <category>
   Bananas
  </category> 
  <brand>
   Chiquita
  </brand> 
  <obs>
    Vende-se a pe&ccedil;as ou o conjunto. 
  </obs> </object> 
</list>

Upvotes: 2

Guy Gavriely

Reputation: 11396

JSoup lowercases tag names when it parses the input, so the whitelist entries must be lowercase too. Use:

whitelist.addTags(new String[] { "?xml", "list", "object", "section",
    "category", "brand", "obs" });

Output:

<list>
 <object>
  <section>
   Fruit
  </section>
  <category>
   Bananas
  </category>
  <brand>
   Chiquita
  </brand>
  <obs>
   Vende-se a pe&ccedil;as ou o conjunto.
  </obs></object>
</list>
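
The whitelist idea above (keep a fixed set of element names, compare them case-insensitively, drop every other tag) can also be sketched without JSoup. This is only a regex-based illustration of the same idea, not JSoup's API: the class name `TagStripper`, the method `stripUnknownTags` and the allow-list contents are hypothetical. A regex is not a parser, so this is assumed to suit only simple fragments like the one in the question (no comments, CDATA, or `>` inside attribute values).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagStripper {
    // Hypothetical allow-list: the XML element names from the question, lowercased.
    private static final Set<String> KEEP = new HashSet<>(Arrays.asList(
            "list", "object", "section", "category", "brand", "obs"));

    // Matches an opening or closing tag and captures its name. The XML
    // declaration (<?xml ... ?>) does not match and is left untouched.
    private static final Pattern TAG =
            Pattern.compile("</?([A-Za-z][A-Za-z0-9]*)[^>]*>");

    public static String stripUnknownTags(String xml) {
        Matcher m = TAG.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Compare case-insensitively by lowercasing the captured name.
            String name = m.group(1).toLowerCase(Locale.ROOT);
            // Keep allowed tags verbatim (original case), drop all others.
            String replacement = KEEP.contains(name) ? m.group() : "";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String xml = "<Obs><p>Vende-se a pe&ccedil;as ou o conjunto.</p><br></Obs>";
        // prints: <Obs>Vende-se a pe&ccedil;as ou o conjunto.</Obs>
        System.out.println(stripUnknownTags(xml));
    }
}
```

Unlike the JSoup output, this leaves the kept tags in their original case and does not re-indent the document.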

Upvotes: 4
