marcoF76IT

Reputation: 55

Apache POI docx: HTML as an altChunk

Good morning,
I would like to add HTML as an altChunk to a DOCX file using Apache POI. To do that I followed this Stack Overflow answer:

How to add an altChunk element to a XWPFDocument using Apache POI

Everything works perfectly except for a problem with the special characters of my language (Italian).
My case is as follows: I have an external HTML file. To import it I use the following code:

byte[] inputBytes = Files.readAllBytes(Paths.get("testo.html"));
String xhtml = new String(inputBytes, StandardCharsets.UTF_8);

Then I generate the docx using the code provided in the Stack Overflow answer.
If I unzip the .docx, the file "chunk1.html" is correctly present under the "word" folder.
If I open it, the special characters are shown correctly, for example:

L'attività in oggetto è:

but when I open the document in Word I see this:

L'attivitÃ  in oggetto Ã¨:

Is there some Microsoft configuration that I missed?
Do I need to specify the character set when I create the chunk?

Upvotes: 0

Views: 950

Answers (1)

Axel Richter

Reputation: 61915

Microsoft Word seems to assume ANSI as the default character encoding for HTML chunks. That's annoying, since the rest of the world now uses Unicode (UTF-8) as the default.
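To see why the accented letters break, here is a minimal sketch that encodes the sample text as UTF-8 and decodes the same bytes as windows-1252. That windows-1252 is the ANSI code page Word falls back to is an assumption (it is the usual one on Western European Windows installations); the class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    // Encode the text as UTF-8, then decode those bytes as windows-1252.
    // Assumption: windows-1252 stands in for "ANSI" here.
    static String misreadUtf8AsAnsi(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // \u00E0 is "a with grave accent", \u00E8 is "e with grave accent"
        String original = "L'attivit\u00E0 in oggetto \u00E8:";
        String misread = misreadUtf8AsAnsi(original);
        // Each accented letter is two bytes in UTF-8, so it is decoded as two characters.
        System.out.println(original.length() + " chars become " + misread.length());
        System.out.println(misread);
    }
}
```

The bytes in chunk1.html are fine in this scenario; only the reader's choice of decoder is wrong, which is why the file looks correct in a UTF-8 aware editor.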

So we need to set the charset for the HTML explicitly. In the chunk's HTML template, do:

...
  private MyXWPFHtmlDocument(PackagePart part, String id) throws Exception {
    super(part);
    // The meta tag declares UTF-8 explicitly so that Word does not fall back to ANSI.
    this.html = "<!DOCTYPE html><html><head><meta http-equiv=\"content-type\" content=\"text/html; charset=utf-8\"><style></style><title>HTML import</title></head><body></body>";
    this.id = id;
  }
...

I would recommend this instead of using ANSI encoding for the HTML chunks.
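If the HTML comes from an external file, as in the question, and replaces the template entirely, the same declaration has to be present in that file. A minimal sketch of a helper that adds it when missing could look like this. The class `HtmlCharset` and the method `ensureUtf8Meta` are hypothetical names, not part of Apache POI, and the regex-based detection is a simplification that assumes the file has a `<head>` element.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlCharset {

    static final String META_UTF8 =
        "<meta http-equiv=\"content-type\" content=\"text/html; charset=utf-8\">";

    // Matches an opening <head> tag (with or without attributes), but not <header>.
    private static final Pattern HEAD_OPEN =
        Pattern.compile("<head(\\s[^>]*)?>", Pattern.CASE_INSENSITIVE);

    // Matches any meta tag that already mentions a charset.
    private static final Pattern META_CHARSET =
        Pattern.compile("<meta[^>]*charset", Pattern.CASE_INSENSITIVE);

    // Returns the HTML with a UTF-8 declaration inserted right after <head>.
    // The input is returned unchanged if a charset is already declared
    // or if no <head> tag is found (simplification for this sketch).
    static String ensureUtf8Meta(String html) {
        if (META_CHARSET.matcher(html).find()) {
            return html;
        }
        Matcher head = HEAD_OPEN.matcher(html);
        if (!head.find()) {
            return html;
        }
        return html.substring(0, head.end()) + META_UTF8 + html.substring(head.end());
    }
}
```

With the question's code this would be applied right after reading the file, for example `String xhtml = HtmlCharset.ensureUtf8Meta(new String(inputBytes, StandardCharsets.UTF_8));`. Note that it does not rewrite an existing charset declaration, so a file that already declares a different encoding is left as it is.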

I have also edited this into my answer to How to add an altChunk element to a XWPFDocument using Apache POI.

Upvotes: 1
