Reputation: 55
Good morning
I would like to add HTML as an altChunk to a DOCX file using Apache POI. To do that, I followed this Stack Overflow answer:
How to add an altChunk element to a XWPFDocument using Apache POI
Everything works perfectly except for a problem with the special characters of my language (Italian).
My case is the following: I have an external HTML file. To import it, I use the following code:
byte[] inputBytes = Files.readAllBytes(Paths.get("testo.html"));
String xhtml = new String(inputBytes, StandardCharsets.UTF_8);
Then I generate the .docx using the code provided in the Stack Overflow answer.
If I unzip the .docx, the file "chunk1.html" is correctly present under the "word" folder.
If I open it, the special characters appear correctly, for example:
L'attività in oggetto è:
but when I open the document in Word I see this:
L'attivitÃ  in oggetto Ã¨:
Is there some Microsoft configuration that I missed?
Do I need to specify the character set when I create the chunk?
Upvotes: 0
Views: 950
Reputation: 61915
Microsoft seems to take ANSI as the default character encoding for HTML chunks in Word. That's annoying, as the rest of the world takes Unicode (UTF-8) as the default now.
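To see why this produces exactly the garbling described in the question, here is a minimal sketch that encodes the text as UTF-8 and then decodes the same bytes as windows-1252. It assumes "ANSI" means windows-1252, which is the usual case on Western European Windows systems; the class name `MojibakeDemo` and the method `misdecode` are made up for this illustration and are not part of Apache POI.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    // Assumption: the "ANSI" code page is windows-1252 (Western European).
    private static final Charset ANSI = Charset.forName("windows-1252");

    // Encode the text as UTF-8, then decode those bytes as windows-1252.
    // This mimics a reader that ignores the real encoding of the bytes.
    public static String misdecode(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, ANSI);
    }

    public static void main(String[] args) {
        // \u00E0 is "a with grave", \u00E8 is "e with grave";
        // escapes keep this file independent of the source encoding.
        String original = "L'attivit\u00E0 in oggetto \u00E8:";
        System.out.println(misdecode(original));
    }
}
```

Each accented letter takes two bytes in UTF-8, and each of those bytes is read as a separate windows-1252 character, so every accented letter turns into a two-character sequence starting with "Ã".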
So we need to set the charset for the HTML explicitly. In the template of the chunk's HTML, do:
...
private MyXWPFHtmlDocument(PackagePart part, String id) throws Exception {
    super(part);
    // The meta tag declares UTF-8 so that Word does not fall back to ANSI.
    this.html = "<!DOCTYPE html><html><head><meta http-equiv=\"content-type\" content=\"text/html; charset=utf-8\"><style></style><title>HTML import</title></head><body></body>";
    this.id = id;
}
...
I would recommend this instead of using ANSI encoding for the HTML chunks.
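If the HTML comes from an external file, as in the question, and replaces the template as a whole, the same declaration has to be present in that file. A rough sketch of a helper that adds it when it is missing follows. The class `HtmlCharsetHelper` and the method `ensureUtf8Meta` are hypothetical names, not part of Apache POI, and the sketch assumes the chunk is still written to the package as UTF-8 bytes.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlCharsetHelper {

    static final String META =
        "<meta http-equiv=\"content-type\" content=\"text/html; charset=utf-8\">";

    // Matches <head> or <head ...>, but not <header>.
    private static final Pattern HEAD_OPEN =
        Pattern.compile("<head(\\s[^>]*)?>", Pattern.CASE_INSENSITIVE);

    // Returns the HTML with a UTF-8 charset declaration, adding one if none exists.
    public static String ensureUtf8Meta(String html) {
        // Simplification: any existing "charset=" is left untouched,
        // even if it names an encoding other than UTF-8.
        if (html.toLowerCase(Locale.ROOT).contains("charset=")) {
            return html;
        }
        Matcher head = HEAD_OPEN.matcher(html);
        if (head.find()) {
            // Insert the declaration right after the opening head tag.
            return html.substring(0, head.end()) + META + html.substring(head.end());
        }
        // No head element: treat the input as a body fragment and wrap it.
        return "<!DOCTYPE html><html><head>" + META + "</head><body>"
            + html + "</body></html>";
    }

    public static void main(String[] args) {
        System.out.println(ensureUtf8Meta("<html><head></head><body>\u00E8</body></html>"));
    }
}
```

With the question's code, this would be called on the `xhtml` string after reading the file and before handing it to the chunk. The string check is deliberately naive; a real HTML parser would be more robust for arbitrary input.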
I have edited this into my answer in How to add an altChunk element to a XWPFDocument using Apache POI too.
Upvotes: 1