Vishal Sanap

Reputation: 25

Replacing text in XWPFParagraph without changing format of the docx file

I am developing a font converter app that converts Unicode text to Krutidev/Shree Lipi (Marathi/Hindi) font text. The original docx file contains formatted words (color, font, text size, hyperlinks, etc.). I want the final docx to keep the same formatting as the original after the words are converted from Unicode to the other font.

PFA.

Input docx

Current Output

Here is my Code

try {
    fileInputStream = new FileInputStream("StartDoc.docx");
    document = new XWPFDocument(fileInputStream);
    Converter data = new Converter();
    for (XWPFParagraph p : document.getParagraphs()) {
        for (XWPFRun r : p.getRuns()) {
            String text = r.getText(0);
            // getText(0) is null for runs without text (e.g. pictures or breaks)
            if (text != null) {
                // Assumes uniToShree returns the converted string;
                // the original call discarded its result
                r.setText(data.uniToShree(text), 0);
            }
        }
    }
    // Write the document to the file system
    FileOutputStream out = new FileOutputStream(new File("Output.docx"));
    document.write(out);
    out.close();
    System.out.println("Output.docx written successfully");

} 
catch (IOException e) {
    System.out.println("We had an error while reading the Word Doc");
}

Upvotes: 1

Views: 1251

Answers (1)

Chandni Verma

Reputation: 11

Thank you for asking. I worked with POI some years ago, though on Excel workbooks, but I'll still try to help you reach the root cause of your error.

Java's exceptions carry good debugging information in themselves! A good first step to disambiguate the error is to not overwrite the exception's own message with a fixed message of yours.

Try printing the result of e.getLocalizedMessage() or e.getMessage() and see what you get. Getting the stack trace with the printStackTrace method is also often useful to pinpoint where your error lies!
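As a minimal sketch of that suggestion (the class name ExceptionReport and the file name Missing.docx are made up for illustration, and this is not the asker's code), the catch block can report the exception's own message and stack trace instead of a fixed string:

```java
import java.io.FileInputStream;
import java.io.IOException;

class ExceptionReport {
    // Builds a one-line summary: the exception type plus its own message.
    static String describe(Exception e) {
        return e.getClass().getSimpleName() + ": " + e.getMessage();
    }

    public static void main(String[] args) {
        // "Missing.docx" is a placeholder for a file that does not exist.
        try (FileInputStream in = new FileInputStream("Missing.docx")) {
            in.read();
        } catch (IOException e) {
            System.out.println(describe(e)); // what went wrong
            e.printStackTrace();             // where it went wrong
        }
    }
}
```

If the catch block is never entered, the fixed message is never printed either, which already tells you the problem is not an I/O failure.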

Share your findings from the above method calls so that we can help you debug the issue further.

[EDIT 1:]

So it seems you are able to process the file correctly as far as the font conversion of the data goes, but you are not able to reproduce the formatting of the original data in the converted file. (Thus "We had an error while reading the Word Doc" is a lie getting printed ;) )

Now, there are 2 elements to a Word document:

  1. Content
  2. Structure or Schema

You are able to convert the data because you are working only on the content of your doc files. To retain the formatting of the content, your solution needs to be aware of the formatting of the doc files as well and take care of it.

MS Word, which defined these document files and their extension (.docx), follows a particular set of schemas that define the rules of formatting. These schemas are defined in Microsoft's XML namespace packages [1].
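To make the content/structure split concrete, here is a rough sketch that uses only the JDK's XML parser, not POI. In the document XML, a run (w:r) keeps its formatting in w:rPr and its text in w:t, so the text can be rewritten without touching the formatting. The class name RunTextSwap, the sample fragment and the convert method (an upper-casing stand-in for a real font converter) are all assumptions for illustration:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class RunTextSwap {
    // WordprocessingML main namespace, bound to the "w" prefix in docx files.
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // One bold, red run: formatting sits in w:rPr, text in w:t.
    static final String SAMPLE = "<w:p xmlns:w=\"" + W + "\"><w:r><w:rPr><w:b/>"
            + "<w:color w:val=\"FF0000\"/></w:rPr><w:t>hello</w:t></w:r></w:p>";

    // Hypothetical stand-in for the real converter; it only upper-cases.
    static String convert(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
            return f.newDocumentBuilder().parse(new ByteArrayInputStream(bytes));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Rewrites the text of every w:t element; w:rPr elements are never touched.
    static void convertText(Document doc) {
        NodeList texts = doc.getElementsByTagNameNS(W, "t");
        for (int i = 0; i < texts.getLength(); i++) {
            Element t = (Element) texts.item(i);
            t.setTextContent(convert(t.getTextContent()));
        }
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        convertText(doc);
        Element color = (Element) doc.getElementsByTagNameNS(W, "color").item(0);
        System.out.println(doc.getElementsByTagNameNS(W, "t").item(0).getTextContent());
        System.out.println(color.getAttributeNS(W, "val")); // colour survives the text change
    }
}
```

POI's XWPFRun is a wrapper around the same w:r element, so replacing the text of an existing run rather than creating a new run is what lets its formatting carry over.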

You can obtain the XML (HTML) form of the doc file quite easily (see the steps in [1] or the code in [2]) and even apply different schemas, or possibly your own schema definitions based on those provided by MS's namespaces. There are two ways to do this. One is programmatic, for which you need to get versed in XML, XSL and XSLT concepts (w3schools [3] is a good starting point), but this method is no less complex than writing your own version of MS Word. The other is to use MS Word's built-in tools as shown in [1].
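Another way to look at the raw XML, separate from the routes in [1] and [2], relies on the fact that a .docx file is a zip archive whose main body is stored in the entry word/document.xml. The sketch below uses only java.util.zip and assumes Java 9 or later for readAllBytes; the class name DocxPeek and the fakeDocx helper (an in-memory stand-in so the example runs without a real file) are made up for illustration:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class DocxPeek {
    // Returns the named zip entry as UTF-8 text, or null if it is absent.
    static String readEntry(InputStream zipped, String name) {
        try (ZipInputStream zip = new ZipInputStream(zipped)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                if (e.getName().equals(name)) {
                    return new String(zip.readAllBytes(), StandardCharsets.UTF_8);
                }
            }
            return null;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    // Builds a tiny in-memory zip with one entry, standing in for a real .docx.
    static byte[] fakeDocx(String body) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write(body.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] docx = fakeDocx("<w:document/>");
        System.out.println(readEntry(new ByteArrayInputStream(docx), "word/document.xml"));
    }
}
```

With a real file, passing new FileInputStream("StartDoc.docx") to readEntry should return the XML in which each run's formatting and text can be inspected.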

[1]. https://www.microsoftpressstore.com/articles/article.aspx?p=2231769&seqNum=4#:~:text=During%20conversion%2C%20Word%20tags%20the,you%20can%20an%20HTML%20file.

[2]. https://svn.apache.org/repos/asf/poi/trunk/src/scratchpad/testcases/org/apache/poi/hwpf/converter/TestWordToHtmlConverter.java

[3]. https://www.w3schools.com/xml/

My answer gives you a cursory overview of how to achieve what you want. Depending on your inclination and available time, use your discretion before you decide to take one path rather than the other. Hope it helps!

Upvotes: 1
