Read .doc file content and write into pdf file in java

I'm writing Java code that uses Apache POI to read an MS Office .doc file and the iText API to create and write a PDF file. I have implemented reading the text and tables in the .doc file. Now I'm looking for a way to read the images embedded in the document. I wrote the following code to read them. Why is it not working?

public static void main(String[] args) {
    POIFSFileSystem fs = null;  
    Document document = new Document();
    WordExtractor extractor = null ;
    try {
        fs = new POIFSFileSystem(new FileInputStream("C:\\DATASTORE\\tableandImage.doc"));
        HWPFDocument hdocument=new HWPFDocument(fs);
        extractor = new WordExtractor(hdocument);
        OutputStream fileOutput = new FileOutputStream(new File("C:/DATASTORE/tableandImage.pdf"));
        PdfWriter.getInstance(document, fileOutput);
        document.open();
        Range range=hdocument.getRange();
        String readText=null;
        PdfPTable createTable;
        CharacterRun run;
        PicturesTable picture;

        for(int i=0;i<range.numParagraphs();i++) {
            Paragraph par = range.getParagraph(i);
            readText=par.text();
            if(!par.isInTable()) {
                if(readText.endsWith("\n")) {
                    readText=readText+"\n";
                    document.add(new com.itextpdf.text.Paragraph(readText));
                } else if(readText.endsWith("\r")) {
                    readText += "\n";
                    document.add(new com.itextpdf.text.Paragraph(readText));
                }
                run =range.getCharacterRun(i);
                picture=hdocument.getPicturesTable();
                if(picture.hasPicture(run)) {
                //if(run.isSpecialCharacter()) {  
                    Picture pic=picture.extractPicture(run, true);
                    byte[] picturearray=pic.getContent();
                    com.itextpdf.text.Image image=com.itextpdf.text.Image.getInstance(picturearray);
                    document.add(image);
                }
            } else if (par.isInTable()) { 
                  Table table = range.getTable(par);
                  TableRow tRow1= table.getRow(0);
                  int numColumns=tRow1.numCells();
                  createTable=new PdfPTable(numColumns);
                  for (int rowId=0;rowId<table.numRows();rowId++) {
                      TableRow tRow = table.getRow(rowId);
                      for (int cellId=0;cellId<tRow.numCells();cellId++) {
                          TableCell tCell = tRow.getCell(cellId);
                          PdfPCell c1 = new PdfPCell(new Phrase(tCell.text()));
                          createTable.addCell(c1);
                      }
                  }
                  document.add(createTable);
              } 
        }
    }catch(IOException e) {
        System.out.println("IO Exception");
        e.printStackTrace();
    }
    catch(Exception exep) {
        exep.printStackTrace();
    }finally {  
        document.close();  
    }  
}

There are two problems:

1. The condition if(picture.hasPicture(run)) is never true, even though the document contains a JPEG image.

2. I get the following exception while reading the table:

java.lang.IllegalArgumentException: This paragraph is not the first one in the table
    at org.apache.poi.hwpf.usermodel.Range.getTable(Range.java:876)
    at pagecode.ReadDocxOrDocFile.main(ReadDocxOrDocFile.java:113)

Can anybody help me solve these problems? Thank you.

Answers (1)

morido

Regarding your exception:

Your code iterates over all paragraphs and calls isInTable() for each one of them. Since a table is usually composed of several such paragraphs, your call to getTable() is executed several times for a single table. As the exception message says, getTable() only accepts the first paragraph of a table, so the call fails as soon as the loop reaches the table's second paragraph.

What your code should do instead is find the first paragraph of a table, process all paragraphs therein (via getRow(m).getCell(n)) and then continue the outer loop at the first paragraph after the table. In code this may look roughly like the following (assuming no merged cells, no nested tables and no other funny edge cases):

if (par.isInTable()) {
    Table table = range.getTable(par);
    for (int rn=0; rn<table.numRows(); rn++) {
        TableRow row = table.getRow(rn);
        for (int cn=0; cn<row.numCells(); cn++) {
            TableCell cell = row.getCell(cn);
            for (int pn=0; pn<cell.numParagraphs(); pn++) {
                Paragraph cellParagraph = cell.getParagraph(pn);
                // your PDF conversion code goes here
            }
        }
    }
    i += table.numParagraphs()-1; // skip the already processed (table-)paragraphs in the outer loop
}
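
To make the index arithmetic in the snippet above easier to check, here is a small sketch that models it in plain Java, without POI. Everything in it is hypothetical: the int array stands in for the document's paragraphs (-1 for a body paragraph, otherwise an id for the table the paragraph belongs to), tableLength() stands in for table.numParagraphs(), and tableStarts() records the indices at which range.getTable(par) would be called.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the outer paragraph loop; it does not use POI at all.
final class TableSkipSketch {

    // Stand-in for table.numParagraphs(): number of consecutive paragraphs,
    // starting at 'start', that carry the same table id as the one at 'start'.
    static int tableLength(int[] tableIds, int start) {
        int end = start;
        while (end < tableIds.length && tableIds[end] == tableIds[start]) {
            end++;
        }
        return end - start;
    }

    // Returns the paragraph indices at which a table would be fetched.
    // Each table contributes exactly one index: that of its first paragraph.
    static List<Integer> tableStarts(int[] tableIds) {
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i < tableIds.length; i++) {
            if (tableIds[i] < 0) {
                continue; // body paragraph, nothing to skip
            }
            starts.add(i); // the real code would call range.getTable(par) here
            // Jump to the table's last paragraph; the loop's own i++ then
            // moves on to the first paragraph after the table.
            i += tableLength(tableIds, i) - 1;
        }
        return starts;
    }

    public static void main(String[] args) {
        // One body paragraph, a 4-paragraph table, a body paragraph,
        // a 2-paragraph table, and a final body paragraph.
        int[] tableIds = {-1, 0, 0, 0, 0, -1, 1, 1, -1};
        System.out.println(tableStarts(tableIds)); // prints [1, 6]
    }
}
```

Without the `i += ... - 1` line, the loop would try to fetch a table at indices 2, 3, 4 and 7 as well, which corresponds to the situation in which getTable() throws.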

Regarding the pictures issue:

Am I guessing right that you are trying to obtain the picture which is anchored within a given paragraph? Unfortunately, the predefined methods of POI only work if the picture is not embedded within a field (which is rather rare, actually). For field-based images (i.e. preview images of embedded OLEs) you should do something like the following (untested!):

PictureStore pictureStore = new PictureStore(hdocument);
// bla bla ...
for (int cr=0; cr < par.numCharacterRuns(); cr++) {
    CharacterRun characterRun = par.getCharacterRun(cr);
    Field field = hdocument.getFields().getFieldByStartOffset(FieldsDocumentPart.MAIN, characterRun.getStartOffset());
    if (field != null && field.getType() == 0x3A) { // 0x3A is type "EMBED"   
        Picture pic = pictureStore.getPicture(field.secondSubrange(characterRun));
    }
}

For a list of possible values of Field.getType() see here.
