Rohan K

Reputation: 177

How to find the location of an image while parsing an MS Word .doc in Java using Apache POI

HWPFDocument wordDoc = new HWPFDocument(new FileInputStream(fileName));
List<Picture> picturesList = wordDoc.getPicturesTable().getAllPictures();

The code above returns a list of all pictures in the document. How can I find out where in the document each image is located, i.e. after which text or at which position?

Upvotes: 0

Views: 3217

Answers (2)

Mjid Elm

Reputation: 1

You should add a PicturesSource class:

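import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.model.PicturesTable;
import org.apache.poi.hwpf.usermodel.CharacterRun;
import org.apache.poi.hwpf.usermodel.Picture;
import org.apache.poi.hwpf.usermodel.Range;

public class PicturesSource {

    private final PicturesTable picturesTable;
    // Pictures that have already been handled
    private final Set<Picture> output = new HashSet<Picture>();
    // Pictures keyed by their start offset in the picture data
    private final Map<Integer, Picture> lookup;
    // Pictures that no character run points at; claimed entries are set to null
    private final List<Picture> nonU1based;
    private final List<Picture> all;
    private int pn = 0;

    public PicturesSource(HWPFDocument doc) {
        picturesTable = doc.getPicturesTable();
        all = picturesTable.getAllPictures();

        // Index every picture by its offset so it can be matched to a run
        lookup = new HashMap<Integer, Picture>();
        for (Picture p : all) {
            lookup.put(p.getStartOffset(), p);
        }

        // Start with all pictures, then null out those referenced by a run
        nonU1based = new ArrayList<Picture>();
        nonU1based.addAll(all);
        Range r = doc.getRange();
        for (int i = 0; i < r.numCharacterRuns(); i++) {
            CharacterRun cr = r.getCharacterRun(i);
            if (picturesTable.hasPicture(cr)) {
                Picture p = getFor(cr);
                int at = nonU1based.indexOf(p);
                // indexOf returns -1 if the picture is not in the list
                if (at >= 0) {
                    nonU1based.set(at, null);
                }
            }
        }
    }

    public boolean hasPicture(CharacterRun cr) {
        return picturesTable.hasPicture(cr);
    }

    public void recordOutput(Picture picture) {
        output.add(picture);
    }

    public boolean hasOutput(Picture picture) {
        return output.contains(picture);
    }

    // 1-based index of the picture within the document's picture list
    public int pictureNumber(Picture picture) {
        return all.indexOf(picture) + 1;
    }

    // Returns the picture the run points at, or null if there is none
    public Picture getFor(CharacterRun cr) {
        return lookup.get(cr.getPicOffset());
    }

    // Returns the next picture not referenced by any run, or null when done
    public Picture nextUnclaimed() {
        while (pn < nonU1based.size()) {
            Picture p = nonU1based.get(pn);
            pn++;
            if (p != null) return p;
        }
        return null;
    }
}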

Upvotes: 0

Gagravarr

Reputation: 48326

You're getting at the pictures the wrong way, which is why you're not finding any positions!

What you need to do is process each CharacterRun of the document in turn. Pass each run to the PicturesTable and check whether it contains a picture. If it does, fetch the picture back from the table. You then know where in the document it belongs, because you have the run it came from.

At the simplest, it'd be something like:

// document is the HWPFDocument opened from the .doc file
PicturesSource pictures = new PicturesSource(document);
PicturesTable pictureTable = document.getPicturesTable();

Range r = document.getRange();
for (int i = 0; i < r.numParagraphs(); i++) {
    Paragraph p = r.getParagraph(i);
    for (int j = 0; j < p.numCharacterRuns(); j++) {
        CharacterRun cr = p.getCharacterRun(j);
        if (pictureTable.hasPicture(cr)) {
            // This run marks where the picture sits in the document
            Picture picture = pictures.getFor(cr);
            // Do something useful with the picture
        }
    }
}
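
To turn "the run it comes from" into an actual position, one option is to remember the last piece of text seen before each picture run. The following is only a minimal sketch of that bookkeeping, written against plain Java so it runs without POI: Run and PicturePosition are hypothetical stand-ins, not POI classes. The assumption is that Run.startOffset plays the role of CharacterRun.getStartOffset(), Run.picOffset the role of CharacterRun.getPicOffset(), and the map keys the role of Picture.getStartOffset().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model of the run walk; none of these types come from Apache POI.
final class PicturePositions {

    // Hypothetical stand-in for a character run.
    static final class Run {
        final String text;      // text of the run
        final int startOffset;  // character offset of the run in the document
        final int picOffset;    // offset of the picture data, or -1 if none

        Run(String text, int startOffset, int picOffset) {
            this.text = text;
            this.startOffset = startOffset;
            this.picOffset = picOffset;
        }
    }

    // Where a picture was found: its run's offset and the text just before it.
    static final class PicturePosition {
        final String pictureName;
        final int charOffset;
        final String precedingText;

        PicturePosition(String pictureName, int charOffset, String precedingText) {
            this.pictureName = pictureName;
            this.charOffset = charOffset;
            this.precedingText = precedingText;
        }
    }

    // Walks the runs in document order and records a position for every run
    // whose picOffset matches a known picture.
    static List<PicturePosition> locate(List<Run> runs, Map<Integer, String> picturesByOffset) {
        List<PicturePosition> found = new ArrayList<>();
        String lastText = "";
        for (Run run : runs) {
            String name = run.picOffset >= 0 ? picturesByOffset.get(run.picOffset) : null;
            if (name != null) {
                found.add(new PicturePosition(name, run.startOffset, lastText));
            } else if (!run.text.trim().isEmpty()) {
                // Remember the most recent non-blank text run
                lastText = run.text;
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<Run> runs = Arrays.asList(
                new Run("Intro text. ", 0, -1),
                new Run("\u0001", 12, 100),   // placeholder character for a picture
                new Run("More text. ", 13, -1));
        Map<Integer, String> pictures = new HashMap<>();
        pictures.put(100, "logo.png");

        for (PicturePosition pos : locate(runs, pictures)) {
            System.out.println(pos.pictureName + " at offset " + pos.charOffset
                    + ", after: " + pos.precedingText.trim());
        }
    }
}
```

In the real loop above, the same idea would mean keeping the text of the previous non-picture CharacterRun in a variable and storing it alongside each Picture you fetch.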

You can find a good example of doing this in the Apache Tika parser for Microsoft Word .doc files, which is powered by Apache POI.

Upvotes: 2
