Reputation: 8090
Hello, I am having trouble parsing a PDF: when the loop reaches page 11, an exception is thrown.
Any ideas? Thanks.
Here's my code:
import java.io.*;
import java.nio.charset.Charset;
import java.util.regex.*;
import com.lowagie.text.pdf.PdfReader;
import com.lowagie.text.pdf.hyphenation.TernaryTree.Iterator;
import com.lowagie.text.pdf.parser.PdfTextExtractor;
public class PdfParser {
    /**
     * @param args
     */
    public static void main(String[] args) {
        // TODO Auto-generated method stub
        int index = 0;
        try {
            PdfReader readerN = new PdfReader("C:\\Documents and Settings\\stefan.stere\\hibernateWorkspace\\PdfParser\\src\\monitor3.pdf");
            OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(new File("C:\\Documents and Settings\\stefan.stere\\hibernateWorkspace\\PdfParser\\src\\pdf2txt.rtf")), "Cp1252");
            PdfTextExtractor parse = new PdfTextExtractor(readerN);
            int nrPages = readerN.getNumberOfPages();
            for (int i = 1; i < nrPages; i++) {
                index++;
                String page = parse.getTextFromPage(i);
                if (page != null) {
                    page = page.replace(new StringBuffer("null"), new StringBuffer("??"));
                    page = page.replaceAll("Comercial.", "Comerciala");
                    page = page.replaceAll("ACT ADI..IONAL", "ACT ADITIONAL");
                    page = page.replaceAll("HOT.R..E", "HOTARARE");
                    page = page.replaceAll("HOT.R..EA", "HOTARAREA");
                    page = page.replaceAll("HOT.R..I", "HOTARARI");
                    page = page.replaceAll("..cheiat.", "incheiata");
                    page = page.replaceAll("ANUN..", "ANUNT");
                    out.write(page);
                    System.out.println(page);
                }
            }
            out.close();
            readerN.close();
        } catch (Exception e) {
            // TODO: handle exception
            e.printStackTrace();
            System.out.println(index);
        }
    }
}
And the exception stack trace:
java.lang.ArrayIndexOutOfBoundsException: Invalid index: 62
at com.lowagie.text.pdf.CMapAwareDocumentFont.decodeSingleCID(Unknown Source)
at com.lowagie.text.pdf.CMapAwareDocumentFont.decode(Unknown Source)
at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.decode(Unknown Source)
at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.displayPdfString(Unknown Source)
at com.lowagie.text.pdf.parser.PdfContentStreamProcessor$ShowTextArray.invoke(Unknown Source)
at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.invokeOperator(Unknown Source)
at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.processContent(Unknown Source)
at com.lowagie.text.pdf.parser.PdfTextExtractor.getTextFromPage(Unknown Source)
at PdfParser.main(PdfParser.java:32)
Upvotes: 0
Views: 2638
Reputation: 8771
For information: iText 5.0.6 now works well with a wider range of PDFs. I have tested PDFs generated by many different programs, and 2.1.7 had problems with many of them.
Get the latest version from the iText download page.
Upvotes: 1
Reputation: 12780
This is not an answer, but a lot of people seem to have the same problem; there's another related question on SO. If you search Google for ArrayIndexOutOfBoundsException and getTextFromPage, you will find the same problem reported, but no solution.
BTW, your loop stops before processing the last page: page numbering starts at 1, so the condition should be i <= nrPages rather than i < nrPages.
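To illustrate the loop remark, here is a minimal sketch that runs without iText. PageTextSource and extractAll are hypothetical names made up for this example; the interface only stands in for a call like parse.getTextFromPage(i). The sketch uses an inclusive upper bound and, as one possible workaround rather than a fix for the decoding error itself, catches the exception per page so that a single bad page does not abort the whole run.

```java
import java.util.ArrayList;
import java.util.List;

public class PageLoopSketch {

    /** Hypothetical stand-in for a call such as parse.getTextFromPage(i). */
    interface PageTextSource {
        String getTextFromPage(int pageNumber) throws Exception;
    }

    /**
     * Visits pages 1..pageCount inclusive, appends each page's text to out,
     * and returns the numbers of the pages that could not be extracted.
     */
    static List<Integer> extractAll(int pageCount, PageTextSource source, StringBuilder out) {
        List<Integer> failedPages = new ArrayList<>();
        // Pages are numbered from 1, so the upper bound must be inclusive.
        for (int i = 1; i <= pageCount; i++) {
            try {
                String page = source.getTextFromPage(i);
                if (page != null) {
                    out.append(page);
                }
            } catch (Exception e) {
                // Record the failing page and carry on with the next one.
                failedPages.add(i);
            }
        }
        return failedPages;
    }

    public static void main(String[] args) {
        // Fake 12-page document whose page 11 fails, mimicking the question.
        PageTextSource fake = pageNumber -> {
            if (pageNumber == 11) {
                throw new ArrayIndexOutOfBoundsException("Invalid index: 62");
            }
            return "page " + pageNumber + "\n";
        };
        StringBuilder out = new StringBuilder();
        List<Integer> failed = extractAll(12, fake, out);
        System.out.println("Failed pages: " + failed);
        System.out.print(out);
    }
}
```

Catching Exception this broadly is only reasonable for a one-off conversion tool; the list of failed pages at least tells you which pages still need attention.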
Upvotes: 1