Reputation: 467
I'm extracting text from a PDF document. The PDF was generated by a WS that reads data from an AS400. When I print the extracted text, the output looks like this:
orem ipsum dolor sit amet, **«VS123»** In eros risus, «VS124» sed felis quis, commodo interdum tellus. Donec vitae massa
«VS123» and «VS124» are variables from the AS400. The Java API is not able to read the values of these variables, so it prints the variable names instead of their values.
I'm using PDFBox (https://pdfbox.apache.org/) to extract the text. The source code is:
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class App
{
    public static void main( String[] args ) throws IOException
    {
        // try-with-resources closes the document, so no explicit close() is needed
        try (PDDocument document = PDDocument.load(new File("C:/my.pdf"))) {
            if (!document.isEncrypted()) {
                PDFTextStripper tStripper = new PDFTextStripper();
                String pdfFileInText = tStripper.getText(document);
                // split the extracted text into lines
                String[] lines = pdfFileInText.split("\\r?\\n");
                for (String line : lines) {
                    System.out.println(line);
                }
            }
        }
    }
}
The output starts with these warnings:
AVERTISSEMENT: Invalid ToUnicode CMap in font ArialMT
nov. 16, 2017 8:08:24 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
AVERTISSEMENT: No Unicode mapping for CID+77 (77) in font ArialMT
nov. 16, 2017 8:08:24 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
AVERTISSEMENT: No Unicode mapping for CID+111 (111) in font ArialMT
nov. 16, 2017 8:08:24 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
AVERTISSEMENT: No Unicode mapping for CID+110 (110) in font ArialMT
nov. 16, 2017 8:08:24 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
AVERTISSEMENT: No Unicode mapping for CID+116 (116) in font ArialMT
nov. 16, 2017 8:08:24 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
AVERTISSEMENT: No Unicode mapping for CID+97 (97) in font ArialMT
nov. 16, 2017 8:08:24 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
AVERTISSEMENT: No Unicode mapping for CID+32 (32) in font ArialMT
I also tried to extract the text using iText:
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
import java.io.IOException;

public class App {
    private static final String FILE_NAME = "C:/my.pdf";

    public static void main(String[] args) {
        PdfReader reader;
        try {
            reader = new PdfReader(FILE_NAME);
            String textFromPage = PdfTextExtractor.getTextFromPage(reader, 1);
            System.out.println(textFromPage);
            reader.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Here is the relevant part of the PDF document:
When I try to extract the text, or copy and paste it, the output is:
CLIENT N° «VS35» « VS36 » CONTRAT N° «VS28»
The link to the PDF File: https://drive.google.com/file/d/1RNea028nCReIVS8nRWNlBwUwBsDOhDYg/view?usp=sharing
Upvotes: 0
Views: 1221
Reputation: 18861
The variables are rendered white in the PDF, as can be seen with PDFDebugger (excerpt from the second content stream of page 1):
BT
/F3 9 Tf
1 0 0 1 70.944 30.6 Tm
1 g
1 G
[ (\253) ] TJ
ET
BT
1 0 0 1 75.984 30.6 Tm
[ (VS1) -2 (1) -3 (3) ] TJ
ET
"1 g" sets the fill color to the maximum value in /DeviceGray, which is white. So that part draws "«VS113" in white.
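To make it easier to see how the TJ operands in the excerpt combine into "VS113", here is a minimal Java sketch. It is not part of PDFBox; the class and method names are made up for illustration, and it assumes a TJ array that contains only simple literal strings (no escaped parentheses, no hex strings). The numbers between the strings are only spacing adjustments, so they are skipped.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only (not a PDFBox class).
final class TjArraySketch {

    // Joins the literal strings of a TJ array and skips the numeric
    // spacing adjustments between them. Escaped parentheses and hex
    // strings are not handled in this sketch.
    static String joinTjStrings(String tjArray) {
        Matcher m = Pattern.compile("\\(([^)]*)\\)").matcher(tjArray);
        StringBuilder text = new StringBuilder();
        while (m.find()) {
            text.append(m.group(1));
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Operand of the second TJ in the excerpt above
        System.out.println(joinTjStrings("[ (VS1) -2 (1) -3 (3) ]")); // prints VS113
    }
}
```

The first TJ in the excerpt, (\253), is an octal escape for byte 171 (0xAB), which is the opening guillemet «.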
The values come much later in the PDF... One of them appears at the end of the content stream of the XObject form (a sequence of PDF operations) "X2":
BT
1.0 0.0 0.0 1.0 153.3 457.35144 Tm
0.0 3.57696 Td
0 Tr
/DeviceRGB cs
0.0 0.0 0.0 sc
/TCCZPJ+ArialMT 11.04 Tf
[ (\0003\0001\0008\000 \0009\0007\0008\000 \0000\0001\0002) ] TJ
0.0 -3.57696 Td
ET
"0.0 0.0 0.0 sc" sets the fill color to black, and the TJ line two lines below it contains "318 978 012". This text can't be extracted because of an error reading the /ToUnicode stream. That stream should map each character code to a Unicode value, but the mapping is missing. (The mapping may look visually obvious here, but that is not always the case.)
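As an illustration of why the text looks "visually obvious" here, the following Java sketch decodes the TJ operand by hand. It rests on a loud assumption that happens to fit this file but is not valid for PDFs in general: each two-byte code is treated as if it were equal to the Unicode code point (the warnings in the question point the same way, since CID 77, 111, 110, 116, 97 and 32 match the ASCII codes of "M", "o", "n", "t", "a" and space). The class and method names are hypothetical, and this is not how PDFBox works internally.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper, for illustration only (not a PDFBox class).
final class CidFallbackSketch {

    // Converts the body of a PDF literal string (the text between the
    // parentheses) to bytes. Only octal escapes such as \000 and
    // backslash-escaped single characters are handled; other escapes
    // (\n, \r, ...) are out of scope for this sketch.
    static byte[] literalToBytes(String body) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < body.length()) {
            char c = body.charAt(i);
            if (c == '\\' && i + 1 < body.length()) {
                int j = i + 1;
                int value = 0;
                int digits = 0;
                // An octal escape has at most three digits
                while (j < body.length() && digits < 3
                        && body.charAt(j) >= '0' && body.charAt(j) <= '7') {
                    value = value * 8 + (body.charAt(j) - '0');
                    j++;
                    digits++;
                }
                if (digits > 0) {
                    out.write(value & 0xFF);
                    i = j;
                } else {
                    out.write(body.charAt(i + 1));
                    i += 2;
                }
            } else {
                out.write(c);
                i++;
            }
        }
        return out.toByteArray();
    }

    // ASSUMPTION: every two-byte big-endian code equals its Unicode code point.
    static String decodeTwoByteCodes(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k + 1 < bytes.length; k += 2) {
            int code = ((bytes[k] & 0xFF) << 8) | (bytes[k + 1] & 0xFF);
            sb.appendCodePoint(code);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Operand of the TJ operator in the "X2" excerpt above
        String operand =
                "\\0003\\0001\\0008\\000 \\0009\\0007\\0008\\000 \\0000\\0001\\0002";
        System.out.println(decodeTwoByteCodes(literalToBytes(operand))); // prints 318 978 012
    }
}
```

A fallback like this is only a guess at the mapping; with a subset font whose codes are not aligned with Unicode it would produce wrong text.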
The only thing that is weird is that Adobe Reader gets the values.
Looking at the components of the PDF, it seems that in a first step, a PDF is generated with these "variables" printed white on white. In a second step, a second program finds these variables and prints the actual text in their place.
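Since the white-on-white placeholders stay in the content stream, they show up in any extracted text. A minimal Java sketch for filtering them out afterwards could look like the following. It assumes that every placeholder has the shape seen in the question's output («VS35», « VS36 »), and the class and method names are made up for illustration. Note that this only removes the placeholder names; it does not recover the values.

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class PlaceholderFilterSketch {

    // Matches guillemets (U+00AB, U+00BB) around "VS" plus digits,
    // with optional whitespace inside the guillemets.
    private static final Pattern PLACEHOLDER =
            Pattern.compile("\u00AB\\s*VS\\d+\\s*\u00BB");

    // Removes the placeholders and collapses the whitespace left behind.
    static String stripPlaceholders(String line) {
        String withoutTokens = PLACEHOLDER.matcher(line).replaceAll("");
        return withoutTokens.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // The line from the question, written with Unicode escapes
        String line = "CLIENT N\u00B0 \u00ABVS35\u00BB \u00AB VS36 \u00BB "
                + "CONTRAT N\u00B0 \u00ABVS28\u00BB";
        System.out.println(stripPlaceholders(line));
    }
}
```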
Upvotes: 2
Reputation: 1187
AFAIK, the PDF doesn't contain the variable data as displayed in the text. If there are any variables there, they might have been converted for use by its own interactivity interface (e.g. SVG interactivity).
So when the PDF was generated, the variable names were converted to strings and the actual variable data might have been renamed.
Upvotes: 0