Reputation: 41
Problem Statement: I am unable to read data from a PDF file using SAS.
What worked well: I am able to download the PDF from the website and save it.
Not working (Need Help): I am not able to read the data from the PDF file using SAS. The structure of the source content is expected to always remain the same. The expected output is attached as a JPG image.
It would be a great help, and a good learning experience, if someone could show me how to tackle this scenario with a SAS program.
I tried something like this:
/*Proxy address*/
%let proxy_host=xxx.com;
%let port=123;
/*Output location*/
filename output "/desktop/Response.pdf";
/*Download the source file and save it in the desired location*/
proc http
url="https://cdn.nar.realtor/sites/default/files/documents/ehs-10-2020-overview-2020-11-19_0.pdf"
method="get"
proxyhost="&proxy_host."
proxyport=&port
out=output;
run;
%let lineSize = 2000;
data base;
format text_line $&lineSize..;
infile output lrecl=&lineSize;
input text_line $;
run;
DATA _NULL_ ;
X "PS2ASCII /desktop/Response.pdf /desktop/flatfile.txt";
RUN;
Upvotes: 2
Views: 3445
Reputation: 27508
You can use the Apache PDFBox® library, an open source Java tool for working with PDF documents. The library can be used from within SAS Proc GROOVY with Java code that strips the text, and its position on the page, from a PDF document.
You will have to write more code to make a data set from the stripped text.
Example:
filename overview "overview.pdf";
filename ov_text "overview.txt";
* download a pdf document;
proc http
url="https://cdn.nar.realtor/sites/default/files/documents/ehs-10-2020-overview-2020-11-19_0.pdf"
method="get"
/*proxyhost="&proxy_host." */
/*proxyport=&port */
out=overview;
run;
* download the Apache PDFBox library (a .jar file);
filename jar 'pdfbox.jar';
%if %sysfunc(FEXIST(jar)) ne 1 %then %do;
proc http
url='https://www.apache.org/dyn/closer.lua?filename=pdfbox/2.0.21/pdfbox-app-2.0.21.jar&action=download'
out=jar;
run;
%end;
* Use GROOVY to read the PDF, strip out the text and position, and write that
* parse to a text file which SAS can read;
proc groovy classpath="pdfbox.jar";
submit
"%sysfunc(pathname(overview))" /* the input, a pdf file */
"%sysfunc(pathname(ov_text))" /* the output, a text file */
;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
import java.io.FileWriter;
import java.io.PrintWriter;
public class GetLinesFromPDF extends PDFTextStripper {
static List<String> lines = new ArrayList<String>();
public GetLinesFromPDF() throws IOException {
}
/**
* @throws IOException If there is an error parsing the document.
*/
public static void main( String[] args ) throws IOException {
PDDocument document = null;
PrintWriter out = null;
String inPdf = args[0];
String outTxt = args[1];
try {
document = PDDocument.load( new File(inPdf) );
PDFTextStripper stripper = new GetLinesFromPDF();
stripper.setSortByPosition( true );
stripper.setStartPage( 1 ); // PDFTextStripper page numbers are 1-based
stripper.setEndPage( document.getNumberOfPages() );
Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
stripper.writeText(document, dummy);
out = new PrintWriter(new FileWriter(outTxt));
// print lines to text file
for(String line:lines){
out.println(line);
}
}
finally {
if( document != null ) {
document.close();
}
if( out != null ) {
out.close();
}
}
}
/**
* Override the default functionality of PDFTextStripper.writeString()
*/
@Override
protected void writeString(String str, List<TextPosition> textPositions) throws IOException {
String places = "";
for(TextPosition tp:textPositions){
places += "(" + tp.getX() + "," + tp.getY() + ") ";
}
lines.add(str + " found @ " + places);
}
}
endsubmit;
quit;
* preview the stripped text that was saved;
data _null_;
infile ov_text;
input;
putlog _infile_;
run;
/*
* additional SAS code will be needed to input the text as data
* and construct a data set that matches the original tabular content layout
*/
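As a starting point for that additional code, here is a minimal sketch of a DATA step that reads the stripped text back in. It assumes every line has the form written by the Groovy code above, that is, the text, then " found @ ", then one "(x,y)" pair per character, and it keeps only the first pair as the position of the fragment. The data set name ov_lines and the variable names text, x and y are made up for illustration, and the length of 200 for text is a guess you may need to raise.

```sas
data ov_lines;
  infile ov_text lrecl=32767 truncover;
  length text $200 coords $200;
  input;

  /* locate the marker written by writeString() in the Groovy code */
  pos = find(_infile_, ' found @ ');
  if pos = 0 then delete;          /* skip lines without the marker */

  /* everything before the marker is the stripped text */
  if pos > 1 then text = substr(_infile_, 1, pos - 1);

  /* first "(x,y)" pair after the 9-character marker */
  coords = scan(substrn(_infile_, pos + 9), 1, ' ');
  x = input(scan(coords, 1, '(,)'), ?? best32.);
  y = input(scan(coords, 2, '(,)'), ?? best32.);

  drop pos coords;
run;

/* fragments that share a y value sit on the same printed row */
proc sort data=ov_lines;
  by y x;
run;
```

After the sort, fragments with the same y are adjacent and ordered left to right, which is one possible basis for rebuilding the table rows. How the fragments map to columns depends on the layout of the PDF, so that part still has to be worked out by inspecting the x values.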
Upvotes: 2