Reputation: 33
I've been trying for a while now to extract the text of a PDF page into an NSString, and the only thing I can find are methods that search for specific string values.
What I'd like to do is parse a single page of a PDF without using any external libraries such as PDFKitten, PDFKit, etc.
I'd like to have the data in an NSArray, NSString or NSDictionary if possible.
Thanks :D!
Here is part of what I've tried so far:
CGPDFDocumentRef MyGetPDFDocumentRef (const char *filename) {
    CFStringRef path;
    CFURLRef url;
    CGPDFDocumentRef document;
    size_t count;
    path = CFStringCreateWithCString (NULL, filename, kCFStringEncodingUTF8);
    url = CFURLCreateWithFileSystemPath (NULL, path, kCFURLPOSIXPathStyle, false);
    CFRelease (path);
    document = CGPDFDocumentCreateWithURL (url);
    CFRelease (url);
    if (document == NULL) { // the file is missing or is not a valid PDF
        printf("`%s' could not be opened as a PDF!\n", filename);
        return NULL;
    }
    count = CGPDFDocumentGetNumberOfPages (document);
    if (count == 0) {
        printf("`%s' needs at least one page!\n", filename);
        CGPDFDocumentRelease (document); // don't leak the document on the error path
        return NULL;
    }
    return document;
}
// Operator-table callbacks invoked while scanning the page's content stream
static void op_MP (CGPDFScannerRef s, void *info) {
    const char *name;
    if (!CGPDFScannerPopName(s, &name))
        return;
    printf("MP /%s\n", name);
}
static void op_DP (CGPDFScannerRef s, void *info) {
    CGPDFObjectRef properties;
    const char *name;
    // DP takes two operands (tag, properties); properties is on top of the stack
    if (!CGPDFScannerPopObject(s, &properties) || !CGPDFScannerPopName(s, &name))
        return;
    printf("DP /%s\n", name);
}
static void op_BMC (CGPDFScannerRef s, void *info) {
    const char *name;
    if (!CGPDFScannerPopName(s, &name))
        return;
    printf("BMC /%s\n", name);
}
static void op_BDC (CGPDFScannerRef s, void *info) {
    CGPDFObjectRef properties;
    const char *name;
    // BDC takes two operands (tag, properties); properties is on top of the stack
    if (!CGPDFScannerPopObject(s, &properties) || !CGPDFScannerPopName(s, &name))
        return;
    printf("BDC /%s\n", name);
}
static void op_EMC (CGPDFScannerRef s, void *info) {
    // EMC takes no operands, so there is nothing to pop
    printf("EMC\n");
}
void MyDisplayPDFPage (CGContextRef myContext, size_t pageNumber, const char *filename) {
    CGPDFDocumentRef document;
    CGPDFPageRef page;
    size_t totalPages;
    document = MyGetPDFDocumentRef (filename);
    if (document == NULL)
        return;
    totalPages = CGPDFDocumentGetNumberOfPages(document);
    if (pageNumber < 1 || pageNumber > totalPages) { // page numbers start at 1
        CGPDFDocumentRelease (document);
        return;
    }
    page = CGPDFDocumentGetPage (document, pageNumber); // use the requested page, not always page 1
    CGPDFDictionaryRef d;
    CGPDFScannerRef myScanner;
    CGPDFOperatorTableRef myTable;
    myTable = CGPDFOperatorTableCreate();
    CGPDFOperatorTableSetCallback (myTable, "MP", &op_MP);
    CGPDFOperatorTableSetCallback (myTable, "DP", &op_DP);
    CGPDFOperatorTableSetCallback (myTable, "BMC", &op_BMC);
    CGPDFOperatorTableSetCallback (myTable, "BDC", &op_BDC);
    CGPDFOperatorTableSetCallback (myTable, "EMC", &op_EMC);
    CGPDFContentStreamRef myContentStream = CGPDFContentStreamCreateWithPage (page);
    myScanner = CGPDFScannerCreate (myContentStream, myTable, NULL);
    CGPDFScannerScan (myScanner);
    CGPDFStringRef str;
    d = CGPDFPageGetDictionary(page);
    // Try to read a string entry stored under the key "Lorem" in the page dictionary
    if (CGPDFDictionaryGetString(d, "Lorem", &str)) {
        CFStringRef s;
        s = CGPDFStringCopyTextString(str);
        if (s != NULL) {
            NSLog(@"%@ testing it", s);
            CFRelease(s); // release only a non-NULL string; CFRelease(NULL) crashes
        }
    }
    // Release everything created above
    CGPDFScannerRelease (myScanner);
    CGPDFContentStreamRelease (myContentStream);
    CGPDFOperatorTableRelease (myTable);
    CGPDFDocumentRelease (document);
}
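- (void)viewDidLoad {
    [super viewDidLoad];
    NSString *path = [[NSBundle mainBundle] pathForResource:@"TestPage" ofType:@"pdf"];
    if (path != nil) { // pathForResource:ofType: returns nil when the file is not in the bundle
        MyDisplayPDFPage(UIGraphicsGetCurrentContext(), 1, [path UTF8String]);
    }
}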
Upvotes: 2
Views: 1212
Reputation: 31311
Quartz provides functions that let you inspect the PDF document structure and the content stream. Inspecting the document structure lets you read the entries in the document catalog and the contents associated with each entry. By recursively traversing the catalog, you can inspect the entire document.
A PDF content stream is just what its name suggests: a sequential stream of data such as 'BT /F71 12 Tf (draw this text) Tj . . . ' where PDF operators and their operands are mixed with the actual PDF content. Inspecting the content stream requires that you access it sequentially.
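To make the "sequential stream" idea concrete, here is a minimal sketch in plain C (not Quartz) that walks a content-stream fragment token by token and collects the string operand of each Tj operator. The function name `extract_tj_text` is hypothetical, and the sketch assumes the stream is already decompressed and in memory, that text appears only as literal `( ... )` strings shown with Tj, and that the bytes can be read as plain ASCII; real PDFs usually break these assumptions (compressed streams, TJ arrays, font-specific encodings). With Quartz itself, the corresponding step would be registering a "Tj" callback with CGPDFOperatorTableSetCallback and reading the operand with CGPDFScannerPopString.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, for illustration only.
 * Scans `stream` left to right and appends the literal-string operand of
 * every Tj operator to `out` (capacity `cap`), separated by single spaces.
 * Returns the number of Tj operators that had a string operand.
 * Only the escapes \( \) and \\ are meaningfully handled; other escape
 * sequences simply keep the character after the backslash. */
static int extract_tj_text(const char *stream, char *out, size_t cap)
{
    char pending[256];     /* most recent literal-string operand */
    int have_pending = 0;  /* 1 if `pending` directly precedes the next token */
    int found = 0;
    size_t used = 0;
    const char *p = stream;

    assert(stream != NULL && out != NULL && cap > 0);
    out[0] = '\0';

    while (*p != '\0') {
        if (isspace((unsigned char)*p)) {
            p++;
            continue;
        }
        if (*p == '(') {
            /* Literal string: copy up to the matching ')' */
            size_t n = 0;
            int depth = 1;
            p++;
            while (*p != '\0') {
                if (*p == '\\' && p[1] != '\0') {
                    p++;                    /* keep the escaped character */
                } else if (*p == '(') {
                    depth++;
                } else if (*p == ')') {
                    depth--;
                    if (depth == 0) {       /* closing parenthesis found */
                        p++;
                        break;
                    }
                }
                if (n + 1 < sizeof pending)
                    pending[n++] = *p;
                p++;
            }
            pending[n] = '\0';
            have_pending = 1;
            continue;
        }
        /* Any other token (operator, number, name): read to the next delimiter */
        const char *start = p;
        while (*p != '\0' && !isspace((unsigned char)*p) && *p != '(')
            p++;
        if ((size_t)(p - start) == 2 && strncmp(start, "Tj", 2) == 0 && have_pending) {
            int w = snprintf(out + used, cap - used, "%s%s", found ? " " : "", pending);
            if (w > 0) {
                used += (size_t)w;
                if (used >= cap)
                    used = cap - 1;         /* output was truncated */
            }
            found++;
        }
        have_pending = 0;  /* an operand only counts for the very next token */
    }
    return found;
}
```

Called on the fragment `"BT /F71 12 Tf (draw this text) Tj ET"` with a 64-byte buffer, this returns 1 and leaves `draw this text` in the buffer. The operands come first and the operator last, which is why the sketch has to remember the most recent string until it sees Tj, and why the stream cannot be read out of order.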
Apple's developer documentation shows how to examine the structure of a PDF document and how to parse its contents.
Upvotes: 4