Lunayo

Reputation: 538

Getting text in PDF with toUnicode

I am working on a PDF project where I need to extract all the text from a PDF. I have a problem decoding an Identity-H font using the ToUnicode table provided by the PDF itself. The ToUnicode table maps character codes to Unicode hex values, but it does not seem to provide a mapping from the uppercase CID characters to Unicode. Is there a way to lowercase the input unichar before mapping it to Unicode with the table?

Can I use the offset between <000C> and <0042> to calculate the uppercase character?

The ToUnicode table:

57 beginbfchar
<0001> <0020>
<0002> <0021>
<0003> <0026>
<0004> <2019>
<0005> <002C>
<0006> <002D>
<0007> <002E>
<0008> <003A>
<0009> <003F>
<000A> <0040>
<000B> <0041>
<000C> <0042>
<000D> <0043>
<000E> <0044>
<000F> <0045>
<0010> <0046>
<0011> <0047>
<0012> <0048>
<0013> <0049>
<0014> <004A>
<0015> <004B>
<0016> <004C>
<0017> <004D>
<0018> <004F>
<0019> <0050>
<001A> <0052>
<001B> <0053>
<001C> <0054>
<001D> <0055>
<001E> <0057>
<001F> <0059>
<0020> <2018>
<0021> <0061>
<0022> <0062>
<0023> <0063>
<0024> <0064>
<0025> <0065>
<0026> <0066>
<0027> <0067>
<0028> <0068>
<0029> <0069>
<002A> <006A>
<002B> <006B>
<002C> <006C>
<002D> <006D>
<002E> <006E>
<002F> <006F>
<0030> <0070>
<0031> <0072>
<0032> <0073>
<0033> <0074>
<0034> <0075>
<0035> <0077>
<0036> <0079>
<0037> <007A>
<0038> <FB01>
<0039> <00FC>
endbfchar

The table does not seem to provide glyphs that map to uppercase characters. So how can I show those characters?
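On the offset idea: in the table above the difference between a code and its Unicode value is not constant (<0017> maps to <004D>, but the next code <0018> maps to <004F>), so a single offset cannot cover every character and a per-entry lookup is needed. Below is a minimal sketch in C, which also compiles as Objective-C. The names `BfChar`, `parse_bfchar` and `lookup_unicode` are hypothetical, and the only thing assumed about the input is the `<code> <unicode>` line format shown in the table.

```c
#include <stdio.h>
#include <stddef.h>

/* One bfchar entry: character code -> Unicode value (hypothetical type). */
typedef struct {
    unsigned int code;
    unsigned int unicode;
} BfChar;

/* Parse one line of the form "<000C> <0042>".
   Returns 1 if both hex values were read, 0 otherwise. */
static int parse_bfchar(const char *line, BfChar *out)
{
    return sscanf(line, " <%x> <%x>", &out->code, &out->unicode) == 2;
}

/* Linear search through the parsed entries.
   Returns the mapped Unicode value, or U+FFFD when the code is not listed. */
static unsigned int lookup_unicode(const BfChar *table, size_t count,
                                   unsigned int code)
{
    for (size_t i = 0; i < count; i++) {
        if (table[i].code == code) {
            return table[i].unicode;
        }
    }
    return 0xFFFD;
}
```

With the entries from the table loaded, looking up code 0x000C should give 0x0042, and a code that is not listed falls back to the replacement character instead of a guessed value.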

Upvotes: 1

Views: 1414

Answers (1)

Lunayo

Reputation: 538

I solved the problem. The issue was in CGPDFStringCopyTextString(): the string it produces from a CGPDFStringRef contained some weird bytes that I didn't want. So instead I read the bytes manually:

// data is assumed to hold the raw bytes of the PDF string.
const unsigned char *bytes = [data bytes];
NSMutableString *unicodeString = [NSMutableString string];
for (NSUInteger i = 0; i < [data length]; i++) {
    // Widen each byte to a unichar and append it as one character.
    unichar unicodeChar = bytes[i];
    [unicodeString appendFormat:@"%C", unicodeChar];
}
return unicodeString;
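With an Identity-H font each character code is two bytes wide, so the raw bytes usually have to be paired up before they can be looked up in the ToUnicode table. Here is a minimal sketch of that step in plain C (usable from Objective-C as is). The function name `bytes_to_codes` is hypothetical, and it assumes the bytes come straight from the PDF string, for example via CGPDFStringGetBytePtr() and CGPDFStringGetLength().

```c
#include <stddef.h>

/* Combine consecutive byte pairs into 16-bit big-endian character codes.
   Writes at most `capacity` codes and returns how many were written.
   A trailing unpaired byte is ignored. */
static size_t bytes_to_codes(const unsigned char *bytes, size_t length,
                             unsigned short *codes, size_t capacity)
{
    size_t n = 0;
    for (size_t i = 0; i + 1 < length && n < capacity; i += 2) {
        /* High byte first, then low byte. */
        codes[n++] = (unsigned short)((bytes[i] << 8) | bytes[i + 1]);
    }
    return n;
}
```

For the bytes 00 0C 00 21 this yields the codes 0x000C and 0x0021, which the table in the question maps to <0042> and <0061>.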

Upvotes: 1
