Andrew Rehn

Reputation: 11

In Adobe JavaScript, how can I search a page for underlined or strikethrough text?

Using JavaScript in Adobe Acrobat, I'm trying to identify each page in a PDF that contains strikethrough or underlined text, and save those pages into a new document. However, this.getPageNthWord(p, n) seems to return plain text only, so I can't detect whether a word is underlined or struck through.

Is there a version of getPageNthWord that returns rich text? This seems like it should be trivial. The documentation for Adobe JavaScript is rather miserable, but maybe my googling is just failing me.

Here's what I've got so far:

var pageArray = [];
for (var p = 0; p < this.numPages; p++) {
    // iterate over all words
    for (var n = 0; n < this.getPageNumWords(p); n++) {
        console.println(this.getPageNthWord(p, n));
        if ((this.getPageNthWord(p, n).style.strikeThru === true) || (this.getPageNthWord(p, n).style.underline === true)) {
            pageArray.push(p);
            break;
        }
    }
}

if (pageArray.length > 0) {
    // extract all flagged pages into a new document
    var d = app.newDoc();    // this will add a blank page - we need to remove that once we are done
    for (var n = 0; n < pageArray.length; n++) {
        d.insertPages({
            nPage: d.numPages - 1,
            cPath: this.path,
            nStart: pageArray[n],   // insert one flagged page per call
            nEnd: pageArray[n]
        });
    }
    // remove the blank first page
    d.deletePages(0);
}

Edit: I was told in the question staging area that this isn't possible in Adobe, but maybe someone knows an answer. If you're searching for this question or a similar one yourself, and there isn't a better answer below, the reason is that Adobe doesn't track this kind of information with the text. It isn't "rich text" as I assumed above, so Adobe has no idea whether the text is underlined. (Maybe.)

Upvotes: 1

Views: 50

Answers (1)

K J

Reputation: 11857

A common request when working with PDF is how to detect font styles such as colour, boldness, italics, underlining or strike-through.

The answer is that those are attributes of the source document. They are used when the PDF page is generated, but the PDF format itself does not need them, so they are usually discarded.

Take a simplified example. A "Word Processor" file is a binary format, made from 0s and 1s, that stores styled text for humans to read.


Luckily, a PDF page description can be kept as human-readable text, like this:

%PDF-1.7
2 0 obj <</Count 1/Kids[3 0 R]/Type/Pages>> endobj
4 0 obj <</BaseFont/Helvetica/Encoding/WinAnsiEncoding/Subtype/TrueType/Type/Font>> endobj
3 0 obj <</Contents 5 0 R/MediaBox[0 0 595 420]/Parent 2 0 R/Resources<</Font<</F0 4 0 R/F1 4 0 R>>>>/Type/Page>> endobj
5 0 obj <</Length 241>> stream
1 0 0 1 72 353 cm 0 g BT /F0 14 Tf [(u)1(n)3(d)2(e)0(r)2(l)1(i)0(n)3(e)] TJ ET
1 0 0 1 125 -0 cm BT /F1 14 Tf (strike-through) Tj ET
1 0.0 0.5 1 118 -0 cm 1 0 0 RG 1 w BT 2 Tr /F0 14 Tf (oblique) Tj ET
-248 -2.8 70 1 re f -128 2.8 100 1 re f
endstream
endobj
1 0 obj <</Pages 2 0 R/Type/Catalog>> endobj
xref
0 6
0000000000 65536 f 
0000000544 00000 n 
0000000009 00000 n 
0000000151 00000 n 
0000000060 00000 n 
0000000272 00000 n 
trailer
<</Size 6/Root 1 0 R/ID[<04C6CA97EF7D410CBE0367F920112706><EFADF3C661AFE2335B415110BCB9838B>]>>
startxref
0607
%%EOF

The point to note about a PDF is that the objects do not need to appear in human reading order. The cross-reference table (xref) indexes each object by its byte offset, so a program can find them in whatever sequence it needs.
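To illustrate that indexing, here is a minimal sketch in plain JavaScript (meant for Node.js, not for the Acrobat console). The helper name `parseXref` is hypothetical, and it assumes a classic uncompressed xref table like the one in the sample file, not a cross-reference stream.

```javascript
// Hypothetical helper: map object numbers to byte offsets from a classic xref table.
// Assumes plain-text entries of the form "0000000544 00000 n".
function parseXref(xrefText) {
  const table = {};
  let objNum = 0;
  for (const raw of xrefText.split(/\r?\n/)) {
    const line = raw.trim();
    // Subsection header, e.g. "0 6": first object number and entry count.
    const header = /^(\d+) (\d+)$/.exec(line);
    if (header) {
      objNum = Number(header[1]);
      continue;
    }
    // Entry: 10-digit offset, 5-digit generation, "n" (in use) or "f" (free).
    const entry = /^(\d{10}) (\d{5}) ([nf])$/.exec(line);
    if (entry) {
      table[objNum] = { offset: Number(entry[1]), inUse: entry[3] === 'n' };
      objNum++;
    }
  }
  return table;
}
```

Fed the xref section of the sample file, this lists object 1 (the catalog) at offset 544, after every other object, even though it carries the lowest in-use number.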

So the displayed styled text is stored as plain single-byte values, which may be readable characters as above. The spaced ("kerned") underlined text is written as (u)1(n)3(d)2(e)0(r)2(l)1(i)0(n)3(e), but text does not need to be split this way, as the other strings show, for example the word "oblique", which is stored as ordinary upright text.

Thus "underline" is composed of 9 three-byte strings of the form (x), and the other two words are each stored as one plain-text string on a separate line.
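As a rough illustration of how a reader recovers the word from those fragments, here is a minimal sketch in plain JavaScript (for Node.js, not the Acrobat console). The function name `textFromTJ` is hypothetical, and it only handles simple literal strings: escape sequences are not decoded, and nested parentheses and hex strings are not supported.

```javascript
// Hypothetical helper: join the literal strings of a TJ array operand,
// ignoring the kerning numbers between them.
function textFromTJ(operand) {
  const literal = /\(((?:\\.|[^\\()])*)\)/g; // "(...)" with backslash escapes skipped over
  const parts = [];
  let m;
  while ((m = literal.exec(operand)) !== null) {
    parts.push(m[1]);
  }
  return parts.join('');
}

console.log(textFromTJ('[(u)1(n)3(d)2(e)0(r)2(l)1(i)0(n)3(e)] TJ'));
```

The result is only the characters. Nothing in the operand says the word is underlined.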

So what applies styles such as italic or bold or U͟n͟d͟e͟r͟l͟i͟n͟e͟ is the page description of an area, written as shorthand operators.

The example above shows two fonts in use. However, they are exactly the same object, "Upright Helvetica". There is no red text, nor italic, nor underlined, etc. The text, same as here, is plain, but prefixed with the font short codes /F0 and /F1 (here both point to the same object). What affects the text's appearance is the graphics state of the page area, which at times is set to stroke in red (1 0 0 RG), to use a given line width (1 w), or to apply a slanting transform (the cm matrix). As for the underline and strike-through, those are simply black rectangles (re f) added later.
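If you can get at an uncompressed content stream outside Acrobat, one heuristic is to look for thin filled rectangles and treat them as candidate underline or strike-through bars. The sketch below is plain JavaScript for Node.js. The name `findThinFilledRects` and the `maxHeight` threshold are assumptions for illustration. It ignores the cm transforms, so it does not tell you which text a bar sits under, and it will also flag table rules or borders.

```javascript
// Hypothetical helper: find "x y w h re f" rectangles whose height is small,
// i.e. candidates for underline / strike-through bars.
// Assumes the content stream is already decompressed text.
function findThinFilledRects(content, maxHeight = 2) {
  const num = '(-?\\d*\\.?\\d+)';
  const pattern = new RegExp(
    `${num}\\s+${num}\\s+${num}\\s+${num}\\s+re\\s+f\\b`, 'g');
  const rects = [];
  let m;
  while ((m = pattern.exec(content)) !== null) {
    const [x, y, w, h] = m.slice(1, 5).map(Number);
    // Keep only bars that are much wider than they are tall.
    if (Math.abs(h) <= maxHeight && Math.abs(w) > Math.abs(h)) {
      rects.push({ x, y, w, h });
    }
  }
  return rects;
}

console.log(findThinFilledRects('-248 -2.8 70 1 re f -128 2.8 100 1 re f'));
```

On the last line of the sample stream this finds two bars, each 1 unit high, which are the "underline" and "strike-through" of the example.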

Upvotes: 1
