Reputation: 37
I started using PDF Clown a few weeks ago. My goal is multi-word highlighting, mainly in newspapers. Starting from the org.pdfclown.samples.cli.TextHighlightSample example, I succeeded in extracting the positions of multi-word matches and highlighting them. In most cases I even solved some problems caused by text ordering and matching.
Unfortunately, my framework includes FPDI, which does not take PDF annotations into account. So all content outside the page content stream, such as text annotations and other so-called markup annotations, gets lost.
So, is there any suggestion for creating "Text Highlighting" with PDF Clown without using PDF annotations?
Upvotes: 0
Views: 752
Reputation: 96064
To have the highlight in the actual page content stream instead of in an annotation, one has to put the graphics commands into the page content stream. In the org.pdfclown.samples.cli.TextHighlightSample example they are implicitly put into the normal appearance stream of the annotation.
This can be implemented like this:
    org.pdfclown.files.File file = new org.pdfclown.files.File(resource);
    Pattern pattern = Pattern.compile("S", Pattern.CASE_INSENSITIVE);
    TextExtractor textExtractor = new TextExtractor(true, true);
    for (final Page page : file.getDocument().getPages())
    {
        final List<Quad> highlightQuads = new ArrayList<Quad>();
        Map<Rectangle2D, List<ITextString>> textStrings = textExtractor.extract(page);
        final Matcher matcher = pattern.matcher(TextExtractor.toString(textStrings));
        textExtractor.filter(textStrings, new TextExtractor.IIntervalFilter()
        {
            @Override
            public boolean hasNext()
            {
                return matcher.find();
            }
            @Override
            public Interval<Integer> next()
            {
                return new Interval<Integer>(matcher.start(), matcher.end());
            }
            @Override
            public void process(Interval<Integer> interval, ITextString match)
            {
                {
                    Rectangle2D textBox = null;
                    for (TextChar textChar : match.getTextChars())
                    {
                        Rectangle2D textCharBox = textChar.getBox();
                        if (textBox == null)
                        {
                            textBox = (Rectangle2D) textCharBox.clone();
                        }
                        else
                        {
                            if (textCharBox.getY() > textBox.getMaxY())
                            {
                                highlightQuads.add(Quad.get(textBox));
                                textBox = (Rectangle2D) textCharBox.clone();
                            }
                            else
                            {
                                textBox.add(textCharBox);
                            }
                        }
                    }
                    highlightQuads.add(Quad.get(textBox));
                }
            }
            @Override
            public void remove()
            {
                throw new UnsupportedOperationException();
            }
        });
        // Highlight the text pattern match!
        ExtGState defaultExtGState = new ExtGState(file.getDocument());
        defaultExtGState.setAlphaShape(false);
        defaultExtGState.setBlendMode(Arrays.asList(BlendModeEnum.Multiply));
        PrimitiveComposer composer = new PrimitiveComposer(page);
        composer.getScanner().moveEnd();
        // TODO: reset graphics state here.
        composer.applyState(defaultExtGState);
        composer.setFillColor(new DeviceRGBColor(1, 1, 0));
        {
            for (Quad markupBox : highlightQuads)
            {
                Point2D[] points = markupBox.getPoints();
                double markupBoxHeight = points[3].getY() - points[0].getY();
                double markupBoxMargin = markupBoxHeight * .25;
                composer.drawCurve(new Point2D.Double(points[3].getX(), points[3].getY()),
                        new Point2D.Double(points[0].getX(), points[0].getY()),
                        new Point2D.Double(points[3].getX() - markupBoxMargin, points[3].getY() - markupBoxMargin),
                        new Point2D.Double(points[0].getX() - markupBoxMargin, points[0].getY() + markupBoxMargin));
                composer.drawLine(new Point2D.Double(points[1].getX(), points[1].getY()));
                composer.drawCurve(new Point2D.Double(points[2].getX(), points[2].getY()),
                        new Point2D.Double(points[1].getX() + markupBoxMargin, points[1].getY() + markupBoxMargin),
                        new Point2D.Double(points[2].getX() + markupBoxMargin, points[2].getY() - markupBoxMargin));
                composer.fill();
            }
        }
        composer.flush();
    }
    file.save(new File(RESULT_FOLDER, "multiPage-highlight-content.pdf"), SerializationModeEnum.Incremental);
(HighlightInContent.java method testHighlightInContent)
You will recognize the text extraction framework from the original example. The only difference is that the quads of a whole page are now collected before they are processed, and the processing code (mostly borrowed from TextMarkup.refreshAppearance()) draws shapes representing the quads into the page content.
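The box-collecting step inside process() can be looked at in isolation. The following is a minimal sketch using only java.awt.geom, not PDF Clown: the class name LineBoxMerger and the method mergeIntoLines are hypothetical names chosen for illustration. It assumes the character boxes arrive in reading order and that y grows downward, as the comparison in the code above suggests. Unlike the code above, it also guards against an empty match.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDF Clown.
final class LineBoxMerger {

    /** Merges character boxes (in reading order) into one rectangle per text line. */
    static List<Rectangle2D> mergeIntoLines(List<Rectangle2D> charBoxes) {
        List<Rectangle2D> lines = new ArrayList<>();
        Rectangle2D current = null;
        for (Rectangle2D box : charBoxes) {
            if (current == null) {
                // First character: start the first line box.
                current = (Rectangle2D) box.clone();
            } else if (box.getY() > current.getMaxY()) {
                // The character starts below the current box: close it and start a new line.
                lines.add(current);
                current = (Rectangle2D) box.clone();
            } else {
                // Same line: grow the current box to the union of both rectangles.
                current.add(box);
            }
        }
        if (current != null) {
            lines.add(current); // Close the last line (skipped for an empty input).
        }
        return lines;
    }

    public static void main(String[] args) {
        List<Rectangle2D> chars = Arrays.asList(
                new Rectangle2D.Double(10, 100, 5, 10),  // first line
                new Rectangle2D.Double(15, 100, 5, 10),  // first line, next character
                new Rectangle2D.Double(10, 120, 5, 10)); // second line
        for (Rectangle2D line : mergeIntoLines(chars)) {
            System.out.println(line);
        }
    }
}
```

Each resulting rectangle corresponds to one quad that would be passed to Quad.get(...) in the answer's code, so a match that wraps across lines yields one highlight shape per line.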
Beware: to make this work generically, the graphics state has to be reset before the new instructions are inserted (the position is marked with a TODO comment). This can be done either by applying save/restore state or by explicitly counteracting unwanted changes to state entries. Unfortunately, I did not see how to do the former in PDF Clown and have not yet had the time to do the latter.
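To illustrate the save/restore idea at the level of PDF operators only: q saves the graphics state and Q restores it, so enclosing the existing page content in a q ... Q pair means the appended highlight instructions start from the page's initial state. The sketch below works on plain operator text and is not PDF Clown API. The class ContentStreamIsolation and the method appendIsolated are hypothetical, and it assumes the existing content is already decoded and has balanced q/Q operators itself.

```java
// Hypothetical illustration on decoded content stream text, not PDF Clown API.
final class ContentStreamIsolation {

    /**
     * Encloses the existing operators in q ... Q so their state changes
     * (transformation matrix, colors, clipping) do not leak into the addition,
     * then appends the addition in its own q ... Q pair.
     */
    static String appendIsolated(String existing, String addition) {
        return "q\n" + existing.trim() + "\nQ\n"
             + "q\n" + addition.trim() + "\nQ\n";
    }

    public static void main(String[] args) {
        String existing = "BT /F1 12 Tf (Hi) Tj ET";      // some text drawn by the page
        String highlight = "1 1 0 rg 10 100 50 12 re f";  // yellow filled rectangle
        System.out.print(appendIsolated(existing, highlight));
    }
}
```

In a real document the page content may be split across several streams and compressed, so the opening q would have to go into a stream placed before the existing ones and the closing Q into one placed after them. How to do that with PDF Clown is exactly the part left open above.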
Upvotes: 2