algorytmus

Reputation: 973

Azure Computer Vision returns garbage for a PDF with vector graphics

The Azure Computer Vision (OCR) API returns garbage for one fragment of a PDF I send it. The PDF shows visible text, for example 4893759678, but that fragment is actually vector graphics, not text.

When I select the graphics, copy them, and paste them into Notepad, I get something like: (85;9r?A>?EV. For other parts of the PDF, where the numbers are in images (raster graphics), the analysis works fine and returns the OCR text.

How can I fix this, or how can I instruct Azure to run OCR on vector graphics? I cannot easily change the PDFs themselves.

By the way I am looking for a job as Azure developer (.NET) :)

Upvotes: 1

Views: 290

Answers (1)

Ecstasy

Reputation: 1864

Thank you K J. Posting your suggestion as an answer to help other community members.

You cannot normally cut and paste parts of a PDF, especially binary ones. The whole file must first be decrypted, disassembled, and decoded; its contents are then reconstructed as objects, and those objects are reassembled into pages. Only then can you copy parts of a page. OCR works by dissecting, analysing, and reconstructing pixels, so its input should be a lossless (not JPG) pixel image of the vectors.
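As a rough illustration of that advice, here is a minimal Python sketch that renders each PDF page to a lossless PNG and submits it to the Read API. It rests on several assumptions that are not from the question: PyMuPDF (`fitz`) is installed for rendering, the resource exposes the v3.2 Read endpoint (`/vision/v3.2/read/analyze`), and 300 DPI is enough for the digits to be legible. The helper names (`render_pages_to_png`, `build_read_request`, `extract_lines`, `read_text`) are hypothetical.

```python
import json
import time
import urllib.request


def render_pages_to_png(pdf_path, dpi=300):
    """Rasterize every page to PNG bytes (lossless), vectors included.

    Assumption: PyMuPDF is installed (pip install pymupdf).
    """
    import fitz  # PyMuPDF, imported lazily so the other helpers work without it

    with fitz.open(pdf_path) as doc:
        return [page.get_pixmap(dpi=dpi).tobytes("png") for page in doc]


def build_read_request(endpoint, key, png_bytes):
    """Build the POST request that submits one image to the Read API (v3.2 assumed)."""
    url = endpoint.rstrip("/") + "/vision/v3.2/read/analyze"
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "application/octet-stream",
    }
    return urllib.request.Request(url, data=png_bytes, headers=headers, method="POST")


def extract_lines(result):
    """Collect the recognized text lines from a finished Read result."""
    pages = result.get("analyzeResult", {}).get("readResults", [])
    return [line["text"] for page in pages for line in page.get("lines", [])]


def read_text(endpoint, key, png_bytes, poll_seconds=1.0, max_polls=30):
    """Submit an image, poll the operation URL, and return the text lines."""
    with urllib.request.urlopen(build_read_request(endpoint, key, png_bytes)) as resp:
        operation_url = resp.headers["Operation-Location"]

    poll = urllib.request.Request(
        operation_url, headers={"Ocp-Apim-Subscription-Key": key}
    )
    for _ in range(max_polls):
        with urllib.request.urlopen(poll) as resp:
            result = json.load(resp)
        if result.get("status") == "succeeded":
            return extract_lines(result)
        if result.get("status") == "failed":
            raise RuntimeError("Read operation failed")
        time.sleep(poll_seconds)
    raise TimeoutError("Read operation did not finish in time")
```

Usage would look roughly like `for png in render_pages_to_png("input.pdf"): print(read_text(endpoint, key, png))`, with `endpoint` and `key` taken from your own Computer Vision resource. Because each page is rasterized before it is sent, the service only ever sees pixels, so text drawn as vectors is treated the same way as the raster images that already work.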

You can refer to "Azure Read API for Vector PDFs", "Optical character recognition Read API", and "How to extract images from PDF files using c# and itextsharp".

Upvotes: 2
