darego101

Reputation: 329

Extracting checkbox data from PDFs with the Azure Read/OCR API

I have thousands of survey forms that I need to scan and upload to my C# system so that I can extract the data and enter it into a database. The surveys contain a mix of 1) hand-written text boxes and 2) checkboxes. I am currently using the Azure Read API to extract the hand-written text, which should work fine; e.g. question #4 below returns 'Python' and 'coding'.

So my question is: will any Azure API (Read, OCR, etc.) let me extract which checkbox is marked? E.g. see question #1 below: I need a string back saying 'disagree'. Is this possible with any Azure API, or will I need to look elsewhere? If so, what API or library can I use to get hand-marked checkbox data?

Can somebody with iText7 or IronOCR experience tell me whether these libraries would allow me to extract the checkbox data below?

Survey Example:

[Image: example survey form]

Upvotes: 1

Views: 1604

Answers (1)

David C

Reputation: 531

There is no straightforward answer here: you will need to write custom code that parses the PDF yourself with the help of a third-party library.

Since your forms have a known layout, you know where the checkboxes are. Construct a dictionary that maps each "checkbox name" to its "checkbox data", with one entry per checkbox on the page. The data object could look like this:

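// Pixel bounds of one checkbox on the rasterized page, plus its detected state.
public class CheckboxData {
    // Top-left corner of the checkbox, in image pixels.
    public int startX { get; set; }
    public int startY { get; set; }
    // Bottom-right corner of the checkbox, in image pixels.
    public int endX { get; set; }
    public int endY { get; set; }
    // Set after analysing the pixels inside the bounds.
    public bool IsChecked { get; set; }
}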
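As a rough sketch of the dictionary idea, the snippet below maps option names to their checkbox data and returns the name of the marked option, which is how a string such as 'disagree' could come back for question #1. The `SurveyLayout` class, the `Question1` and `SelectedAnswer` helpers, the option names, and all coordinates are hypothetical; the real coordinates have to be measured on your own scanned form.

```csharp
using System.Collections.Generic;
using System.Linq;

// Same shape as the data object described in the answer.
public class CheckboxData
{
    public int startX { get; set; }
    public int startY { get; set; }
    public int endX { get; set; }
    public int endY { get; set; }
    public bool IsChecked { get; set; }
}

public static class SurveyLayout
{
    // Hypothetical layout for one question: option name -> checkbox bounds in pixels.
    public static Dictionary<string, CheckboxData> Question1()
    {
        return new Dictionary<string, CheckboxData>
        {
            ["agree"]    = new CheckboxData { startX = 100, startY = 200, endX = 120, endY = 220 },
            ["neutral"]  = new CheckboxData { startX = 200, startY = 200, endX = 220, endY = 220 },
            ["disagree"] = new CheckboxData { startX = 300, startY = 200, endX = 320, endY = 220 },
        };
    }

    // Returns the name of the single checked option, or null when
    // no option (or more than one option) is checked.
    public static string SelectedAnswer(Dictionary<string, CheckboxData> boxes)
    {
        var marked = boxes.Where(b => b.Value.IsChecked).Select(b => b.Key).ToList();
        return marked.Count == 1 ? marked[0] : null;
    }
}
```

Returning null for zero or multiple marks is a design choice for this sketch; it lets ambiguous forms be flagged for manual review instead of guessing.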

I would recommend using IronOCR to rasterize the PDF to an image.

With your image, iterate over the checkbox dictionary and, using the bounding points, move pixel by pixel and read the colour of each pixel. Store the colours in a list and then compute the average colour of all pixels within the checkbox. If the average crosses a threshold you have chosen for "checked" (with dark ink on light paper, that means the average is darker than the threshold), set the IsChecked boolean.
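The averaging step might look something like the sketch below. It assumes the rasterized page has already been converted to a grayscale array indexed as `[y, x]` (0 = black, 255 = white); how you get that array depends on your imaging library and is not shown. It also assumes dark ink on light paper, so a marked box has a lower mean brightness than an empty one. `CheckboxReader`, its method names, and the default threshold of 200 are illustrative placeholders, not values from any library.

```csharp
using System;

public static class CheckboxReader
{
    // Mean brightness of the rectangle, with both corners inclusive.
    public static double AverageBrightness(byte[,] gray, int startX, int startY, int endX, int endY)
    {
        long sum = 0;
        int count = 0;
        for (int y = startY; y <= endY; y++)
        {
            for (int x = startX; x <= endX; x++)
            {
                sum += gray[y, x];
                count++;
            }
        }
        if (count == 0) throw new ArgumentException("The region contains no pixels.");
        return (double)sum / count;
    }

    // Ink darkens the region, so the box counts as checked when the mean
    // brightness falls below the threshold. Calibrate the threshold on real scans.
    public static bool IsChecked(byte[,] gray, int startX, int startY, int endX, int endY,
                                 double threshold = 200)
    {
        return AverageBrightness(gray, startX, startY, endX, endY) < threshold;
    }
}
```

It is probably worth shrinking the bounds slightly so the printed border of the box is excluded; otherwise the border itself darkens the average of an empty box.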

For radio-style checkboxes you will probably need a different data object. Store the centre pixel of the circle (centreX and centreY) along with its radius, and use Bresenham's circle algorithm (also known as the midpoint circle algorithm) to work out which pixels around the centre to check.
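A minimal sketch of the midpoint (Bresenham-style) circle algorithm follows. It returns the pixels lying on a circle outline for a given centre and radius; `CirclePixels` and `Outline` are names made up for this example. Note that it yields only the outline, so to sample the inside of a radio button you would presumably call it with one or more radii smaller than the printed circle.

```csharp
using System.Collections.Generic;

public static class CirclePixels
{
    // Walks one octant of the circle with integer arithmetic and mirrors
    // each point into the other seven octants.
    public static List<(int X, int Y)> Outline(int centreX, int centreY, int radius)
    {
        // A set removes the duplicates produced where octants meet.
        var points = new HashSet<(int, int)>();
        int x = radius;
        int y = 0;
        int err = 1 - radius; // decision variable

        while (x >= y)
        {
            points.Add((centreX + x, centreY + y));
            points.Add((centreX + y, centreY + x));
            points.Add((centreX - y, centreY + x));
            points.Add((centreX - x, centreY + y));
            points.Add((centreX - x, centreY - y));
            points.Add((centreX - y, centreY - x));
            points.Add((centreX + y, centreY - x));
            points.Add((centreX + x, centreY - y));

            y++;
            if (err < 0)
            {
                err += 2 * y + 1;
            }
            else
            {
                x--;
                err += 2 * (y - x) + 1;
            }
        }
        return new List<(int X, int Y)>(points);
    }
}
```

The returned coordinates can then be read from the image and averaged in the same way as the pixels of a rectangular checkbox.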

Below is an example of reading the pixel coordinates under the cursor in GIMP.

[Image: getting pixel coordinates of an image file]

Upvotes: 2
