Michael Adams

Programmatically break apart a PDF created by a scanner into separate PDF documents

I have PDF documents produced by a scanner. Each PDF contains the forms filled out and signed by staff for a day's work. I want to place a barcode, or a standard area for OCR text, on every form type so that the batch scan can be programmatically broken apart into separate PDF documents based on form type.

I would like to do this in Microsoft .NET 2.0.

I can purchase the required Adobe or other namespaces/DLLs needed to accomplish the task if no open source ones are available.

Upvotes: 2

Views: 3035

Answers (6)

almog.ori

Reputation: 7889

Check out the Tesseract .NET wrapper (v2.04.0) around the C++ OCR engine of the same name. The engine was originally developed at HP, mostly between 1985 and 1994, and it won awards for its ingenuity.

Upvotes: 0

Drejc

Reputation: 14286

There are several options. Try these free tools:

Upvotes: 0

joshperry

Reputation: 42307

From the title of your question I'm assuming that you just need to break apart PDF files and that they are already OCR'd. There are a few open source .NET PDF libraries out there. I have successfully used PDFSharp in a project of my own.

Here is a quick snippet that shows how to cull out each page from a PDF document using PDFSharp:

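using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;

string filePath = @"c:\file.pdf";

// Import mode is required to copy pages into another document;
// pages from a ReadOnly document cannot be added elsewhere.
using (PdfDocument ipdf = PdfReader.Open(filePath, PdfDocumentOpenMode.Import))
{
    int i = 1;
    foreach (PdfPage page in ipdf.Pages)
    {
        // Write each page to its own single-page document.
        using (PdfDocument opdf = new PdfDocument())
        {
            opdf.Version = ipdf.Version;
            opdf.AddPage(page);

            opdf.Save("page " + i++ + ".pdf");
        }
    }
}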

Assuming also that you need to access the text in the document for grouping, you can use the PdfPage.Contents property, which exposes the page's raw content streams.
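Once each page has a form-type code (from a barcode or an OCR zone), the grouping itself is independent of the PDF library. Here is a rough sketch of that step, written without LINQ so it should compile for .NET 2.0. It assumes the code is printed only on the first page of each form, so an uncoded page continues the previous form. The names PageGroup, BatchSplitter, "UNKNOWN", and the sample form types are made up for illustration and are not part of PDFsharp.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: describes one output PDF as a run of pages.
public class PageGroup
{
    public string FormType;  // code read from the barcode/OCR zone
    public int FirstPage;    // zero-based index into the batch scan
    public int PageCount;

    public PageGroup(string formType, int firstPage)
    {
        FormType = formType;
        FirstPage = firstPage;
        PageCount = 1;
    }
}

public static class BatchSplitter
{
    // pageCodes[i] is the code found on page i, or null when the page has
    // no code (assumed to be a continuation of the previous form).
    public static List<PageGroup> GroupPages(IList<string> pageCodes)
    {
        List<PageGroup> groups = new List<PageGroup>();
        PageGroup current = null;

        for (int i = 0; i < pageCodes.Count; i++)
        {
            string code = pageCodes[i];
            if (code != null || current == null)
            {
                // A coded page, or an uncoded first page, starts a new document.
                current = new PageGroup(code ?? "UNKNOWN", i);
                groups.Add(current);
            }
            else
            {
                current.PageCount++;
            }
        }
        return groups;
    }
}

public static class Program
{
    public static void Main()
    {
        string[] codes = { "TIMESHEET", null, "EXPENSE", "TIMESHEET", null, null };
        foreach (PageGroup g in BatchSplitter.GroupPages(codes))
        {
            Console.WriteLine("{0}: pages {1}-{2}",
                g.FormType, g.FirstPage, g.FirstPage + g.PageCount - 1);
        }
        // Prints:
        // TIMESHEET: pages 0-1
        // EXPENSE: pages 2-2
        // TIMESHEET: pages 3-5
    }
}
```

Each PageGroup can then drive a loop like the one above: add pages FirstPage through FirstPage + PageCount - 1 to a new output document and save it under a name built from FormType. Starting a new document on every coded page, rather than only when the code changes, keeps two consecutive forms of the same type from being merged.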

Upvotes: 1

StingyJack

Reputation: 19479

iTextSharp will help you split, reassemble, and apply barcodes to PDFs in .NET languages. I don't think it can OCR a document, but I haven't looked (I used the ABBYY FineReader Engine).

Upvotes: 1

Will Rickards

Reputation: 2786

You can research the iTextSharp library, which can split PDF files. But it isn't very good at reading the actual PDF content, so I have no idea how it would know where to split them.

There are companies that already do this for you. You can research the KwikTag company.

Upvotes: 1

Brian Genisio

Reputation: 48147

Not a free or open source option, but you might also look at ABCpdf by WebSupergoo as another alternative to Adobe.

Upvotes: 2
