Reputation: 51

Searching for a string in PDF files

I am working on a school project that has several PDF files. It needs a search-by-name feature: I type in a student's name, and every PDF file containing his/her name should open. What is the best way to do this? I've looked for solutions on the net, and all I keep coming up with is iTextSharp, which is making me more confused.

Is this possible? Could someone please give me a link to a tutorial or something similar? :) Thank you very much.

Upvotes: 5

Answers (5)

K J

Reputation: 11940

Depending on your system this task can be trivial.

On Windows user workstations or database servers you can use an iFilter with cached indexing; over time this becomes the fastest method.

Traditionally, Acrobat can search multiple files through its own internal index:

If you work with large numbers of related PDFs, you can define them as a catalog in Acrobat Pro, which generates a PDF index for the PDFs. Searching the PDF index—instead of the PDFs themselves—dramatically speeds up searches.

On Windows you can install any PDF iFilter and use the native Windows file search, without Acrobat Pro or even without Acrobat at all, just the search bar. It too can be quicker than a slow full search.

There are also many applications that search PDF files for a text string, and some can cache results for later use. See tools like Everything (not indexed) or AgentRansack (indexed).

Systems without an iFilter will need a different approach:

Linux & (cyg)Windows - pdfgrep

Windows CLI/GUI - dnGrep

But the simplest approach for a cross-platform caller on a small corpus of PDFs is to loop over a directory or file list with pdftotext, using an OS-specific pipe.
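As a rough illustration of that last option, here is a minimal C# sketch that runs pdftotext once per file and reads the text from its standard output (pdftotext writes to stdout when the output file is given as "-"). It assumes pdftotext is installed and on the PATH, and that the project targets .NET Core 2.1 or later (for ProcessStartInfo.ArgumentList). The class name PdfTextSearch, both method names, and the optional extractText parameter are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;

public static class PdfTextSearch
{
    // Runs "pdftotext <file> -" and returns whatever it wrote to standard output.
    // Assumption: the pdftotext executable can be found on the PATH.
    public static string ExtractText(string pdfPath)
    {
        var psi = new ProcessStartInfo
        {
            FileName = "pdftotext",
            RedirectStandardOutput = true,
            UseShellExecute = false,
            CreateNoWindow = true
        };
        psi.ArgumentList.Add(pdfPath);
        psi.ArgumentList.Add("-"); // "-" sends the extracted text to stdout

        using (var process = Process.Start(psi))
        {
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return text;
        }
    }

    // Returns the paths of all *.pdf files in the directory whose text contains the name.
    // extractText is a hypothetical hook so another extractor can be swapped in;
    // when it is omitted, pdftotext is used.
    public static List<string> FindPdfsContaining(
        string directory, string name, Func<string, string> extractText = null)
    {
        if (extractText == null)
        {
            extractText = ExtractText;
        }

        var matches = new List<string>();
        foreach (string path in Directory.EnumerateFiles(directory, "*.pdf"))
        {
            // Case-insensitive substring match on the extracted text.
            if (extractText(path).IndexOf(name, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                matches.Add(path);
            }
        }
        return matches;
    }
}
```

Calling PdfTextSearch.FindPdfsContaining(@"C:\Temp", "Jane Doe") would then give the list of files to open. Every search re-extracts every file, so this only suits a small corpus.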

Upvotes: 0

Wolfgang Grinfeld

Reputation: 1028

I tend to use Apache PDFBox for that (written in Java, but usable in .NET 5+ as well as the .NET Framework).

To use it, install IKVM from NuGet:

Install-Package IKVM -Version 8.2.0

Download the required jar files and reference them in your project:

<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net6.0</TargetFramework>
    <ImplicitUsings>disable</ImplicitUsings>
    <Nullable>disable</Nullable>
  </PropertyGroup>

  <ItemGroup>
    <PackageReference Include="IKVM" Version="8.2.0" />
  </ItemGroup>

  <ItemGroup>
    <IkvmReference Include="commons-logging-1.2.jar" />
    <IkvmReference Include="fontbox-3.0.0-alpha3.jar">
      <References>commons-logging-1.2.jar</References>
    </IkvmReference>
    <IkvmReference Include="pdfbox-3.0.0-alpha3.jar">
      <References>commons-logging-1.2.jar;fontbox-3.0.0-alpha3.jar</References>
    </IkvmReference>
  </ItemGroup>

</Project>

Then use PDFBox in C#:

using org.apache.pdfbox.io;
using org.apache.pdfbox.pdfparser;
using org.apache.pdfbox.text;

public class Program
{
    // Extracts the text of every page of the PDF as a single string.
    public static string getTextFromPdf(string pdfPath)
    {
        using (var input = new RandomAccessReadBufferedFile(pdfPath))
        {
            var parser = new PDFParser(input);
            var pdDoc = parser.parse();
            try
            {
                var pdfStripper = new PDFTextStripper();
                return pdfStripper.getText(pdDoc);
            }
            finally
            {
                // Close the parsed document so its resources are released.
                pdDoc.close();
            }
        }
    }

    public static void Main(string[] args)
    {
        var res = getTextFromPdf(@"C:\Temp\test.pdf");
        System.Console.WriteLine(res);
    }
}

The returned string can then be searched using regular expressions (System.Text.RegularExpressions) or similar.
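For the search step, something along these lines might do. It is only a sketch using System.Text.RegularExpressions: the NameSearch class and ContainsName method are made-up names, and the choice to allow any run of whitespace between the parts of the name is an assumption on my side, made because extracted text often has a line break between a first and last name.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class NameSearch
{
    // Returns true if the extracted PDF text contains the student's name.
    // The match ignores case and accepts any whitespace (spaces, tabs,
    // line breaks) between the words of the name.
    public static bool ContainsName(string pdfText, string studentName)
    {
        // Split the name on whitespace and drop empty pieces.
        string[] parts = studentName.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        if (parts.Length == 0)
        {
            return false; // an empty name would otherwise match every document
        }

        // Escape each word so characters like '.' are matched literally.
        string pattern = string.Join(@"\s+", parts.Select(p => Regex.Escape(p)));
        return Regex.IsMatch(pdfText, pattern,
            RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
    }
}
```

You would call it as NameSearch.ContainsName(getTextFromPdf(path), "Jane Doe") for each file and open the ones where it returns true.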

Upvotes: 0

Bobrovsky

Reputation: 14246

I think your task may be split as follows:

1. Build an index of the PDF files.
2. Write some code that uses the index to locate the relevant PDFs whenever a search is performed.
3. Write some code that opens the PDFs that were found, or shows a warning if nothing was found.

To build the index you may use an integrated solution like Apache Lucene or Lucene.Net, or convert each PDF into text and build the index from the text yourself.

The other two steps are fairly trivial and depend on the language/technology used in the first step.
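To make the three steps concrete, here is a minimal hand-rolled sketch in C#. It is not Lucene and not Docotic.Pdf: it is a plain in-memory word index, and it assumes you already have the text of each PDF from whatever extractor you choose. The class name PdfIndex and its members are made up for the example. A search returns the files that contain every word of the query, in any position, so it is looser than an exact phrase match.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text.RegularExpressions;

public class PdfIndex
{
    // Maps each word (case-insensitive) to the set of PDF paths containing it.
    private readonly Dictionary<string, HashSet<string>> _filesByWord =
        new Dictionary<string, HashSet<string>>(StringComparer.OrdinalIgnoreCase);

    // Splits text into runs of letters and digits.
    private static IEnumerable<string> Tokenize(string text)
    {
        return Regex.Matches(text, @"[\p{L}\p{N}]+").Cast<Match>().Select(m => m.Value);
    }

    // Step 1: add the already extracted text of one PDF to the index.
    public void Add(string pdfPath, string extractedText)
    {
        foreach (string word in Tokenize(extractedText))
        {
            if (!_filesByWord.TryGetValue(word, out HashSet<string> files))
            {
                files = new HashSet<string>();
                _filesByWord[word] = files;
            }
            files.Add(pdfPath);
        }
    }

    // Step 2: return the PDFs that contain every word of the query, sorted by path.
    public List<string> Search(string query)
    {
        HashSet<string> result = null;
        foreach (string word in Tokenize(query))
        {
            if (!_filesByWord.TryGetValue(word, out HashSet<string> files))
            {
                return new List<string>(); // one word is missing everywhere
            }
            if (result == null)
            {
                result = new HashSet<string>(files);
            }
            else
            {
                result.IntersectWith(files);
            }
        }
        return result == null
            ? new List<string>()
            : result.OrderBy(p => p, StringComparer.Ordinal).ToList();
    }

    // Step 3: hand the file to the operating system's default PDF viewer.
    public static void OpenWithDefaultViewer(string pdfPath)
    {
        Process.Start(new ProcessStartInfo(pdfPath) { UseShellExecute = true });
    }
}
```

If Search returns an empty list, show the warning; otherwise call OpenWithDefaultViewer for each result. For anything beyond a small set of files, a real index such as Lucene.Net is likely the better choice.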

Your question is tagged as related to .NET, so you may try the Docotic.Pdf library for index building (disclaimer: I work for Bit Miracle).

Docotic.Pdf may be used to extract text from PDF files as plain text or as a collection of text chunks with coordinates for each chunk.

Upvotes: 2

Carter Medlin

Reputation: 12495

Use iTextSharp. It's free and you only need the "itextsharp.dll".

http://sourceforge.net/projects/itextsharp/

Here is a simple function for reading the text out of a PDF.

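Public Shared Function GetTextFromPDF(PdfFileName As String) As String
    Dim oReader As New iTextSharp.text.pdf.PdfReader(PdfFileName)
    Dim sOut As New System.Text.StringBuilder()

    Try
        ' Pages are numbered from 1 in iTextSharp.
        For i As Integer = 1 To oReader.NumberOfPages
            ' Use a fresh extraction strategy for each page.
            Dim its As New iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy()

            sOut.Append(iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(oReader, i, its))
        Next
    Finally
        ' Release the file held open by the reader.
        oReader.Close()
    End Try

    Return sOut.ToString()
End Function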

Now you can search through those files with ease.

Upvotes: 4

peter.murray.rust

Reputation: 38071

PDF is a very complex specification, and it is possible to create so many variants that it is impossible to parse reliably unless you use the same tools to read it as were used to create it (and often not even then). There are several tools which flatten PDF to a text string (e.g. pdf2text), and it may be possible to search the result, but this is unreliable.

Many PDF tools only implement some of the spec. Some people suggest that the best way to search PDF is to reduce it to an image and then OCR that.

Upvotes: 2
