Munavvar

Reputation: 821

Extract text from a PDF file on an FTP server using iTextSharp

I am working on a document management project and I want to extract text from PDF files. How can I achieve this? I am already using iTextSharp to extract text from PDFs on my local system.

This is the function I am using for this purpose. Here, path is an FTP server path.

    // Needs: System.Text, iTextSharp.text.pdf, iTextSharp.text.pdf.parser
    public static string ExtractTextFromPdf(string path)
    {
        using (PdfReader reader = new PdfReader(path))
        {
            StringBuilder text = new StringBuilder();

            // iTextSharp page numbers start at 1
            for (int i = 1; i <= reader.NumberOfPages; i++)
            {
                text.Append(PdfTextExtractor.GetTextFromPage(reader, i));
            }

            return text.ToString();
        }
    }

It throws this exception:

'ftp:\\###\index\500199.pdf not found as file or resource.'

[### is my ftp server]

Upvotes: 1

Views: 1534

Answers (1)

Chris Haas

Reputation: 55437

PdfReader has several constructor overloads, but most of them rely on RandomAccessSourceFactory to convert whatever is passed in into a stream. When you pass in a string, it first checks whether the string is a file on disk and, if not, whether it can be converted to a Uri with a file:/, http:// or https:// scheme. This is your first point of failure: none of these checks handles the ftp protocol, so you ultimately end up at a local resource loader, which doesn't work for you.

You could try converting your string to an explicit Uri, but that won't actually work either:

// This won't work
new PdfReader(new Uri(path));

The reason this won't work is that iText tells .NET to use CredentialCache.DefaultCredentials when loading remote resources, but that concept doesn't exist in the FTP world.

Long story short, when using FTP you'll want to download the files on your own. Depending on their size, you'll want to download them either to disk or into a byte array. Below is a sample of the latter:

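// Initialized to null so the variable is definitely assigned for non-FTP paths
byte[] bytes = null;
if (path.StartsWith("ftp://", StringComparison.OrdinalIgnoreCase)) {
    // WebRequest.Create returns an FtpWebRequest for ftp:// URIs
    var request = WebRequest.Create(path);
    using (var response = request.GetResponse()) {
        using (var responseStream = response.GetResponseStream()) {
            // Read the whole response into memory
            bytes = iTextSharp.text.io.StreamUtil.InputStreamToArray(responseStream);
        }
    }
}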

You can then pass either the local file or the byte array to the PdfReader constructor.
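For completeness, here is a rough sketch of both download options (byte array and disk) that uses only the base class library, so it doesn't depend on any iTextSharp helper. The class name FtpPdfDownloader, its method names and the credentials parameter are hypothetical names chosen for illustration. It assumes your FTP server accepts either anonymous access (pass null) or a NetworkCredential, and that WebRequest is still available on your target framework (newer .NET versions mark it obsolete).

```csharp
using System;
using System.IO;
using System.Net;

// Hypothetical helper class; not part of iTextSharp.
public static class FtpPdfDownloader
{
    // True when the path parses as an absolute URI with the ftp scheme.
    public static bool IsFtpPath(string path)
    {
        Uri uri;
        return Uri.TryCreate(path, UriKind.Absolute, out uri)
            && uri.Scheme == Uri.UriSchemeFtp;
    }

    // Builds a download request. Pass null credentials for anonymous access.
    private static WebRequest CreateDownloadRequest(string ftpUrl, ICredentials credentials)
    {
        var request = WebRequest.Create(ftpUrl);
        request.Method = WebRequestMethods.Ftp.DownloadFile;
        if (credentials != null)
        {
            request.Credentials = credentials;
        }
        return request;
    }

    // Option 1: read the whole file into memory (reasonable for small PDFs).
    public static byte[] DownloadToBytes(string ftpUrl, ICredentials credentials)
    {
        var request = CreateDownloadRequest(ftpUrl, credentials);
        using (var response = request.GetResponse())
        using (var responseStream = response.GetResponseStream())
        using (var buffer = new MemoryStream())
        {
            responseStream.CopyTo(buffer);
            return buffer.ToArray();
        }
    }

    // Option 2: stream the file to a temporary file on disk (for large PDFs).
    // The caller is responsible for deleting the file afterwards.
    public static string DownloadToTempFile(string ftpUrl, ICredentials credentials)
    {
        string localPath = Path.GetTempFileName();
        var request = CreateDownloadRequest(ftpUrl, credentials);
        using (var response = request.GetResponse())
        using (var responseStream = response.GetResponseStream())
        using (var file = File.Create(localPath))
        {
            responseStream.CopyTo(file);
        }
        return localPath;
    }
}
```

With this in place, the result of DownloadToBytes can go to new PdfReader(bytes), and the path returned by DownloadToTempFile can go to new PdfReader(localPath). The page loop in ExtractTextFromPdf stays the same.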

Upvotes: 2
