Reputation: 821
I am working on a document management project and I want to extract text from PDF files. How can I achieve this? I am using iTextSharp, which works for PDFs on the local system.
This is the function I am using for this purpose. The path is an FTP server path.
public static string ExtractTextFromPdf(string path)
{
    using (PdfReader reader = new PdfReader(path))
    {
        StringBuilder text = new StringBuilder();
        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            text.Append(PdfTextExtractor.GetTextFromPage(reader, i));
        }
        return text.ToString();
    }
}
It throws an exception
'ftp:\\###\index\500199.pdf not found as file or resource.'
[### is my ftp server]
Upvotes: 1
Views: 1534
Reputation: 55437
PdfReader has a number of constructor overloads, but most of them rely on RandomAccessSourceFactory to convert whatever is passed in into a Stream. When you pass in a string, it is first checked to see whether it is a file on disk; if not, it is checked to see whether it can be converted to a Uri as a file:/, http:// or https:// link. This is your first point of failure: none of these checks handle the ftp protocol, so you ultimately end up at a local resource loader, which doesn't work for you.
You could try converting your string to an explicit Uri, but that won't work either:
// This won't work
new PdfReader(new Uri(path));
This won't work because iText tells .NET to use CredentialCache.DefaultCredentials when loading remote resources, but that concept doesn't exist in the FTP world.
Long story short, when using FTP you'll want to download the files on your own. Depending on their size, you'll want to either download them to disk or download them into a byte array. Below is a sample of the latter:
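// Stays null when the path is not an FTP URL
byte[] bytes = null;
if (path.StartsWith("ftp://", StringComparison.OrdinalIgnoreCase)) {
    // For ftp:// URLs, WebRequest.Create returns an FtpWebRequest
    var request = WebRequest.Create(path);
    using (var response = request.GetResponse()) {
        using (var responseStream = response.GetResponseStream()) {
            // Read the entire response into memory
            bytes = iTextSharp.text.io.StreamUtil.InputStreamToArray(responseStream);
        }
    }
}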
You can then pass either the local file or the byte array to the PdfReader constructor.
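Putting the download and the extraction loop from the question together, a minimal sketch might look like the following. The class name FtpPdfText, the method name ExtractTextFromFtpPdf and the credentials parameter are hypothetical names introduced for illustration. The sketch assumes the path is a well-formed ftp:// URL with forward slashes, and that the whole file fits comfortably in memory.

```csharp
using System;
using System.Net;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class FtpPdfText
{
    // Hypothetical helper: downloads the PDF at ftpUrl into memory and
    // returns the text of all pages. Pass null credentials for anonymous access.
    public static string ExtractTextFromFtpPdf(string ftpUrl, NetworkCredential credentials)
    {
        // Assumption: ftpUrl starts with ftp://, so the cast to FtpWebRequest succeeds.
        var request = (FtpWebRequest)WebRequest.Create(ftpUrl);
        request.Method = WebRequestMethods.Ftp.DownloadFile;
        request.UseBinary = true; // PDFs must be transferred in binary mode
        if (credentials != null)
        {
            // FTP has no notion of default credentials, so supply them explicitly.
            request.Credentials = credentials;
        }

        byte[] bytes;
        using (var response = request.GetResponse())
        using (var responseStream = response.GetResponseStream())
        {
            bytes = iTextSharp.text.io.StreamUtil.InputStreamToArray(responseStream);
        }

        // Same extraction loop as in the question, reading from the byte array.
        using (var reader = new PdfReader(bytes))
        {
            var text = new StringBuilder();
            for (int i = 1; i <= reader.NumberOfPages; i++)
            {
                text.Append(PdfTextExtractor.GetTextFromPage(reader, i));
            }
            return text.ToString();
        }
    }
}
```

You would call it as FtpPdfText.ExtractTextFromFtpPdf(path, new NetworkCredential(user, password)), where user and password are placeholders for your own FTP login. For very large files, downloading to a temporary file on disk and passing that file path to PdfReader may be the better choice.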
Upvotes: 2