eat-sleep-code

Reputation: 4855

Extracting text contents of web-hosted PDF using C#?

In C# (ASP.NET MVC5), I simply need to be able to extract the text contents from a web-hosted PDF and return them as a string.

I see plenty of (probably old) examples of how to do this with a local file, but none for a web-hosted one.

Anyone have any ideas?

Upvotes: 0

Views: 280

Answers (1)

Clay07g

Reputation: 1105

The thing about web-hosted files is that you cannot read their contents unless your machine has a copy of the file. Even when you open a PDF in your browser, the browser still downloads it to your machine, if only temporarily.

Therefore, a program cannot read a file it does not have.

So the simplest approach is to download the file to your filesystem first, then reference the local copy.

You could use the WebClient class to accomplish this:

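using System.Net;
//...
// WebClient is IDisposable, so wrap it in a using block.
using (WebClient client = new WebClient())
{
    // Download the remote PDF to a local file.
    client.DownloadFile("http://website.com/mypdf.pdf", @"filepath.pdf");
}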

From there, you can run one of the local-file text extraction examples you found on "filepath.pdf", return the text as a string, then delete that file.

Note: WebClient is disposable (it implements IDisposable). Make sure to call Dispose on it or wrap it in a using statement.
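To tie the steps together (download, extract, delete), here is a minimal sketch. The class name RemotePdfText and the extractText parameter are my own placeholders, not part of any library: extractText stands in for whichever local-file PDF extraction routine you end up using. It assumes the URL is reachable and that a temp file on disk is acceptable.

```csharp
using System;
using System.IO;
using System.Net;

public static class RemotePdfText
{
    // extractText is a hypothetical placeholder: pass in whatever
    // local-file PDF text extraction routine you already have.
    public static string Extract(string url, Func<string, string> extractText)
    {
        // Creates an empty, uniquely named file in the temp directory.
        string tempPath = Path.GetTempFileName();
        try
        {
            using (var client = new WebClient())
            {
                // Overwrites the empty temp file with the downloaded bytes.
                client.DownloadFile(url, tempPath);
            }

            // Hand the local copy to the extraction routine.
            return extractText(tempPath);
        }
        finally
        {
            // Clean up the temp file even if download or extraction throws.
            File.Delete(tempPath);
        }
    }
}
```

You would call it along the lines of RemotePdfText.Extract("http://website.com/mypdf.pdf", path => YourExtractor(path)), where YourExtractor is whatever local-file method you picked. The try/finally is there so the downloaded copy does not linger on the server if something fails midway.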

Fair warning: I'm not a security expert, but I would look for ways to ensure the files aren't malicious. Either make sure your PDF reading code accounts for that, or restrict your application to websites you know don't host malware.

Upvotes: 1
