Camillo

Reputation: 544

Organizing and managing thousands of PDF-Files in PHP and MySQL

I am helping a former teacher of mine set up a website where he can exchange class documents (exams, exercise sheets for students, etc.) with his colleagues. He has personally created thousands of PDF files, which will now be available to other teachers for reference and use.

One main feature would be a search function, which will allow users to search for specific files. As there are so many documents, we need to come up with an efficient way to search through all documents.

I have thought of several approaches:

a) Manually assign 5-10 keywords to every PDF file and save them in the MySQL database along with the file's metadata. The user would search those keywords, not the PDF's content directly.

b) Use some sort of logic to extract the 10-20 most frequent keywords programmatically and save them in the MySQL database along with the file's metadata. In my opinion, this is a better approach than a).

c) Extract a large portion or all of each PDF file's text content using file_get_contents and save it in the MySQL database along with the file's metadata. The user could then search the actual text content itself. In my opinion, this would be the best approach.

d) any other approach not mentioned by me?
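
To make approach b) concrete, here is a minimal sketch of what I have in mind. It assumes the PDF text has already been extracted into a plain string; the function name `topKeywords`, the stop-word list and the minimum word length are all placeholders I made up for illustration, and it needs the mbstring extension.

```php
<?php
declare(strict_types=1);

/**
 * Return up to $limit of the most frequent words in $text.
 * Hypothetical helper: the name, stop words and $minLength are illustrative only.
 */
function topKeywords(string $text, int $limit = 10, array $stopWords = [], int $minLength = 3): array
{
    // Lowercase, then split on anything that is not a letter or digit (Unicode-aware).
    $words = preg_split('/[^\p{L}\p{N}]+/u', mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY) ?: [];

    $stop = array_flip($stopWords);
    $counts = [];
    foreach ($words as $word) {
        // Skip very short words and anything on the stop-word list.
        if (mb_strlen($word) < $minLength || isset($stop[$word])) {
            continue;
        }
        $counts[$word] = ($counts[$word] ?? 0) + 1;
    }

    arsort($counts); // highest count first

    // Purely numeric words become integer array keys, so cast back to strings.
    return array_map('strval', array_slice(array_keys($counts), 0, $limit));
}

// Example call with a tiny stop-word list:
$keywords = topKeywords('Algebra exam: solve the algebra equation. Algebra!', 1, ['the']);
```

Raw frequency will favour common words unless the stop-word list is good, so the result probably needs a manual review before it is stored.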

I am not sure about the viability of these approaches. For example, will c) consume a lot of server-side resources? We would be sifting through thousands of database rows, each holding hundreds of words of extracted text.
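
For approach c), one caveat: file_get_contents only returns the raw bytes of the PDF, and the text inside is usually stored in compressed streams, so a dedicated extractor is needed. The sketch below assumes the `pdftotext` command-line tool (from Poppler) is installed on the server and that MySQL's built-in FULLTEXT index is used for searching. The table name `documents`, its columns, and the function names are hypothetical.

```php
<?php
declare(strict_types=1);

// Hypothetical table; the FULLTEXT index covers the same columns used in MATCH().
const DOCUMENTS_SCHEMA = <<<'SQL'
CREATE TABLE documents (
    id        INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    file_path VARCHAR(255) NOT NULL,
    title     VARCHAR(255) NOT NULL,
    body_text MEDIUMTEXT   NOT NULL,
    FULLTEXT KEY ft_title_body (title, body_text)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
SQL;

// Two separate placeholders, because a named placeholder cannot be reused
// when PDO uses native prepared statements.
const SEARCH_SQL = <<<'SQL'
SELECT id, title, file_path,
       MATCH(title, body_text) AGAINST (:q_score IN NATURAL LANGUAGE MODE) AS score
FROM documents
WHERE MATCH(title, body_text) AGAINST (:q_where IN NATURAL LANGUAGE MODE)
ORDER BY score DESC
LIMIT :lim
SQL;

/** Extract plain text from a PDF. Assumes pdftotext is on the server's PATH. */
function extractPdfText(string $pdfPath): string
{
    // The trailing "-" tells pdftotext to write the text to stdout.
    $cmd = 'pdftotext -enc UTF-8 ' . escapeshellarg($pdfPath) . ' -';
    $output = shell_exec($cmd);

    return is_string($output) ? $output : '';
}

/** Run a natural-language full-text search and return the best matches first. */
function searchDocuments(PDO $pdo, string $query, int $limit = 20): array
{
    $stmt = $pdo->prepare(SEARCH_SQL);
    $stmt->bindValue(':q_score', $query, PDO::PARAM_STR);
    $stmt->bindValue(':q_where', $query, PDO::PARAM_STR);
    $stmt->bindValue(':lim', $limit, PDO::PARAM_INT);
    $stmt->execute();

    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}
```

With an index like this, MySQL looks words up in the index instead of reading every row's text on each search, so a few thousand documents should be a modest load. I have not measured it, though.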

I hope you can give me some pointers on whether I am on the right track, and what in your opinion the best approach would be. Thanks a lot in advance!

Upvotes: 2

Views: 879

Answers (1)

Alternatex

Reputation: 1552

Approach (a) is your answer (in my opinion). Searching through all the file content is not viable in practice. Extracting the 10-20 most frequent words will only mislead your search, as there is no guarantee that those words meaningfully describe the document they come from. Extracting a large portion of the text could be useful, but searching will be a lot slower, and there's no telling whether it will make the search better or worse than searching by keywords.

All that aside, this is largely opinion-based. There's no right or wrong way to go about it; approach (a) simply makes the most sense to me.
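
As a rough illustration of approach (a), the keywords could live in their own table and be linked to documents through a join table. This is only a sketch under assumed names: the tables `documents`, `keywords` and `document_keyword`, their columns, and both functions are hypothetical, and it assumes each keyword is stored once in lowercase and each (document, keyword) pair is unique.

```php
<?php
declare(strict_types=1);

// Assumed schema (hypothetical):
//   documents(id, title)
//   keywords(id, word)                        -- word is unique, lowercase
//   document_keyword(document_id, keyword_id) -- pair is unique

/** Build a query that ranks documents by how many of the given keywords they carry. */
function buildKeywordSearchSql(int $keywordCount): string
{
    if ($keywordCount < 1) {
        throw new InvalidArgumentException('At least one keyword is required.');
    }

    // One positional placeholder per keyword, e.g. "?, ?, ?".
    $placeholders = implode(', ', array_fill(0, $keywordCount, '?'));

    return "SELECT d.id, d.title, COUNT(*) AS matched
            FROM documents d
            JOIN document_keyword dk ON dk.document_id = d.id
            JOIN keywords k ON k.id = dk.keyword_id
            WHERE k.word IN ($placeholders)
            GROUP BY d.id, d.title
            ORDER BY matched DESC";
}

/** Look up documents tagged with any of the given keywords, best match first. */
function searchByKeywords(PDO $pdo, array $keywords): array
{
    // Normalise and de-duplicate so each keyword is bound exactly once.
    $keywords = array_values(array_unique(array_map('mb_strtolower', $keywords)));

    $stmt = $pdo->prepare(buildKeywordSearchSql(count($keywords)));
    $stmt->execute($keywords);

    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}
```

Only the number of placeholders is interpolated into the SQL; the keywords themselves are always bound as parameters.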

Upvotes: 1
