viren v

Reputation: 11

Searching for a word in multiple PDF files and ordering the PDFs by word count

Can anyone help me search for a word in multiple PDF files and get the number of times it occurs in each one?

I need to display the PDFs in descending order of that word count, and I have to do this in Java.

Upvotes: 1

Views: 6433

Answers (3)

Stephan

Reputation: 43013

You can use PDFBox for counting words in the PDF files:

public static int countWordInFile(String word, String filename, String fileEncoding) throws Exception {
    int count=0;
    PrintStream ps = null;
    PrintStream originalSystemOut = System.out;

    try {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        ps = new PrintStream(baos);
        System.setOut(ps);

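        // Extract the PDF text by running PDFBox's ExtractText tool; "-console"
        // sends the text to System.out, which was redirected to baos above.
        ExtractText.main(new String[] {
                "-encoding", fileEncoding,
                "-console",
                filename
        });

        // Make sure everything written so far has reached the byte buffer.
        ps.flush();
        String content = baos.toString(fileEncoding);

        // Count every occurrence of the word in the extracted text. This is a
        // case-sensitive substring match; an empty word is skipped because it
        // would otherwise match at every position and never advance.
        if (!word.isEmpty()) {
            int index = content.indexOf(word);
            while (index >= 0) {
                count++;
                index = content.indexOf(word, index + word.length());
            }
        }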

    } finally {
        IOUtils.closeQuietly(ps);
        System.setOut(originalSystemOut);
    }

    return count;
}

Upvotes: 2

Villager

Reputation: 604

It seems like you're looking for a starting point or idea rather than a specific solution - you have a few options here.

First of all, you need to make sure that the text content of the PDFs is searchable. Here's one way, for example, using Adobe Acrobat.

Secondly, you need to use some kind of API to index the PDF files so that they are searchable. Here's a section on the Apache Lucene site which may give you some hints.

Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java.

Bear in mind that there isn't much context in your question so indexing the PDFs or Lucene may be overkill for you.

I recommend Googling some approaches - try "text search pdf files", "reading pdf files java", etc.

Here's another answer to help you out, too.

Upvotes: 1

chris

Reputation: 1785

Getting data:
Download iText (a PDF library), open all the PDFs you want to scan, read the text inside them, and build a HashMap that stores word -> count(word).
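As a rough sketch of the HashMap step only, assuming the text has already been pulled out of the PDF by whatever library you use (no iText calls are shown here), something like this would build the word -> count map. The class name WordFrequency and the method countWords are made up for illustration, and lower-casing plus splitting on non-letter/non-digit characters is just one possible definition of a "word".

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class WordFrequency {

    /**
     * Builds a word -> count map from text that has already been extracted
     * from a PDF. Words are lower-cased so "PDF" and "pdf" count together.
     */
    public static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        // Split on runs of characters that are neither letters nor digits.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                // Insert 1 for a new word, otherwise add 1 to the stored count.
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // The sample sentence stands in for text extracted from a PDF.
        Map<String, Integer> counts = countWords("PDF files: one pdf, two PDFs.");
        System.out.println(counts.getOrDefault("pdf", 0)); // prints 2
    }
}
```

Looking up a single search word is then just counts.getOrDefault(word.toLowerCase(Locale.ROOT), 0), which returns 0 when the word never appears in that document.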

Sorting your hashmap:
This problem is already solved on Stack Overflow here: Sort a Map<Key, Value> by values (Java)
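For the ordering the question asks for, a minimal sketch (not the code from the linked answer) could keep one map of file name -> count for the searched word and sort its entries by value, highest first. The class name RankByCount, the method sortByCountDescending and the file names are hypothetical; ties are broken by file name here only so the output is predictable.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RankByCount {

    /** Returns the entries of countsByFile ordered by count, highest first. */
    public static List<Map.Entry<String, Integer>> sortByCountDescending(
            Map<String, Integer> countsByFile) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(countsByFile.entrySet());
        // Primary key: count in descending order. Secondary key: file name,
        // ascending, so files with equal counts come out in a stable order.
        entries.sort(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder())
                .thenComparing(Map.Entry.<String, Integer>comparingByKey()));
        return entries;
    }

    public static void main(String[] args) {
        // Hypothetical per-file counts for one search word.
        Map<String, Integer> countsByFile = new HashMap<>();
        countsByFile.put("a.pdf", 3);
        countsByFile.put("b.pdf", 7);
        countsByFile.put("c.pdf", 0);

        for (Map.Entry<String, Integer> e : sortByCountDescending(countsByFile)) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
        // prints b.pdf: 7, then a.pdf: 3, then c.pdf: 0
    }
}
```

Returning a sorted list rather than re-inserting into a HashMap matters because a HashMap does not preserve any order; a LinkedHashMap filled from the sorted list would also work if a map is needed.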

Upvotes: 1
