Reputation: 87
I have extracted text from multiple file formats (PDF, HTML, DOC) with Tika, using the code below:
File file1 = new File("c://sample.pdf");
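InputStream input = new FileInputStream(file1);
BodyContentHandler handler = new BodyContentHandler(10*1024*1024);
// Parse step not shown in the original snippet; AutoDetectParser is assumed here
new AutoDetectParser().parse(input, handler, new Metadata(), new ParseContext());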
JSONObject obj = new JSONObject();
obj.put("Content",handler.toString());
Now I need to get the most frequently occurring words from the extracted content. Can you please suggest how to do this?
Thanks
Upvotes: 1
Views: 956
Reputation: 33359
Here's a function to find the most frequent word.
Pass the extracted content to the function, and it returns the word that occurs most often.
String getMostFrequentWord(String input) {
    // Split on any run of whitespace, since extracted text contains newlines and tabs
    String[] words = input.trim().split("\\s+");
    // Create a dictionary using word as key, and frequency as value
    Map<String, Integer> dictionary = new HashMap<String, Integer>();
    for (String word : words) {
        if (dictionary.containsKey(word)) {
            int frequency = dictionary.get(word);
            dictionary.put(word, frequency + 1);
        } else {
            dictionary.put(word, 1);
        }
    }
    // Scan the dictionary for the entry with the highest count
    int max = 0;
    String mostFrequentWord = "";
    Set<Entry<String, Integer>> set = dictionary.entrySet();
    for (Entry<String, Integer> entry : set) {
        if (entry.getValue() > max) {
            max = entry.getValue();
            mostFrequentWord = entry.getKey();
        }
    }
    return mostFrequentWord;
}
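The algorithm is O(n) in the number of words (one pass to count, one pass over the distinct words to find the maximum, assuming constant-time HashMap operations), so the performance should be okay.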
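Since the question asks for frequently occurring words (plural), the same counting idea can be extended to return the top N words. The following is only a sketch under a few assumptions of mine: the class name `WordFrequency` and the method `topWords` are hypothetical, words are lower-cased, and anything that is not a letter or digit is treated as a separator. Adjust those choices to your data.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class WordFrequency {

    // Returns up to `limit` words with their counts, most frequent first.
    // Ties are broken alphabetically so the result is deterministic.
    static List<Map.Entry<String, Integer>> topWords(String text, int limit) {
        Map<String, Integer> counts = new HashMap<>();
        // Assumption: split on any run of characters that are not letters or digits,
        // which drops punctuation, tabs and newlines from the extracted text.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Sort by count descending, then by word ascending.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries.subList(0, Math.min(limit, entries.size()));
    }
}
```

With the code from the question you could call `WordFrequency.topWords(handler.toString(), 10)` and iterate over the returned entries. The sort makes this O(k log k) in the number of distinct words k, on top of the O(n) counting pass. You would probably also want to filter out stop words such as "the" and "and", which this sketch does not do.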
Upvotes: 4