Reputation: 15
I have text files like the ones shown below.
Example:
file 1:
yamaha
gladiator
bike
file 2:
bajaj
pulsar
bike
file 3:
yamaha
gladiator
india
I have to read these files individually and create clusters. That is, in the example above, file 1 and file 3 are similar and would form one cluster. I want at least a single word to match between two files for them to go into the same cluster. So in the end I should get two clusters from the example above: 1: yamaha and 2: bajaj. Please help me with this.
Upvotes: 1
Views: 977
Reputation: 5115
Sounds like you simply need to read each file into a Set<String> of words and then look for intersections to build your clusters. That could be achieved, for example, by building a map of words to a count of occurrences (Map<String, Integer>) or a map of words to a set of filenames (Map<String, Set<String>>).
Not sure where your second example cluster comes from, as "bajaj" only exists in file 2.
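As a rough illustration of the "look for intersections" idea, here is a minimal sketch. The class name SharedWords and its two helper methods are made up for this example, and the word sets are hard-coded from the question instead of being read from disk.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class, not part of any library.
public class SharedWords {

    // True when the two word sets have at least one word in common.
    static boolean shareAWord(Set<String> a, Set<String> b) {
        return !Collections.disjoint(a, b);
    }

    // Words present in both sets; works on a copy so the inputs are not modified.
    static Set<String> commonWords(Set<String> a, Set<String> b) {
        Set<String> common = new HashSet<String>(a);
        common.retainAll(b);
        return common;
    }

    public static void main(String[] args) {
        // Sample data copied from the question.
        Set<String> file1 = new HashSet<String>(Arrays.asList("yamaha", "gladiator", "bike"));
        Set<String> file2 = new HashSet<String>(Arrays.asList("bajaj", "pulsar", "bike"));
        Set<String> file3 = new HashSet<String>(Arrays.asList("yamaha", "gladiator", "india"));

        System.out.println(shareAWord(file1, file3)); // true
        System.out.println(shareAWord(file2, file3)); // false
        System.out.println(shareAWord(file1, file2)); // true, because of "bike"
        System.out.println(commonWords(file1, file3));
    }
}
```

Note that in the sample data file 1 and file 2 both contain "bike", so a strict "one shared word" rule links them as well.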
EDIT: based on request to explain how Maps and Sets work
Instantiating a Map that maps strings (the word) to a set of filenames:
Map<String, Set<String>> wordsToFilenames = new HashMap<String, Set<String>>();
Adding a word found in a file to this map (assume we've read a word from the file into the word variable and have the filename in a filename variable, both Strings):
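    Set<String> filenamesForWord;
    if (wordsToFilenames.containsKey(word)) {
        // The word has been seen before: reuse its existing set of filenames.
        filenamesForWord = wordsToFilenames.get(word);
    } else {
        // First occurrence of the word: create its set and register it in the map.
        filenamesForWord = new HashSet<String>();
        wordsToFilenames.put(word, filenamesForWord);
    }
    filenamesForWord.add(filename);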
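Putting the pieces together, one possible way to get from that map to actual clusters is sketched below. This is only a sketch under some assumptions: the class WordClusters, its methods and the ignoredWords parameter are invented for illustration, the words are assumed to be already read into memory, and "cluster" is taken to mean "files connected through any chain of shared words".

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical class, written only to illustrate the approach.
public class WordClusters {

    // Builds the word -> filenames map, skipping any ignored words.
    static Map<String, Set<String>> indexWords(Map<String, List<String>> wordsByFile,
                                               Set<String> ignoredWords) {
        Map<String, Set<String>> wordsToFilenames = new HashMap<String, Set<String>>();
        for (Map.Entry<String, List<String>> entry : wordsByFile.entrySet()) {
            for (String word : entry.getValue()) {
                if (ignoredWords.contains(word)) {
                    continue;
                }
                wordsToFilenames.computeIfAbsent(word, k -> new HashSet<String>())
                                .add(entry.getKey());
            }
        }
        return wordsToFilenames;
    }

    // Merges files that share a non-ignored word into the same cluster.
    static Set<Set<String>> cluster(Map<String, List<String>> wordsByFile,
                                    Set<String> ignoredWords) {
        // Every file starts out in a cluster of its own.
        Map<String, Set<String>> clusterOf = new HashMap<String, Set<String>>();
        for (String filename : wordsByFile.keySet()) {
            Set<String> single = new TreeSet<String>();
            single.add(filename);
            clusterOf.put(filename, single);
        }
        // All files listed under one word belong together, so merge their clusters.
        for (Set<String> files : indexWords(wordsByFile, ignoredWords).values()) {
            Iterator<String> it = files.iterator();
            Set<String> merged = clusterOf.get(it.next());
            while (it.hasNext()) {
                Set<String> other = clusterOf.get(it.next());
                if (other != merged) {
                    merged.addAll(other);
                    for (String filename : other) {
                        clusterOf.put(filename, merged);
                    }
                }
            }
        }
        // Several filenames point at the same cluster; keep each cluster once.
        return new HashSet<Set<String>>(clusterOf.values());
    }

    public static void main(String[] args) {
        // Sample data copied from the question.
        Map<String, List<String>> files = new HashMap<String, List<String>>();
        files.put("file1", Arrays.asList("yamaha", "gladiator", "bike"));
        files.put("file2", Arrays.asList("bajaj", "pulsar", "bike"));
        files.put("file3", Arrays.asList("yamaha", "gladiator", "india"));

        // Nothing ignored: "bike" links file1 and file2, so all three files merge.
        System.out.println(cluster(files, new HashSet<String>()));
        // Ignoring "bike" separates file2 from file1 and file3.
        System.out.println(cluster(files, new HashSet<String>(Arrays.asList("bike"))));
    }
}
```

With the sample files, the two clusters the question expects only come out if a very common word such as "bike" is left out of the comparison, which is why the sketch takes a set of ignored words. Whether that fits the real data is something only the asker can decide.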
Upvotes: 1
Reputation: 2303
You can look at the naïve Bayes classifier, which does quite well in document classification. For other algorithms, try googling "text classification algorithm".
Upvotes: 0