Sachin Raj

Reputation: 15

text files clustering

I have text files like the ones shown below.

Example:

file 1:

       yamaha
       gladiator 
       bike  

file 2:

       bajaj 
       pulsar
       bike

file 3:

       yamaha 
       gladiator
       india

I have to read these files individually and create clusters. That is, in the example above, file 1 and file 3 are similar and should form one cluster. I want at least one word to match between two files for them to be put in the same cluster. So in the end I should get two clusters from the example above: 1: yamaha and 2: bajaj. Please help me with this.

Upvotes: 1

Views: 977

Answers (2)

John Pickup

Reputation: 5115

Sounds like you simply need to read each file into a Set<String> of words and then look for intersections to build your clusters. That could be achieved, for example, by building a map of words to a count of occurrences (Map<String, Integer>) or a map of words to a set of filenames (Map<String, Set<String>>).
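As a rough sketch of the "read into a set, then intersect" idea (assuming Java 8 or later, whitespace-separated words, UTF-8 files, and case-insensitive matching; the class and method names here are made up for illustration):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

public class WordSets {
    // Split text on whitespace and collect the lower-cased words into a set.
    static Set<String> toWordSet(String text) {
        Set<String> words = new HashSet<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                words.add(token.toLowerCase());
            }
        }
        return words;
    }

    // Read a whole file (assumed to be UTF-8) and return its set of words.
    static Set<String> readWords(Path file) throws IOException {
        return toWordSet(new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
    }

    // Words present in both sets; the input sets are left unmodified.
    static Set<String> commonWords(Set<String> a, Set<String> b) {
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return common;
    }
}
```

Two files would then be candidates for the same cluster whenever commonWords(...) on their word sets is non-empty.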

Not sure where your second example cluster comes from, as "bajaj" only exists in file 2.

EDIT: based on request to explain how Maps and Sets work

Instantiating a Map that maps strings (the word) to a set of filenames:

Map<String, Set<String>> wordsToFilenames = new HashMap<String, Set<String>>();

Adding a word found in a file to this map (assume we've read a word from the file into the word variable and have the filename in the filename variable, both Strings):

Set<String> filenamesForWord;

if (wordsToFilenames.containsKey(word)) {
    filenamesForWord = wordsToFilenames.get(word);
}
else {
    filenamesForWord = new HashSet<String>();
    wordsToFilenames.put(word, filenamesForWord);
}

filenamesForWord.add(filename);
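Once the map is filled for all files, one possible way to turn it into clusters is to treat every word's set of filenames as "these files belong together" and merge the groups that overlap. The following is only a sketch under that assumption (Java 8 or later; the FileClusters class and cluster method are hypothetical names, not part of any library):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class FileClusters {
    // Group filenames so that files sharing at least one word, directly or
    // through a chain of other files, end up in the same cluster.
    static List<Set<String>> cluster(Map<String, Set<String>> wordsToFilenames) {
        // For every file, collect the files that share some word with it.
        Map<String, Set<String>> neighbours = new TreeMap<>();
        for (Set<String> files : wordsToFilenames.values()) {
            for (String file : files) {
                neighbours.computeIfAbsent(file, k -> new TreeSet<>()).addAll(files);
            }
        }

        // Breadth-first search over the "shares a word" links.
        List<Set<String>> clusters = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        for (String start : neighbours.keySet()) {
            if (!visited.add(start)) {
                continue; // already assigned to an earlier cluster
            }
            Set<String> cluster = new TreeSet<>();
            Deque<String> queue = new ArrayDeque<>();
            queue.add(start);
            while (!queue.isEmpty()) {
                String file = queue.poll();
                cluster.add(file);
                for (String next : neighbours.get(file)) {
                    if (visited.add(next)) {
                        queue.add(next);
                    }
                }
            }
            clusters.add(cluster);
        }
        return clusters;
    }
}
```

Note that with the three files from the question, file 1 and file 2 both contain "bike", so a literal "at least one common word" rule puts all three files into a single cluster. To get separate yamaha and bajaj clusters you would probably have to skip very common words such as "bike" when filling the map, which is a design decision the question doesn't spell out.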

Upvotes: 1

krookedking

Reputation: 2303

You can look at the naïve Bayes classifier, which does quite well at document classification. For other algorithms, try searching for "text classification algorithm".

Upvotes: 0
