Text files clustering

Question

I have text files as shown below.

Example:

file 1:

       yamaha
       gladiator 
       bike

file 2:

       bajaj 
       pulsar
       bike

file 3:

       yamaha 
       gladiator
       india

I have to read these files individually and create clusters. That is, in the example above, file 1 and file 3 are similar and will form one cluster. I want at least a single word to match between two files for them to form a cluster. So in the end I should get two clusters from the example above: 1: yamaha and 2: bajaj. Please help me with this.

John Pickup · Accepted Answer

Sounds like you simply need to read each file into a Set of words and then look for intersections to build your clusters. That could be achieved, for example, by building a map of words to a count of occurrences (Map<String, Integer>) or a map of words to a set of filenames (Map<String, Set<String>>).
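As a rough illustration of the intersection idea, here is a minimal sketch. It assumes the words of each file have already been read into a Set<String>; the class name SharedWords and the method sharedWords are made up for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class SharedWords {
    // Returns the words present in both sets; the inputs are left unmodified.
    static Set<String> sharedWords(Set<String> a, Set<String> b) {
        Set<String> common = new HashSet<>(a); // copy, because retainAll mutates
        common.retainAll(b);                   // keep only the words also in b
        return common;
    }

    public static void main(String[] args) {
        // Hard-coded stand-ins for the contents of the three example files
        Set<String> file1 = new HashSet<>(Arrays.asList("yamaha", "gladiator", "bike"));
        Set<String> file2 = new HashSet<>(Arrays.asList("bajaj", "pulsar", "bike"));
        Set<String> file3 = new HashSet<>(Arrays.asList("yamaha", "gladiator", "india"));

        // file 1 and file 3 share "yamaha" and "gladiator" (print order not guaranteed)
        System.out.println(sharedWords(file1, file3));
        // file 2 and file 3 have no word in common, so the result is empty
        System.out.println(sharedWords(file2, file3));
    }
}
```

A non-empty result means the two files share at least one word and are candidates for the same cluster.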

Not sure where your second example cluster comes from, as "bajaj" only exists in file 2.

EDIT: based on request to explain how Maps and Sets work

Instantiating a Map that maps strings (the word) to a set of filenames:

Map<String, Set<String>> wordsToFilenames = new HashMap<String, Set<String>>();

Adding a word found in a filename to this (assume we've read in a word from the file into the word variable and have the filename in a filename variable, both Strings):

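Set<String> filenamesForWord;

if (wordsToFilenames.containsKey(word)) {
    // Word seen before: reuse its existing set of filenames
    filenamesForWord = wordsToFilenames.get(word);
} else {
    // First occurrence of this word: create and register a new set
    filenamesForWord = new HashSet<String>();
    wordsToFilenames.put(word, filenamesForWord);
}

filenamesForWord.add(filename);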
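To show how the pieces might fit together, here is a small self-contained sketch that builds the same word-to-filenames map for the three example files and then lists the words that occur in more than one file. The class WordIndex and the methods buildIndex and sharedOnly are hypothetical names for this example, the file contents are hard-coded rather than read from disk, and computeIfAbsent (Java 8+) stands in for the containsKey/get/put sequence above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class WordIndex {
    // Builds the word -> filenames map from already-tokenized file contents.
    static Map<String, Set<String>> buildIndex(Map<String, List<String>> wordsByFile) {
        Map<String, Set<String>> wordsToFilenames = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : wordsByFile.entrySet()) {
            for (String word : entry.getValue()) {
                // Create the set on first sight of the word, then add the filename
                wordsToFilenames.computeIfAbsent(word, k -> new TreeSet<>())
                                .add(entry.getKey());
            }
        }
        return wordsToFilenames;
    }

    // Keeps only the words that occur in at least two files.
    static Map<String, Set<String>> sharedOnly(Map<String, Set<String>> index) {
        Map<String, Set<String>> shared = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : index.entrySet()) {
            if (e.getValue().size() > 1) {
                shared.put(e.getKey(), e.getValue());
            }
        }
        return shared;
    }

    public static void main(String[] args) {
        // Stand-in for reading the three example files from disk
        Map<String, List<String>> wordsByFile = new LinkedHashMap<>();
        wordsByFile.put("file1", Arrays.asList("yamaha", "gladiator", "bike"));
        wordsByFile.put("file2", Arrays.asList("bajaj", "pulsar", "bike"));
        wordsByFile.put("file3", Arrays.asList("yamaha", "gladiator", "india"));

        System.out.println(sharedOnly(buildIndex(wordsByFile)));
        // prints {bike=[file1, file2], gladiator=[file1, file3], yamaha=[file1, file3]}
    }
}
```

Each entry in the output is a word together with the files that contain it, which is the raw material for forming clusters. Note that in the sample data "bike" is shared by file 1 and file 2, so a strict single-shared-word rule would link those two files as well.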
