user2366149

Reputation:

Need suggestions regarding clustering in Mahout

I followed the Reuters dataset clustering example from the book "Mahout in Action" and tested it successfully. To understand clustering better, I tried the same sequence of steps to cluster some tweet data.

This is the sequence of commands I used:

mahout seqdirectory -c UTF-8 -i hdfs://-----:8020/user/hdfs/tweet/tweet.txt -o hdfs://-----:8020/user/hdfs/tweet/seqfiles

mahout seq2sparse -i hdfs://-----:8020/user/hdfs/tweet/seqfiles -o hdfs://----:8020/user/hdfs/tweet/vectors/ -ow -chunk 100 -x 90 -seq -ml 50 -n 2 -nv

mahout kmeans -i hdfs://---:8020/user/hdfs/tweet/vectors/tfidf-vectors/ -c kmeans-centroids -cl -o hdfs://-----:8020/user/hdfs/tweet/kmeans-clusters -k 3 -ow -x 3 -dm org.apache.mahout.common.distance.CosineDistanceMeasure

mahout clusterdump -i hdfs://----:8020/user/hdfs/tweet/kmeans-clusters/clusters-3-final -d hdfs://----:8020/user/hdfs/tweet/vectors/dictionary.file-0 -dt sequencefile -b 100 -n 10 --evaluate -dm org.apache.mahout.common.distance.CosineDistanceMeasure --pointsDir hdfs://-----:8020/user/hdfs/tweet/kmeans-clusters/clusteredPoints -o tweet_outdump.txt

tweet_outdump.txt file contains following data:

CL-0{n=1 c=[] r=[]}
Top Terms: 
Weight : [props - optional]: Point:
1.0: /tweet.txt =]
Inter-Cluster Density: NaN
Intra-Cluster Density: 0.0
CDbw Inter-Cluster Density: 0.0
CDbw Intra-Cluster Density: NaN
CDbw Separation: 0.0

I also tried this command:

mahout seqdumper -i hdfs://----:8020/user/hdfs/tweet/kmeans-clusters/clusteredPoints/part-m-00000

Key: 0: Value: 1.0: /tweet.txt =]
Count: 1

I would really appreciate some feedback here. Thanks in advance.

Upvotes: 0

Views: 570

Answers (2)

user2386829

Reputation: 1

Regarding your case, it seems that your data set consists of only one large file, where each line of the file represents a document (here, a tweet). In this case the seqdirectory command will generate a sequence file that contains just one key/value pair for the whole file, which is not what you want, as I understood from your post. So you should first write a simple MapReduce job that takes your data set and assigns an id to each line. You can use the line offset as the id (key) and the line itself as the value. You also have to set the output format to SequenceFileOutputFormat, the output key has to be Text, and the value is a UTF-8 encoded string wrapped in a Text object. Here is a simple MapReduce job that does this:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;

public class TexToHadoopSeq {

    // Mapper: emits the byte offset of each line as a Text key (the document id)
    // and the line itself, UTF-8 encoded, as a Text value.
    public static class mapper extends MapReduceBase implements
            Mapper<LongWritable, Text, Text, Text> {

        Text cle = new Text();    // key
        Text valeur = new Text(); // value

        @Override
        public void map(LongWritable key, Text values,
                OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {

            String record = values.toString();
            byte[] b = record.getBytes("UTF-8");

            valeur.set(b);           // the line as a UTF-8 string wrapped in Text
            cle.set(key.toString()); // the line offset used as the id
            output.collect(cle, valeur);
        }
    }

    // Reducer: identity reducer, simply writes each (id, line) pair
    // to the SequenceFile output.
    public static class Reduce1 extends MapReduceBase implements
            Reducer<Text, Text, Text, Text> {

        @Override
        public void reduce(Text key, Iterator<Text> values,
                OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            while (values.hasNext()) {
                output.collect(key, values.next());
            }
        }
    }

    public static void main(String[] args) throws IOException {

        String inputdata = args[0];

        JobConf conf1 = new JobConf(TexToHadoopSeq.class);

        FileInputFormat.setInputPaths(conf1, new Path(inputdata)); // input data set
        FileOutputFormat.setOutputPath(conf1, new Path("output")); // job output

        conf1.setJarByClass(TexToHadoopSeq.class);
        conf1.setMapperClass(mapper.class);
        conf1.setReducerClass(Reduce1.class);

        conf1.setNumReduceTasks(1);

        conf1.setMapOutputKeyClass(Text.class);
        conf1.setMapOutputValueClass(Text.class);

        conf1.setOutputKeyClass(Text.class);
        conf1.setOutputValueClass(Text.class);
        conf1.setInputFormat(TextInputFormat.class);
        conf1.setOutputFormat(SequenceFileOutputFormat.class);

        // runJob submits the job and blocks until it completes
        RunningJob job = JobClient.runJob(conf1);

        System.out.println("*****Conversion is Done*****");
    }
}
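Assuming the class is compiled and packaged into a jar (the jar name and namenode placeholder below are mine, not from the original answer), the conversion could be run roughly like this:

hadoop jar text-to-seq.jar TexToHadoopSeq hdfs://<namenode>:8020/user/hdfs/tweet/tweet.txt

The job writes its SequenceFile to an "output" directory in HDFS, which you then feed to seq2sparse.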

Now, the next step is to create vectors from the sequence file generated by the above code, so use: ./mahout seq2sparse -i "directory of your sequence file in HDFS" -o "output" --maxDFPercent 85 --namedVector

Then you'll get the TFIDF directory, and you can continue with k-means or whatever Mahout clustering algorithm you prefer (a sketch of the k-means step follows below). That's it.
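For completeness, a k-means run on the resulting vectors could look roughly like the following. This is only a sketch based on the commands already shown in the question; the paths, the number of clusters (-k), and the maximum number of iterations (-x) are placeholders to adapt:

mahout kmeans -i output/tfidf-vectors -c kmeans-centroids -o kmeans-clusters -k 3 -x 10 -cl -ow -dm org.apache.mahout.common.distance.CosineDistanceMeasure

With more than one document in the input, clusterdump should then report meaningful top terms and non-trivial density values.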

Upvotes: 0

Has QUIT--Anony-Mousse

Reputation: 77454

You created a data set that consists of a single document only.

Obviously, the clustering results are not meaningful. There is no "inter-cluster distance" (because there is only one cluster), and the intra-cluster distance is 0, because there is only one object, which has a trivial distance of 0 to itself.

So things already went wrong at the seqdirectory command: you passed a single file, not a directory with one file per document.
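If you would rather keep using seqdirectory than write a MapReduce job, one possible fix (a sketch, not part of the original answer; file and directory names are placeholders) is to split the single tweets file into a directory with one file per line before running seqdirectory:

mkdir tweets
split -l 1 tweet.txt tweets/tweet-
hadoop fs -put tweets hdfs://<namenode>:8020/user/hdfs/tweet/tweets

seqdirectory will then treat each generated file as a separate document, and the clustering output becomes meaningful.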

Upvotes: 1
