Reputation:
I followed the Reuters data set clustering example from the book "Mahout in Action" and tested it successfully. To understand more about clustering, I tried the same sequence of steps to cluster some tweet data.
This is the sequence of commands I used:
mahout seqdirectory -c UTF-8 -i hdfs://-----:8020/user/hdfs/tweet/tweet.txt -o hdfs://-----:8020/user/hdfs/tweet/seqfiles
mahout seq2sparse -i hdfs://-----:8020/user/hdfs/tweet/seqfiles -o hdfs://----:8020/user/hdfs/tweet/vectors/ -ow -chunk 100 -x 90 -seq -ml 50 -n 2 -nv
mahout kmeans -i hdfs://---:8020/user/hdfs/tweet/vectors/tfidf-vectors/ -c kmeans-centroids -cl -o hdfs://-----:8020/user/hdfs/tweet/kmeans-clusters -k 3 -ow -x 3 -dm org.apache.mahout.common.distance.CosineDistanceMeasure
mahout clusterdump -i hdfs://----:8020/user/hdfs/tweet/kmeans-clusters/clusters-3-final -d hdfs://----:8020/user/hdfs/tweet/vectors/dictionary.file-0 -dt sequencefile -b 100 -n 10 --evaluate -dm org.apache.mahout.common.distance.CosineDistanceMeasure --pointsDir hdfs://-----:8020/user/hdfs/tweet/kmeans-clusters/clusteredPoints -o tweet_outdump.txt
The tweet_outdump.txt file contains the following data:
CL-0{n=1 c=[] r=[]}
Top Terms:
Weight : [props - optional]: Point:
1.0: /tweet.txt =]
Inter-Cluster Density: NaN
Intra-Cluster Density: 0.0
CDbw Inter-Cluster Density: 0.0
CDbw Intra-Cluster Density: NaN
CDbw Separation: 0.0
I even tried this command:
mahout seqdumper -i hdfs://----:8020/user/hdfs/tweet/kmeans-clusters/clusteredPoints/part-m-00000
Key: 0: Value: 1.0: /tweet.txt =]
Count: 1
I would really appreciate some feedback here. Thanks in advance
Upvotes: 0
Views: 570
Reputation: 1
Regarding your case, it seems that your data set consists of only one large file, where each line of the file represents one document. In this case the seqdirectory command will generate a sequence file that contains just one entry (the whole file as a single document), which is not appropriate, as I understood from your post. So you should first write a simple MapReduce job that takes your data set and assigns an id to each line; you can use the line offset as the id (key), and the value is the line itself. You also have to set the output format to a sequence file format, your output key has to be Text, and your value a UTF-8 encoded string wrapped in a Text object. Here is a simple MapReduce program that does this:
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;

public class TexToHadoopSeq {

    // Mapper: emits the line offset (as Text) as the key and the line itself as the value
    public static class mapper extends MapReduceBase implements
            Mapper<LongWritable, Text, Text, Text> {
        Text cle = new Text();
        Text valeur = new Text();

        @Override
        public void map(LongWritable key, Text values,
                OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            String record = values.toString();
            byte[] b = record.getBytes("UTF-8");
            valeur.set(b);
            cle.set(key.toString());
            output.collect(cle, valeur);
        }
    }

    // Reducer: identity reducer, writes each (id, line) pair to the sequence file
    public static class Reduce1 extends MapReduceBase implements
            Reducer<Text, Text, Text, Text> {
        @Override
        public void reduce(Text key, Iterator<Text> values,
                OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            while (values.hasNext()) {
                output.collect(key, values.next());
            }
        }
    }

    public static void main(String[] args) throws IOException {
        String inputdata = args[0]; // input data set in HDFS

        JobConf conf1 = new JobConf(TexToHadoopSeq.class);
        FileInputFormat.setInputPaths(conf1, new Path(inputdata));
        FileOutputFormat.setOutputPath(conf1, new Path("output")); // job output directory
        conf1.setMapperClass(mapper.class);
        conf1.setReducerClass(Reduce1.class);
        conf1.setNumReduceTasks(1);
        conf1.setMapOutputKeyClass(Text.class);
        conf1.setMapOutputValueClass(Text.class);
        conf1.setOutputKeyClass(Text.class);
        conf1.setOutputValueClass(Text.class);
        conf1.setInputFormat(TextInputFormat.class);
        conf1.setOutputFormat(SequenceFileOutputFormat.class);

        // runJob submits the job and blocks until it finishes
        JobClient.runJob(conf1);

        System.out.println("*****Conversion is Done*****");
    }
}
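To run it, compile and package the class into a jar (the jar name below is just a placeholder) and pass your input file in HDFS as the first argument; the job writes its sequence file to an "output" directory in HDFS:
hadoop jar texttohadoopseq.jar TexToHadoopSeq /user/hdfs/tweet/tweet.txt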
Now, the next step is to create vectors from your sequence file (generated by the above code):
./mahout seq2sparse -i "Directory of your sequential file in HDFS" -o "output" --maxDFPercent 85 --namedVector
Then you'll get the TFIDF vectors directory, and you can continue with k-means or whatever Mahout clustering algorithm you want.
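For example, reusing the k-means parameters from the question (the -i and -o paths here are placeholders for your own HDFS paths):
mahout kmeans -i output/tfidf-vectors -c kmeans-centroids -cl -o kmeans-clusters -k 3 -ow -x 3 -dm org.apache.mahout.common.distance.CosineDistanceMeasure
That's it.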
Upvotes: 0
Reputation: 77454
You created a data set that consists of a single document only.
Obviously, clustering results are not meaningful. There is no "inter cluster distance" (because there is only one cluster). And the in-cluster distance is 0, because there is only one object, and it has a trivial distance of 0 to itself.
So you already failed at the seqdirectory command: you passed a single file, not a directory with one file per document.
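If you want to stay with seqdirectory, one way (paths are illustrative) is to split the tweet file into one file per line on your local machine, put that directory into HDFS, and point seqdirectory at the directory instead of the file:
split -l 1 tweet.txt tweet-
hadoop fs -mkdir /user/hdfs/tweet/docs
hadoop fs -put tweet-* /user/hdfs/tweet/docs/
mahout seqdirectory -c UTF-8 -i /user/hdfs/tweet/docs -o /user/hdfs/tweet/seqfiles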
Upvotes: 1