Reputation: 15
I have this piece of code:
StructType schema = new StructType(new StructField[] {
        DataTypes.createStructField("file_path", DataTypes.StringType, false),
        DataTypes.createStructField("file_content", DataTypes.createArrayType(DataTypes.StringType, false), false)
});
Dataset<Row> df = spark.createDataFrame(shinglesDocs.map(new Function<Tuple2<String, String[]>, Row>() {
    @Override
    public Row call(Tuple2<String, String[]> record) {
        // keep only the file name, not the full path
        return RowFactory.create(record._1().substring(record._1().lastIndexOf("/") + 1), record._2());
    }
}), schema);
df.show(true);
CountVectorizer vectorizer = new CountVectorizer()
        .setInputCol("file_content")
        .setOutputCol("feature_vector")
        .setBinary(true);
CountVectorizerModel cvm = vectorizer.fit(df);
Broadcast<Integer> vocabSize = sc.broadcast(cvm.vocabulary().length);
System.out.println("vocab size = " + cvm.vocabulary().length;
for (int i = 0; i < vocabSize.value(); i++) {
System.out.print(cvm.vocabulary()[i] + "(" + i + ") ");
}
System.out.println();
Dataset<Row> characteristicMatrix = cvm.transform(df);
characteristicMatrix.show(false);
cm (the characteristic matrix) contains [ column-for-document1, column-for-document2, column-for-document3 ],
where column-for-document1 looks like this: (1, 0, 1, 1, 0, 0, 1, 1).
I need to calculate the Jaccard similarity JS = a/(a+b+c), where a is the number of rows in which both columns have a 1, and b and c are the rows in which only the first or only the second column has a 1.
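For example (the second column here is made up purely for illustration): comparing column-for-document1 = (1, 0, 1, 1, 0, 0, 1, 1) with a column (1, 1, 0, 1, 0, 0, 1, 0) gives a = 3, b = 2, c = 1, so JS = 3/(3+2+1) = 0.5.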
But cm is big data, so it is partitioned across 3 different computers:
column-for-document1 is on one computer, column-for-document2 is on another, and column-for-document3 is on the third.
If the columns are all on different computers, how can you calculate the above?
I think I need to use cartesian for this, something like
cm.cartesian(cm)
but I'm not even sure where to begin, since cm is in a Dataset. I thought that maybe I could convert it into an array and then compare the indexes, but I've never worked with Datasets before, so I don't know how to do that or what the best strategy would be.
Please write your answer in Java Spark.
Upvotes: 0
Views: 220
Reputation: 81
This seems to be the ideal situation for the MinHash algorithm.
MinHash lets you take data that is spread across machines (such as your 3 computers) and, using a number of hash functions, estimate the similarity between the documents, i.e. the Jaccard similarity.
You can find an implementation of MinHash (MinHashLSH) in Spark ML; see the documentation here: http://spark.apache.org/docs/2.2.3/ml-features.html#minhash-for-jaccard-distance
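A minimal sketch of how that could look with your DataFrame (this reuses the characteristicMatrix and feature_vector column produced by your CountVectorizer; the 5 hash tables and the 1.0 threshold are arbitrary values chosen for illustration):

import org.apache.spark.ml.feature.MinHashLSH;
import org.apache.spark.ml.feature.MinHashLSHModel;

// Hash each document's binary shingle vector
// (MinHash requires at least one non-zero entry per vector)
MinHashLSH mh = new MinHashLSH()
        .setNumHashTables(5)                 // more tables = better estimate, more work
        .setInputCol("feature_vector")
        .setOutputCol("hashes");

MinHashLSHModel model = mh.fit(characteristicMatrix);

// Approximate similarity self-join; threshold 1.0 keeps every pair of documents
model.approxSimilarityJoin(characteristicMatrix, characteristicMatrix, 1.0, "JaccardDistance")
     .filter("datasetA.file_path < datasetB.file_path")     // drop self-pairs and mirrored duplicates
     .selectExpr("datasetA.file_path as doc1",
                 "datasetB.file_path as doc2",
                 "1 - JaccardDistance as JaccardSimilarity") // similarity = 1 - distance
     .show(false);

approxSimilarityJoin produces a JaccardDistance column, so the similarity you want is 1 minus that value; keep in mind that MinHash gives an estimate of JS = a/(a+b+c), not the exact value.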
Upvotes: 1