Reputation: 15
I have this piece of code:
StructType schema = new StructType(new StructField[] {
        DataTypes.createStructField("file_path", DataTypes.StringType, false),
        DataTypes.createStructField("file_content", DataTypes.createArrayType(DataTypes.StringType, false), false)
});
Dataset<Row> df = spark.createDataFrame(shinglesDocs.map(new Function<Tuple2<String, String[]>, Row>() {
    @Override
    public Row call(Tuple2<String, String[]> record) {
        // keep only the file name, not the full path
        return RowFactory.create(record._1().substring(record._1().lastIndexOf("/") + 1), record._2());
    }
}), schema);
df.show(true);
CountVectorizer vectorizer = new CountVectorizer()
        .setInputCol("file_content")
        .setOutputCol("feature_vector")
        .setBinary(true);
CountVectorizerModel cvm = vectorizer.fit(df);
Broadcast<Integer> vocabSize = sc.broadcast(cvm.vocabulary().length);
System.out.println("vocab size = " + cvm.vocabulary().length;
for (int i = 0; i < vocabSize.value(); i++) {
System.out.print(cvm.vocabulary()[i] + "(" + i + ") ");
}
System.out.println();
Dataset<Row> characteristicMatrix = cvm.transform(df);
characteristicMatrix.show(false);
cm (the characteristic matrix) contains [ column-for-document1, column-for-document2, column-for-document3 ],
where column-for-document1 looks like this: (1, 0, 1, 1, 0, 0, 1, 1).
I need to calculate the Jaccard similarity JS = a/(a+b+c), where a is the number of rows in which both columns have a 1, and b and c are the rows in which only the first or only the second column has a 1.
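For example (the second column here is made up purely for illustration): comparing column-for-document1 = (1, 0, 1, 1, 0, 0, 1, 1) with a column (1, 1, 0, 1, 0, 0, 1, 0) gives a = 3, b = 2, c = 1, so JS = 3/(3+2+1) = 0.5.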
But cm is big data, so it is partitioned across 3 different computers:
column-for-document1 is on one computer, column-for-document2 is on another, and column-for-document3 is on the third.
If the columns are all on different computers, how can you calculate the above?
I think I need to use cartesian for this, something like
cm.cartesian(cm)
but I'm not even sure where to begin, since cm is in a Dataset. I thought that maybe I could convert it into an array and then compare the indexes, but I've never worked with Datasets before, so I don't know how to do that or what the best strategy would be.
Please write your answer in Java Spark.
Upvotes: 0
Views: 220
Reputation: 81
This seems to be the ideal situation for the MinHash algorithm.
MinHash lets you take data that is spread across machines (such as your 3 computers) and, using a number of hash functions, estimate the similarity between the documents, i.e. the Jaccard similarity.
You can find an implementation of MinHash (MinHashLSH) in Spark ML; see the documentation here: http://spark.apache.org/docs/2.2.3/ml-features.html#minhash-for-jaccard-distance
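A minimal sketch of how that could look with your DataFrame (this reuses the characteristicMatrix and feature_vector column produced by your CountVectorizer; the 5 hash tables and the 1.0 threshold are arbitrary values chosen for illustration):

import org.apache.spark.ml.feature.MinHashLSH;
import org.apache.spark.ml.feature.MinHashLSHModel;

// Hash each document's binary shingle vector
// (MinHash requires at least one non-zero entry per vector)
MinHashLSH mh = new MinHashLSH()
        .setNumHashTables(5)                 // more tables = better estimate, more work
        .setInputCol("feature_vector")
        .setOutputCol("hashes");

MinHashLSHModel model = mh.fit(characteristicMatrix);

// Approximate similarity self-join; threshold 1.0 keeps every pair of documents
model.approxSimilarityJoin(characteristicMatrix, characteristicMatrix, 1.0, "JaccardDistance")
     .filter("datasetA.file_path < datasetB.file_path")     // drop self-pairs and mirrored duplicates
     .selectExpr("datasetA.file_path as doc1",
                 "datasetB.file_path as doc2",
                 "1 - JaccardDistance as JaccardSimilarity") // similarity = 1 - distance
     .show(false);

approxSimilarityJoin produces a JaccardDistance column, so the similarity you want is 1 minus that value; keep in mind that MinHash gives an estimate of JS = a/(a+b+c), not the exact value.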
Upvotes: 1