lives

Reputation: 1185

Apache Spark text similarity

I am trying the below example in Java:

Efficient string matching in Apache Spark

This is my code

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.Transformer;
import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.MinHashLSH;
import org.apache.spark.ml.feature.MinHashLSHModel;
import org.apache.spark.ml.feature.NGram;
import org.apache.spark.ml.feature.RegexTokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class App {
    public static void main(String[] args) {
        System.out.println("Hello World!");

        System.setProperty("hadoop.home.dir", "D:\\del");

        List<MyRecord> firstRow = new ArrayList<MyRecord>();
        firstRow.add(new App().new MyRecord("1", "Love is blind"));

        List<MyRecord> secondRow = new ArrayList<MyRecord>();
        secondRow.add(new App().new MyRecord("1", "Luv is blind"));

        SparkSession spark = SparkSession.builder().appName("LSHExample").config("spark.master", "local")
                .getOrCreate();

        Dataset<Row> firstDataFrame = spark.createDataFrame(firstRow, MyRecord.class);
        Dataset<Row> secondDataFrame = spark.createDataFrame(secondRow, MyRecord.class);

        firstDataFrame.show(20, false);
        secondDataFrame.show(20, false);

        RegexTokenizer regexTokenizer = new RegexTokenizer().setInputCol("text").setOutputCol("words")
                .setPattern("\\W");
        NGram ngramTransformer = new NGram().setN(3).setInputCol("words").setOutputCol("ngrams");
        HashingTF hashingTF = new HashingTF().setInputCol("ngrams").setOutputCol("vectors");
        MinHashLSH minHashLSH = new MinHashLSH().setInputCol("vectors").setOutputCol("lsh");

        Pipeline pipeline = new Pipeline()
                .setStages(new PipelineStage[] { regexTokenizer, ngramTransformer, hashingTF, minHashLSH });

        PipelineModel model = pipeline.fit(firstDataFrame);

        Dataset<Row> dataset1 = model.transform(firstDataFrame);
        dataset1.show(20, false);

        Dataset<Row> dataset2 = model.transform(secondDataFrame);
        dataset2.show(20, false);

        Transformer[] transformers = model.stages();
        MinHashLSHModel temp = (MinHashLSHModel) transformers[transformers.length - 1];
        temp.approxSimilarityJoin(dataset1, dataset2, 0.01).show(20,false);

    }

    protected class MyRecord {
        private String id;
        private String text;

        private MyRecord(String id, String text) {
            this.id = id;
            this.text = text;
        }

        public String getId() {
            return id;
        }

        public String getText() {
            return text;
        }

    }

}

Before invoking approxSimilarityJoin, the two datasets look like this.

Transformed Dataset A

+---+-------------+-----------------+---------------+-----------------------+----------------+
|id |text         |words            |ngrams         |vectors                |lsh             |
+---+-------------+-----------------+---------------+-----------------------+----------------+
|1  |Love is blind|[love, is, blind]|[love is blind]|(262144,[243005],[1.0])|[[2.02034596E9]]|
+---+-------------+-----------------+---------------+-----------------------+----------------+

Transformed Dataset B

+---+------------+----------------+--------------+----------------------+----------------+
|id |text        |words           |ngrams        |vectors               |lsh             |
+---+------------+----------------+--------------+----------------------+----------------+
|2  |Luv is blind|[luv, is, blind]|[luv is blind]|(262144,[57733],[1.0])|[[7.79808048E8]]|
+---+------------+----------------+--------------+----------------------+----------------+

Though the two texts "Love is blind" and "Luv is blind" are nearly identical, I get the empty output below.

+--------+--------+-------+
|datasetA|datasetB|distCol|
+--------+--------+-------+
+--------+--------+-------+

Please let me know if there is any mistake in the above code.

I tested by giving the same input for both datasets, and below is the output. The distCol is zero when both datasets have the same text.

+--------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------+-------+
|datasetA                                                                                                                        |datasetB                                                                                                                        |distCol|
+--------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------+-------+
|[1,Love is blind,WrappedArray(love, is, blind),WrappedArray(love is blind),(262144,[243005],[1.0]),WrappedArray([2.02034596E9])]|[2,Love is blind,WrappedArray(love, is, blind),WrappedArray(love is blind),(262144,[243005],[1.0]),WrappedArray([2.02034596E9])]|0.0    |
+--------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------+-------+

The below example also uses the same concept.

https://databricks.com/blog/2017/05/09/detecting-abuse-scale-locality-sensitive-hashing-uber-engineering.html

I think I am missing some fundamental aspect of this program. Please advise.


It worked based on the suggestions given by user8371915.

I removed the NGram stage and increased numHashTables:

MinHashLSH minHashLSH = new MinHashLSH().setInputCol("features").setOutputCol("hashValues").setNumHashTables(20);
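
For context, the rest of the pipeline then looks roughly like this (reconstructed from the output columns shown below, not a verbatim copy of my code):

RegexTokenizer regexTokenizer = new RegexTokenizer()
        .setInputCol("text")
        .setOutputCol("words")
        .setPattern("\\W");

// HashingTF now writes to "features", the input column of the MinHashLSH stage above
HashingTF hashingTF = new HashingTF()
        .setInputCol("words")
        .setOutputCol("features");

Pipeline pipeline = new Pipeline()
        .setStages(new PipelineStage[] { regexTokenizer, hashingTF, minHashLSH });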

Now I am able to see how this matching works.

Below are my two datasets.

Dataset A

+---+-------------+
|id |text         |
+---+-------------+
|1  |Love is blind|
+---+-------------+

Dataset B

+---+-------------------------+
|id |text                     |
+---+-------------------------+
|1  |Love is blind            |
|2  |Luv is blind             |
|3  |Lov is blind             |
|4  |This is totally different|
|5  |God is love              |
|6  |blind love is divine     |
+---+-------------------------+

and the final output is below

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
|datasetA                                                                                                                                                                                                                                                                                                                                                                                                                                                             |datasetB                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |distCol|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
|[1,Love is blind,WrappedArray(love, is, blind),(262144,[15889,48831,186480],[1.0,1.0,1.0]),WrappedArray([-1.06555007E9], [-1.557513224E9], [-1.590526534E9], [-2.83593282E8], [-1.815588486E9], [-1.411500923E9], [-6.93210471E8], [-8.0411681E7], [-1.713286948E9], [-1.698342316E9], [-9.33829921E8], [-1.240833109E9], [-1.48476096E8], [-1.882820721E9], [-7.50906814E8], [1.99715481E8], [-1.997519365E9], [-1.380314819E9], [-5.92484283E8], [-1.869738298E9])]|[1,Love is blind,WrappedArray(love, is, blind),(262144,[15889,48831,186480],[1.0,1.0,1.0]),WrappedArray([-1.06555007E9], [-1.557513224E9], [-1.590526534E9], [-2.83593282E8], [-1.815588486E9], [-1.411500923E9], [-6.93210471E8], [-8.0411681E7], [-1.713286948E9], [-1.698342316E9], [-9.33829921E8], [-1.240833109E9], [-1.48476096E8], [-1.882820721E9], [-7.50906814E8], [1.99715481E8], [-1.997519365E9], [-1.380314819E9], [-5.92484283E8], [-1.869738298E9])]                            |0.0    |
|[1,Love is blind,WrappedArray(love, is, blind),(262144,[15889,48831,186480],[1.0,1.0,1.0]),WrappedArray([-1.06555007E9], [-1.557513224E9], [-1.590526534E9], [-2.83593282E8], [-1.815588486E9], [-1.411500923E9], [-6.93210471E8], [-8.0411681E7], [-1.713286948E9], [-1.698342316E9], [-9.33829921E8], [-1.240833109E9], [-1.48476096E8], [-1.882820721E9], [-7.50906814E8], [1.99715481E8], [-1.997519365E9], [-1.380314819E9], [-5.92484283E8], [-1.869738298E9])]|[2,Luv is blind,WrappedArray(luv, is, blind),(262144,[15889,48831,84987],[1.0,1.0,1.0]),WrappedArray([-2.021501434E9], [-1.557513224E9], [-1.590526534E9], [-2.83593282E8], [-1.815588486E9], [-6.70773282E8], [-6.93210471E8], [-1.205754635E9], [-1.713286948E9], [-1.698342316E9], [-9.33829921E8], [-1.240833109E9], [4.46435174E8], [-1.882820721E9], [-7.50906814E8], [1.99715481E8], [-1.036250081E9], [-1.380314819E9], [-5.92484283E8], [-1.869738298E9])]                              |0.5    |
|[1,Love is blind,WrappedArray(love, is, blind),(262144,[15889,48831,186480],[1.0,1.0,1.0]),WrappedArray([-1.06555007E9], [-1.557513224E9], [-1.590526534E9], [-2.83593282E8], [-1.815588486E9], [-1.411500923E9], [-6.93210471E8], [-8.0411681E7], [-1.713286948E9], [-1.698342316E9], [-9.33829921E8], [-1.240833109E9], [-1.48476096E8], [-1.882820721E9], [-7.50906814E8], [1.99715481E8], [-1.997519365E9], [-1.380314819E9], [-5.92484283E8], [-1.869738298E9])]|[5,God is love,WrappedArray(god, is, love),(262144,[15889,57304,186480],[1.0,1.0,1.0]),WrappedArray([-7.6253133E7], [-2.6669178E7], [-1.590526534E9], [-2.83593282E8], [-1.060055906E9], [-1.411500923E9], [-9.83191394E8], [-8.0411681E7], [-1.04032919E9], [-1.373403353E9], [-5.63413619E8], [-1.240833109E9], [-1.48476096E8], [-1.7390215E9], [-1.745820849E9], [8.1559665E7], [-1.997519365E9], [-1.635066748E9], [6.38995945E8], [-1.59718287E9])]                                        |0.5    |
|[1,Love is blind,WrappedArray(love, is, blind),(262144,[15889,48831,186480],[1.0,1.0,1.0]),WrappedArray([-1.06555007E9], [-1.557513224E9], [-1.590526534E9], [-2.83593282E8], [-1.815588486E9], [-1.411500923E9], [-6.93210471E8], [-8.0411681E7], [-1.713286948E9], [-1.698342316E9], [-9.33829921E8], [-1.240833109E9], [-1.48476096E8], [-1.882820721E9], [-7.50906814E8], [1.99715481E8], [-1.997519365E9], [-1.380314819E9], [-5.92484283E8], [-1.869738298E9])]|[6,blind love is divine,WrappedArray(blind, love, is, divine),(262144,[15889,25596,48831,186480],[1.0,1.0,1.0,1.0]),WrappedArray([-1.06555007E9], [-1.557513224E9], [-1.590526534E9], [-1.627956291E9], [-1.815588486E9], [-1.411500923E9], [-6.93210471E8], [-8.0411681E7], [-1.713286948E9], [-1.698342316E9], [-9.33829921E8], [-1.240833109E9], [-1.93451596E9], [-1.882820721E9], [-7.50906814E8], [-1.152091375E9], [-1.997519365E9], [-1.380314819E9], [-8.50494401E8], [-1.869738298E9])]|0.25   |
|[1,Love is blind,WrappedArray(love, is, blind),(262144,[15889,48831,186480],[1.0,1.0,1.0]),WrappedArray([-1.06555007E9], [-1.557513224E9], [-1.590526534E9], [-2.83593282E8], [-1.815588486E9], [-1.411500923E9], [-6.93210471E8], [-8.0411681E7], [-1.713286948E9], [-1.698342316E9], [-9.33829921E8], [-1.240833109E9], [-1.48476096E8], [-1.882820721E9], [-7.50906814E8], [1.99715481E8], [-1.997519365E9], [-1.380314819E9], [-5.92484283E8], [-1.869738298E9])]|[3,Lov is blind,WrappedArray(lov, is, blind),(262144,[15889,48831,81946],[1.0,1.0,1.0]),WrappedArray([-1.06555007E9], [-1.557513224E9], [-1.590526534E9], [-2.83593282E8], [-1.88316392E9], [-1.776275893E9], [-6.93210471E8], [-1.39927757E8], [-1.713286948E9], [-1.698342316E9], [-1.164990332E9], [-1.240833109E9], [-1.519529732E9], [-1.882820721E9], [-7.50906814E8], [1.99715481E8], [-1.036250081E9], [-1.380314819E9], [-1.808919173E9], [-1.869738298E9])]                            |0.5    |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
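
The distCol values line up with the Jaccard distance between the word sets (1 minus the size of the intersection divided by the size of the union), which is the distance MinHash works on. A quick check for "Love is blind" vs "Luv is blind" (the class and variable names below are only for illustration):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class JaccardCheck {
    public static void main(String[] args) {
        Set<String> a = new HashSet<>(Arrays.asList("love", "is", "blind"));
        Set<String> b = new HashSet<>(Arrays.asList("luv", "is", "blind"));

        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);                 // {is, blind} -> size 2

        Set<String> union = new HashSet<>(a);
        union.addAll(b);                           // {love, luv, is, blind} -> size 4

        double distCol = 1.0 - (double) intersection.size() / union.size();
        System.out.println(distCol);               // 0.5, matching the row for id 2
    }
}

The same calculation gives 0.25 for "blind love is divine" (three shared words out of four) and 0.5 for "God is love" (two shared words out of four), which matches the output above.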

Upvotes: 4

Views: 2579

Answers (1)

Alper t. Turker

Reputation: 35229

I have a few suggestions:

  • If you use NGram, consider a more granular tokenizer. The goal here is to correct for misspellings:

    RegexTokenizer regexTokenizer = new RegexTokenizer()
       .setInputCol("text")
       .setOutputCol("words")
       .setPattern("");
    
    NGram ngramTransformer = new NGram()
      .setN(3)
      .setInputCol("words")
      .setOutputCol("ngrams");
    

    With your current code (NGram(3) on a three-word sentence split by \W) you'll get only one n-gram per sentence and no similarity (see the combined sketch after this list).

  • Increase the number of hash tables (setNumHashTables) for LSH. The default value (1) is too small for anything but simple examples.

  • Normalize Unicode strings. There is a Scala Transformer in "What is the best way to remove accents with apache spark dataframes in PySpark?"

  • Remove capitalization. You can use SQLTransformer:

    import org.apache.spark.ml.feature.SQLTransformer
    
    val sqlTrans = new SQLTransformer().setStatement(
       "SELECT *, lower(normalized_text) FROM __THIS__")
    
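Putting the first two points together in the Java pipeline from the question could look like this (a sketch only; the column names are reused from the question, and 10 hash tables is just an example value). The lower-casing and Unicode normalization from the last two points can be prepended as extra stages in the same way.

    // Empty pattern: split the text into single characters, so NGram(3)
    // produces character trigrams instead of one token per three-word sentence.
    RegexTokenizer regexTokenizer = new RegexTokenizer()
        .setInputCol("text")
        .setOutputCol("words")
        .setPattern("");

    NGram ngramTransformer = new NGram()
        .setN(3)
        .setInputCol("words")
        .setOutputCol("ngrams");

    HashingTF hashingTF = new HashingTF()
        .setInputCol("ngrams")
        .setOutputCol("vectors");

    // More hash tables than the default of 1 (example value)
    MinHashLSH minHashLSH = new MinHashLSH()
        .setInputCol("vectors")
        .setOutputCol("lsh")
        .setNumHashTables(10);

    Pipeline pipeline = new Pipeline()
        .setStages(new PipelineStage[] { regexTokenizer, ngramTransformer, hashingTF, minHashLSH });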

Upvotes: 5
