I am trying the example from Efficient string matching in Apache Spark in Java. This is my code:
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.Transformer;
import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.MinHashLSH;
import org.apache.spark.ml.feature.MinHashLSHModel;
import org.apache.spark.ml.feature.NGram;
import org.apache.spark.ml.feature.RegexTokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class App {

    public static void main(String[] args) {
        System.out.println("Hello World!");
        System.setProperty("hadoop.home.dir", "D:\\del");

        List<MyRecord> firstRow = new ArrayList<MyRecord>();
        firstRow.add(new App().new MyRecord("1", "Love is blind"));
        List<MyRecord> secondRow = new ArrayList<MyRecord>();
        secondRow.add(new App().new MyRecord("1", "Luv is blind"));

        SparkSession spark = SparkSession.builder()
                .appName("LSHExample")
                .config("spark.master", "local")
                .getOrCreate();

        Dataset<Row> firstDataFrame = spark.createDataFrame(firstRow, MyRecord.class);
        Dataset<Row> secondDataFrame = spark.createDataFrame(secondRow, MyRecord.class);
        firstDataFrame.show(20, false);
        secondDataFrame.show(20, false);

        // Tokenize on non-word characters, build 3-grams, hash them into
        // sparse vectors, and fit MinHash LSH on top
        RegexTokenizer regexTokenizer = new RegexTokenizer()
                .setInputCol("text")
                .setOutputCol("words")
                .setPattern("\\W");
        NGram ngramTransformer = new NGram()
                .setN(3)
                .setInputCol("words")
                .setOutputCol("ngrams");
        HashingTF hashingTF = new HashingTF()
                .setInputCol("ngrams")
                .setOutputCol("vectors");
        MinHashLSH minHashLSH = new MinHashLSH()
                .setInputCol("vectors")
                .setOutputCol("lsh");
        Pipeline pipeline = new Pipeline()
                .setStages(new PipelineStage[] { regexTokenizer, ngramTransformer, hashingTF, minHashLSH });
        PipelineModel model = pipeline.fit(firstDataFrame);

        Dataset<Row> dataset1 = model.transform(firstDataFrame);
        dataset1.show(20, false);
        Dataset<Row> dataset2 = model.transform(secondDataFrame);
        dataset2.show(20, false);

        // The last fitted stage is the MinHashLSHModel; use it for the similarity join
        Transformer[] transformers = model.stages();
        MinHashLSHModel temp = (MinHashLSHModel) transformers[transformers.length - 1];
        temp.approxSimilarityJoin(dataset1, dataset2, 0.01).show(20, false);
    }

    protected class MyRecord {
        private String id;
        private String text;

        private MyRecord(String id, String text) {
            this.id = id;
            this.text = text;
        }

        public String getId() {
            return id;
        }

        public String getText() {
            return text;
        }
    }
}
Before invoking approxSimilarityJoin, the two datasets look like this:
Transformed Dataset A
+---+-------------+-----------------+---------------+-----------------------+----------------+
|id |text |words |ngrams |vectors |lsh |
+---+-------------+-----------------+---------------+-----------------------+----------------+
|1 |Love is blind|[love, is, blind]|[love is blind]|(262144,[243005],[1.0])|[[2.02034596E9]]|
+---+-------------+-----------------+---------------+-----------------------+----------------+
Transformed Dataset B
+---+------------+----------------+--------------+----------------------+----------------+
|id |text |words |ngrams |vectors |lsh |
+---+------------+----------------+--------------+----------------------+----------------+
|2 |Luv is blind|[luv, is, blind]|[luv is blind]|(262144,[57733],[1.0])|[[7.79808048E8]]|
+---+------------+----------------+--------------+----------------------+----------------+
Although the two texts "Love is blind" and "Luv is blind" are nearly identical, I get the empty output below.
+--------+--------+-------+
|datasetA|datasetB|distCol|
+--------+--------+-------+
+--------+--------+-------+
Please let me know if there is any mistake in the above code.
I tested by giving the same input to both datasets, and below is the output: distCol is zero when both datasets contain the same text.
+--------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------+-------+
|datasetA |datasetB |distCol|
+--------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------+-------+
|[1,Love is blind,WrappedArray(love, is, blind),WrappedArray(love is blind),(262144,[243005],[1.0]),WrappedArray([2.02034596E9])]|[2,Love is blind,WrappedArray(love, is, blind),WrappedArray(love is blind),(262144,[243005],[1.0]),WrappedArray([2.02034596E9])]|0.0 |
+--------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------+-------+
The example below also uses the same concept.
I think I am missing some fundamental aspect in this program. Please advise.
Update: it worked, based on the suggestions given by user8371915. I removed the NGram stage and increased numHashTables:
MinHashLSH minHashLSH = new MinHashLSH().setInputCol("features").setOutputCol("hashValues").setNumHashTables(20);
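For completeness, the revised pipeline looks roughly like this (a sketch: I assume the HashingTF output column was renamed to "features" to match the MinHashLSH input above):

RegexTokenizer regexTokenizer = new RegexTokenizer()
        .setInputCol("text")
        .setOutputCol("words")
        .setPattern("\\W");
// HashingTF now consumes the word tokens directly; there is no NGram stage
HashingTF hashingTF = new HashingTF()
        .setInputCol("words")
        .setOutputCol("features");
MinHashLSH minHashLSH = new MinHashLSH()
        .setInputCol("features")
        .setOutputCol("hashValues")
        .setNumHashTables(20);
Pipeline pipeline = new Pipeline()
        .setStages(new PipelineStage[] { regexTokenizer, hashingTF, minHashLSH });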
Now I can see how this matching works.
Below are my two datasets:
Dataset A
+---+-------------+
|id |text |
+---+-------------+
|1 |Love is blind|
+---+-------------+
Dataset B
+---+-------------------------+
|id |text |
+---+-------------------------+
|1 |Love is blind |
|2 |Luv is blind |
|3 |Lov is blind |
|4 |This is totally different|
|5 |God is love |
|6 |blind love is divine |
+---+-------------------------+
And the final output is:
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
|datasetA |datasetB |distCol|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
|[1,Love is blind,WrappedArray(love, is, blind),(262144,[15889,48831,186480],[1.0,1.0,1.0]),WrappedArray([-1.06555007E9], [-1.557513224E9], [-1.590526534E9], [-2.83593282E8], [-1.815588486E9], [-1.411500923E9], [-6.93210471E8], [-8.0411681E7], [-1.713286948E9], [-1.698342316E9], [-9.33829921E8], [-1.240833109E9], [-1.48476096E8], [-1.882820721E9], [-7.50906814E8], [1.99715481E8], [-1.997519365E9], [-1.380314819E9], [-5.92484283E8], [-1.869738298E9])]|[1,Love is blind,WrappedArray(love, is, blind),(262144,[15889,48831,186480],[1.0,1.0,1.0]),WrappedArray([-1.06555007E9], [-1.557513224E9], [-1.590526534E9], [-2.83593282E8], [-1.815588486E9], [-1.411500923E9], [-6.93210471E8], [-8.0411681E7], [-1.713286948E9], [-1.698342316E9], [-9.33829921E8], [-1.240833109E9], [-1.48476096E8], [-1.882820721E9], [-7.50906814E8], [1.99715481E8], [-1.997519365E9], [-1.380314819E9], [-5.92484283E8], [-1.869738298E9])] |0.0 |
|[1,Love is blind,WrappedArray(love, is, blind),(262144,[15889,48831,186480],[1.0,1.0,1.0]),WrappedArray([-1.06555007E9], [-1.557513224E9], [-1.590526534E9], [-2.83593282E8], [-1.815588486E9], [-1.411500923E9], [-6.93210471E8], [-8.0411681E7], [-1.713286948E9], [-1.698342316E9], [-9.33829921E8], [-1.240833109E9], [-1.48476096E8], [-1.882820721E9], [-7.50906814E8], [1.99715481E8], [-1.997519365E9], [-1.380314819E9], [-5.92484283E8], [-1.869738298E9])]|[2,Luv is blind,WrappedArray(luv, is, blind),(262144,[15889,48831,84987],[1.0,1.0,1.0]),WrappedArray([-2.021501434E9], [-1.557513224E9], [-1.590526534E9], [-2.83593282E8], [-1.815588486E9], [-6.70773282E8], [-6.93210471E8], [-1.205754635E9], [-1.713286948E9], [-1.698342316E9], [-9.33829921E8], [-1.240833109E9], [4.46435174E8], [-1.882820721E9], [-7.50906814E8], [1.99715481E8], [-1.036250081E9], [-1.380314819E9], [-5.92484283E8], [-1.869738298E9])] |0.5 |
|[1,Love is blind,WrappedArray(love, is, blind),(262144,[15889,48831,186480],[1.0,1.0,1.0]),WrappedArray([-1.06555007E9], [-1.557513224E9], [-1.590526534E9], [-2.83593282E8], [-1.815588486E9], [-1.411500923E9], [-6.93210471E8], [-8.0411681E7], [-1.713286948E9], [-1.698342316E9], [-9.33829921E8], [-1.240833109E9], [-1.48476096E8], [-1.882820721E9], [-7.50906814E8], [1.99715481E8], [-1.997519365E9], [-1.380314819E9], [-5.92484283E8], [-1.869738298E9])]|[5,God is love,WrappedArray(god, is, love),(262144,[15889,57304,186480],[1.0,1.0,1.0]),WrappedArray([-7.6253133E7], [-2.6669178E7], [-1.590526534E9], [-2.83593282E8], [-1.060055906E9], [-1.411500923E9], [-9.83191394E8], [-8.0411681E7], [-1.04032919E9], [-1.373403353E9], [-5.63413619E8], [-1.240833109E9], [-1.48476096E8], [-1.7390215E9], [-1.745820849E9], [8.1559665E7], [-1.997519365E9], [-1.635066748E9], [6.38995945E8], [-1.59718287E9])] |0.5 |
|[1,Love is blind,WrappedArray(love, is, blind),(262144,[15889,48831,186480],[1.0,1.0,1.0]),WrappedArray([-1.06555007E9], [-1.557513224E9], [-1.590526534E9], [-2.83593282E8], [-1.815588486E9], [-1.411500923E9], [-6.93210471E8], [-8.0411681E7], [-1.713286948E9], [-1.698342316E9], [-9.33829921E8], [-1.240833109E9], [-1.48476096E8], [-1.882820721E9], [-7.50906814E8], [1.99715481E8], [-1.997519365E9], [-1.380314819E9], [-5.92484283E8], [-1.869738298E9])]|[6,blind love is divine,WrappedArray(blind, love, is, divine),(262144,[15889,25596,48831,186480],[1.0,1.0,1.0,1.0]),WrappedArray([-1.06555007E9], [-1.557513224E9], [-1.590526534E9], [-1.627956291E9], [-1.815588486E9], [-1.411500923E9], [-6.93210471E8], [-8.0411681E7], [-1.713286948E9], [-1.698342316E9], [-9.33829921E8], [-1.240833109E9], [-1.93451596E9], [-1.882820721E9], [-7.50906814E8], [-1.152091375E9], [-1.997519365E9], [-1.380314819E9], [-8.50494401E8], [-1.869738298E9])]|0.25 |
|[1,Love is blind,WrappedArray(love, is, blind),(262144,[15889,48831,186480],[1.0,1.0,1.0]),WrappedArray([-1.06555007E9], [-1.557513224E9], [-1.590526534E9], [-2.83593282E8], [-1.815588486E9], [-1.411500923E9], [-6.93210471E8], [-8.0411681E7], [-1.713286948E9], [-1.698342316E9], [-9.33829921E8], [-1.240833109E9], [-1.48476096E8], [-1.882820721E9], [-7.50906814E8], [1.99715481E8], [-1.997519365E9], [-1.380314819E9], [-5.92484283E8], [-1.869738298E9])]|[3,Lov is blind,WrappedArray(lov, is, blind),(262144,[15889,48831,81946],[1.0,1.0,1.0]),WrappedArray([-1.06555007E9], [-1.557513224E9], [-1.590526534E9], [-2.83593282E8], [-1.88316392E9], [-1.776275893E9], [-6.93210471E8], [-1.39927757E8], [-1.713286948E9], [-1.698342316E9], [-1.164990332E9], [-1.240833109E9], [-1.519529732E9], [-1.882820721E9], [-7.50906814E8], [1.99715481E8], [-1.036250081E9], [-1.380314819E9], [-1.808919173E9], [-1.869738298E9])] |0.5 |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
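As far as I can tell, the hash values only decide which candidate pairs get compared; the distCol reported by approxSimilarityJoin is the exact Jaccard distance between the sets of non-zero HashingTF indices. A plain-Java check against the indices shown above reproduces the distances (the class and variable names here are mine, for illustration):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class JaccardCheck {

    // Jaccard distance = 1 - |A ∩ B| / |A ∪ B|
    static double jaccardDistance(Set<Integer> a, Set<Integer> b) {
        Set<Integer> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<Integer> union = new HashSet<>(a);
        union.addAll(b);
        return 1.0 - (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        // Non-zero HashingTF indices copied from the transformed rows above
        Set<Integer> loveIsBlind = new HashSet<>(Arrays.asList(15889, 48831, 186480));
        Set<Integer> luvIsBlind = new HashSet<>(Arrays.asList(15889, 48831, 84987));
        Set<Integer> blindLoveIsDivine = new HashSet<>(Arrays.asList(15889, 25596, 48831, 186480));

        System.out.println(jaccardDistance(loveIsBlind, luvIsBlind));        // 0.5
        System.out.println(jaccardDistance(loveIsBlind, blindLoveIsDivine)); // 0.25
    }
}

For example, "Love is blind" and "blind love is divine" share 3 of their 4 distinct indices, giving 1 - 3/4 = 0.25, exactly the distCol value in the table above.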
I have a few suggestions:
If you use NGram, consider a more granular tokenizer. The goal here is to correct for misspellings:
RegexTokenizer regexTokenizer = new RegexTokenizer()
.setInputCol("text")
.setOutputCol("words")
.setPattern("");
NGram ngramTransformer = new NGram()
.setN(3)
.setInputCol("words")
.setOutputCol("ngrams");
With your current code (NGram(3) and a three-word sentence split on \W) you'll get only one token per sentence and no similarity.
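To see why the character-level tokenizer helps, here is a plain-Java sketch (not Spark) of what setPattern("") followed by NGram(3) would produce; note that RegexTokenizer also lowercases its input by default:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CharNGramSketch {
    public static void main(String[] args) {
        String text = "love is blind";
        List<String> chars = Arrays.asList(text.split(""));  // one token per character
        List<String> ngrams = new ArrayList<>();
        for (int i = 0; i + 3 <= chars.size(); i++) {
            ngrams.add(String.join(" ", chars.subList(i, i + 3)));
        }
        // 11 overlapping character trigrams; "luv is blind" shares 7 of them,
        // so the two sentences now have plenty of tokens in common
        System.out.println(ngrams);
    }
}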
Increase the number of hash tables (setNumHashTables) for LSH. The default value (1) is too small for anything but trivial examples.
Normalize Unicode strings. There is a Scala Transformer for this in What is the best way to remove accents with apache spark dataframes in PySpark?
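The core of that approach in plain Java (a sketch; the linked answer wraps the same idea in a custom Spark Transformer):

import java.text.Normalizer;

public class StripAccents {
    public static void main(String[] args) {
        // Decompose accented characters (NFD), then strip the combining marks
        String normalized = Normalizer.normalize("Lòve ïs blínd", Normalizer.Form.NFD)
                .replaceAll("\\p{M}", "");
        System.out.println(normalized); // Love is blind
    }
}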
Remove capitalization. You can use SQLTransformer:
import org.apache.spark.ml.feature.SQLTransformer
val sqlTrans = new SQLTransformer().setStatement(
"SELECT *, lower(normalized_text) FROM __THIS__")