Reputation: 1
I would like to know whether two long strings have any words in common, using Spark (Java API).
String string1 = "car bike bus ...";  // about 100 words
String string2 = "boat plane car ..."; // about 100 words
How could I do this?
I have written an approach, but I think it is not efficient (too many iterations):
List<String> a1 = new ArrayList<>();
List<String> a2 = new ArrayList<>();
a1.add("car");
a1.add("boat");
a1.add("bike");
a2.add("car");
a2.add("nada");
a2.add("otro");

JavaRDD<String> rdd = jsc.parallelize(a1);
// Keep only the elements of a1 that also appear somewhere in a2
JavaRDD<String> counts = rdd.filter(new Function<String, Boolean>() {
    @Override
    public Boolean call(String s) throws Exception {
        Boolean occurrence = false;
        for (int i = 0; i < a2.size(); i++) {
            if (StringUtils.containsIgnoreCase(s, a2.get(i))) {
                System.out.println("found");
                occurrence = true;
                break;
            }
        }
        return occurrence;
    }
});
System.out.println(counts.count());
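(For reference, the hard-coded lists a1 and a2 above stand in for the two long strings; a minimal sketch of the split step, assuming the words are whitespace-separated:)

import java.util.Arrays;
import java.util.List;

// Split each ~100-word string into a list of words (assumes whitespace separators)
List<String> a1 = Arrays.asList(string1.split("\\s+"));
List<String> a2 = Arrays.asList(string2.split("\\s+"));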
Upvotes: 0
Views: 286
Reputation: 15297
You can use the intersect method, which is available for both RDD and Dataset. Below is a sample using Spark 2.0, Java, and the Dataset API.
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class SparkIntersection {
    public static void main(String[] args) {
        // SparkSession
        SparkSession spark = SparkSession
                .builder()
                .appName("SparkIntersection")
                .config("spark.sql.warehouse.dir", "/file:C:/temp")
                .master("local[*]")
                .getOrCreate();
        // Input lists
        List<String> data1 = Arrays.asList("one", "two", "three", "four", "five");
        List<String> data2 = Arrays.asList("one", "six", "three", "nine", "ten");
        // Datasets
        Dataset<String> ds1 = spark.createDataset(data1, Encoders.STRING());
        Dataset<String> ds2 = spark.createDataset(data2, Encoders.STRING());
        // Intersect: keeps only the elements present in both Datasets
        Dataset<String> ds = ds1.intersect(ds2);
        ds.show();
        // Stop the session
        spark.stop();
    }
}
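The same idea with the RDD API; a minimal sketch, assuming a JavaSparkContext named jsc (not shown in the original answer):

import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// A JavaSparkContext could be obtained from the SparkSession above:
// JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
JavaRDD<String> rdd1 = jsc.parallelize(Arrays.asList("one", "two", "three", "four", "five"));
JavaRDD<String> rdd2 = jsc.parallelize(Arrays.asList("one", "six", "three", "nine", "ten"));

// intersection() shuffles and de-duplicates, returning only the common elements
JavaRDD<String> common = rdd1.intersection(rdd2);
System.out.println(common.collect()); // e.g. [one, three] (order not guaranteed)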
Upvotes: 0