Reputation: 4328
I'm using Apache Spark 2 to tokenize some text.
Dataset<Row> regexTokenized = regexTokenizer.transform(data);
It returns a column containing an array of strings.
Dataset<Row> words = regexTokenized.select("words");
Sample data looks like this:
+--------------------+
|               words|
+--------------------+
|[very, caring, st...|
|[the, grand, cafe...|
|[i, booked, a, no...|
|[wow, the, places...|
|[if, you, are, ju...|
Now I want to get all the unique words. I tried a few combinations of filter, flatMap, map, and reduce, but I couldn't figure it out because I'm new to Spark.
Upvotes: 0
Views: 1671
Reputation: 4328
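Based on @Haroun Mohammedi's answer, I was able to figure it out in Java.
import static org.apache.spark.sql.functions.explode;
// explode emits one row per array element; distinct then drops duplicate words
Dataset<Row> uniqueWords = regexTokenized.select(explode(regexTokenized.col("words"))).distinct();
uniqueWords.show();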
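For anyone who wants to see what explode followed by distinct does to the data, here is a minimal sketch using plain Java collections instead of Spark. The class name UniqueWordsSketch, the method uniqueWords, and the sample rows are all made up for illustration; this only mirrors the semantics and is not how Spark executes the query.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper class, not part of Spark: it imitates explode + distinct
// on an in-memory list of token lists.
public class UniqueWordsSketch {

    // Each inner list plays the role of one row's "words" array.
    static List<String> uniqueWords(List<List<String>> rows) {
        return rows.stream()
                .flatMap(List::stream)   // like explode: one element per word
                .distinct()              // like distinct: drop repeated words
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up sample rows, not the data from the question.
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("the", "cafe", "the"),
                Arrays.asList("wow", "cafe"));
        System.out.println(uniqueWords(rows)); // prints [the, cafe, wow]
    }
}
```

One difference to keep in mind: the stream version keeps the first-seen order of the words, while Spark's distinct() makes no promise about row order.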
Upvotes: 2
Reputation: 2424
I'm coming from Scala, but I believe there's a similar way to do this in Java.
I think in this case you have to use the explode method to transform your data into a Dataset of words.
This code should give you the desired result:
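import org.apache.spark.sql.functions.{col, explode}
// explode takes a Column (not a String) and emits one row per array element
val dsWords = regexTokenized.select(explode(col("words")))
// keep each word only once
val dsUniqueWords = dsWords.distinct()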
For more information about the explode method, please refer to the official documentation.
Hope it helps.
Upvotes: 1