Get unique words from Spark Dataset in Java

Question

I'm using Apache Spark 2 to tokenize some text.

Dataset<Row> regexTokenized = regexTokenizer.transform(data);

It returns a column containing an array of strings.

Dataset<Row> words = regexTokenized.select("words");

Sample data looks like this:

+--------------------+
|               words|
+--------------------+
|[very, caring, st...|
|[the, grand, cafe...|
|[i, booked, a, no...|
|[wow, the, places...|
|[if, you, are, ju...|
+--------------------+

Now I want to get all the unique words. I tried a couple of filter, flatMap, map and reduce functions, but I couldn't figure it out because I'm new to Spark.

Haroun Mohammedi · Accepted Answer

I'm coming from Scala, but I believe there's a similar way to do this in Java.

I think in this case you have to use the explode function, which produces one row per array element, in order to transform your data into a Dataset of words.

This code should give you the desired results:

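import org.apache.spark.sql.functions.{col, explode}
// explode takes a Column and emits one row per array element
val dsWords = regexTokenized.select(explode(col("words")).as("word"))
// distinct() drops duplicate rows, leaving each word once
val dsUniqueWords = dsWords.distinct()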
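
Since the question asks about Java, here is a rough sketch of what the same approach might look like there. The class name UniqueWords, the helper uniqueWords, the "word" alias, the local master setting and the two sample rows are all made up for illustration; in the asker's code, regexTokenized would come from regexTokenizer.transform(data) instead.

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class UniqueWords {

    // Hypothetical helper: one row per word from the "words" array column,
    // then duplicates removed.
    static Dataset<Row> uniqueWords(Dataset<Row> tokenized) {
        return tokenized
                .select(explode(col("words")).as("word"))
                .distinct();
    }

    public static void main(String[] args) {
        // Local session for illustration only; a real job would configure its own.
        SparkSession spark = SparkSession.builder()
                .appName("UniqueWords")
                .master("local[*]")
                .getOrCreate();

        // Made-up stand-in for the RegexTokenizer output.
        StructType schema = new StructType()
                .add("words", DataTypes.createArrayType(DataTypes.StringType));
        List<Row> rows = Arrays.asList(
                RowFactory.create(Arrays.asList("the", "grand", "cafe")),
                RowFactory.create(Arrays.asList("wow", "the", "places")));
        Dataset<Row> regexTokenized = spark.createDataFrame(rows, schema);

        Dataset<Row> dsUniqueWords = uniqueWords(regexTokenized);
        dsUniqueWords.show();

        // Optionally pull the words back to the driver as plain Java strings.
        // Only sensible when the vocabulary fits in driver memory.
        List<String> words = dsUniqueWords.as(Encoders.STRING()).collectAsList();
        System.out.println(words.size() + " unique words");

        spark.stop();
    }
}
```

The row order after distinct() is not guaranteed, so add an orderBy if you need the words sorted.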

For more information about the explode function, please refer to the official documentation.

Hope it helps.
