Damith Ganegoda

Reputation: 4328

Get unique words from Spark Dataset in Java

I'm using Apache Spark 2 to tokenize some text.

Dataset<Row> regexTokenized = regexTokenizer.transform(data);

It returns an array of strings in the `words` column.

Dataset<Row> words = regexTokenized.select("words");

The sample data looks like this:

+--------------------+
|               words|
+--------------------+
|[very, caring, st...|
|[the, grand, cafe...|
|[i, booked, a, no...|
|[wow, the, places...|
|[if, you, are, ju...|
+--------------------+

Now I want to get all the unique words. I tried a couple of approaches with filter, flatMap, map, and reduce, but I couldn't figure it out because I'm new to Spark.

Upvotes: 0

Views: 1671

Answers (2)

Damith Ganegoda

Reputation: 4328

Based on @Haroun Mohammedi's answer, I was able to figure it out in Java:

import static org.apache.spark.sql.functions.explode;

Dataset<Row> uniqueWords = regexTokenized.select(explode(regexTokenized.col("words"))).distinct();
uniqueWords.show();
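For anyone who wants to see the explode-then-distinct idea without a Spark cluster, here is a minimal sketch of the same flatten-and-dedupe logic in plain Java streams (the class name and sample rows are made up for illustration; Spark does this distributed, this is only the concept):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class UniqueWords {

    // Mirrors explode(...).distinct(): flatten each row's word array
    // into a single stream, then keep each word only once.
    static List<String> uniqueWords(List<List<String>> rows) {
        return rows.stream()
                .flatMap(List::stream)   // "explode": one element per word
                .distinct()              // drop duplicate words
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("very", "caring", "staff"),
                Arrays.asList("the", "grand", "cafe"),
                Arrays.asList("the", "staff", "smiled"));
        // Prints each word once, in first-seen order
        System.out.println(uniqueWords(rows));
    }
}
```

`distinct()` on an ordered stream preserves the first occurrence of each word, so the output order matches the order the words first appear in the rows.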

Upvotes: 2

Haroun Mohammedi

Reputation: 2424

I'm coming from Scala, but I do believe there's a similar way in Java.

I think in this case you have to use the explode method to transform your data into a Dataset of words.

This code should give you the desired results :

import org.apache.spark.sql.functions.{col, explode}

// explode takes a Column, not a String, so wrap the column name with col(...)
val dsWords = regexTokenized.select(explode(col("words")))
val dsUniqueWords = dsWords.distinct()

For information about the explode method, please refer to the official documentation.

Hope it helps.

Upvotes: 1
