meernet

Reputation: 386

How to transform Dataset[(String, Seq[String])] to Dataset[(String, String)]?


This is probably a simple problem, but I am just beginning my adventure with Spark.

Problem: I'd like to get the following structure (expected result) in Spark. Currently I have this structure:

title1, {word11, word12, word13 ...}
title2, {word21, word22, word23 ...}

The data is stored in a Dataset[(String, Seq[String])].

Expected result: I would like to get tuples of (word, title):

word11, {title1}
word12, {title1}

What I have done:
1. Built (title, Seq[word1, word2, word3]):

docs.mapPartitions { iter =>
  iter.map {
     case (title, contents) => {
        val textToLemmas: Seq[String] = toText(....)
        (title, textToLemmas)
     }
  }
}
  2. I tried to use .map to transform my structure into tuples, but I couldn't get it to work.
  3. I tried to iterate through all the elements, but then I could not return the right type.

Thanks for any answers.

Upvotes: 3

Views: 731

Answers (3)

Jacek Laskowski

Reputation: 74739

I'm surprised no one offered a solution with Scala's for-comprehension (which gets "desugared" at compile time to the flatMap and map calls in Yuval Itzchakov's answer).

When you see a series of flatMap and map calls (possibly with filter), that's a candidate for Scala's for-comprehension.

So the following:

val result = dataSet.flatMap { case (title, words) => words.map((_, title)) }

is equivalent to the following:

val result = for {
  (title, words) <- dataSet
  w <- words
} yield (w, title)

After all, that's why we enjoy the flexibility of Scala, isn't it?
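The same desugaring applies to ordinary Scala collections, so the equivalence is easy to check without a Spark cluster. A minimal sketch using a plain Seq shaped like the question's data (the sample titles and words are made up for illustration):

```scala
// Sample data shaped like the question's Dataset, but as a plain Seq.
val docs = Seq(
  ("title1", Seq("word11", "word12")),
  ("title2", Seq("word21", "word22"))
)

// flatMap/map version.
val viaFlatMap = docs.flatMap { case (title, words) => words.map((_, title)) }

// Equivalent for-comprehension; the compiler desugars it to the form above.
val viaFor = for {
  (title, words) <- docs
  w <- words
} yield (w, title)

assert(viaFlatMap == viaFor)
// Both yield: List((word11,title1), (word12,title1), (word21,title2), (word22,title2))
```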

Upvotes: 2

Haroun Mohammedi

Reputation: 2434

Another solution is to use the explode function, like this:

import org.apache.spark.sql.functions.explode
import spark.implicits._  // needed for $"..." and .as[...]

dataset.withColumn("_2", explode($"_2")).as[(String, String)]

Note that explode takes a Column, not a column name, so explode("_2") would not compile; use $"_2" (or col("_2")) instead.

Hope this helps you. Best regards.

Upvotes: 2

Yuval Itzchakov

Reputation: 149598

This should work:

val result = dataSet.flatMap { case (title, words) => words.map((_, title)) }

Upvotes: 3
