Pavel Orlov
Pavel Orlov

Reputation: 23

Spark explode multiple columns of row in multiple rows

I have a problem with converting one row using three 3 columns into 3 rows
For example:

 <pre>
<b>ID</b> |  <b>String</b>  |  <b>colA</b> | <b>colB</b> | <b>colC</b>
<em>1</em>  | <em>sometext</em> |   <em>1</em>   |  <em>2</em>   |  <em>3</em>
</pre>

I need to convert it into:

<pre>
<b>ID</b> |  <b>String</b>  |  <b>resultColumn</b>
<em>1</em>  | <em>sometext</em> |     <em>1</em>  
<em>1</em>  | <em>sometext</em> |     <em>2</em>   
<em>1</em>  | <em>sometext</em> |     <em>3</em>    
</pre>

I just have dataFrame which is connected with first schema(table).

val df: dataFrame

Note: I can do it using RDD, but do we have other way? Thanks

Upvotes: 1

Views: 6000

Answers (1)

greg hor
greg hor

Reputation: 792

Assuming that df has the schema of your first snippet, I would try:

df.select($"ID", $"String", explode(array($"colA", $"colB",$"colC")).as("resultColumn"))

I you further want to keep the column names, you can use a trick that consists in creating a column of arrays that contains the array of the value and the name. First create your expression

val expr = explode(array(array($"colA", lit("colA")), array($"colB", lit("colB")), array($"colC", lit("colC"))))

then use getItem (since you can not use generator on nested expressions, you need 2 select here)

df.select($"ID, $"String", expr.as("tmp")).select($"ID", $"String", $"tmp".getItem(0).as("resultColumn"), $"tmp".getItem(1).as("columnName"))

It is a bit verbose though, there might be more elegant way to do this.

Upvotes: 1

Related Questions