Edamame
Edamame

Reputation: 25366

Spark-SQL : Access array elements storing within a cell in a data frame

root
 |
 |-- dogs: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _1: struct (nullable = true)
 |    |    |    |-- name: string (nullable = true)
 |    |    |    |-- color: string (nullable = true)
 |    |    |    |-- sources: array (nullable = true)
 |    |    |    |    |-- element: string (containsNull = true)
 |    |    |-- _2: age (nullable = true)

Which shows below with data.select("dogs").show(2,False)

+---------------------------------------------------------------------------------+
|names                                                                            |
+---------------------------------------------------------------------------------+
|[[[Max,White,WrappedArray(SanDiego)],3], [[Spot,Black,WrappedArray(SanDiego)],2]]|
|[[[Michael,Black,WrappedArray(SanJose)],1]]                                      |
+---------------------------------------------------------------------------------+
only showing top 2 rows

I am wondering if it is possible to access the array elements in each cell? For example, I want to retrieve (Max, white), (Spot, Black) and (Michael, Black) from the dogs column.

In additional, I would like to expand the rows with n elements to n rows if possible.

Thanks!

Upvotes: 2

Views: 3570

Answers (1)

David
David

Reputation: 11593

You can use explode as below to get access to a dataframe with each row being a record from the array.

data.registerTempTable("data")
dataExplode = sqlContext.sql("select explode(dogs) as dog from data")
dataExplode.show()

Then, you can use select to obtain just the columns you are interested in.

Upvotes: 3

Related Questions