JiriS

Reputation: 7540

SparkSQL and explode on DataFrame in Java

Is there an easy way to use explode on an array column of a SparkSQL DataFrame? It's relatively simple in Scala, but this function seems to be unavailable in Java (as the javadoc mentions).

One option is to use SQLContext.sql(...) with the explode function inside the query, but I'm looking for a somewhat better and especially cleaner way. The DataFrames are loaded from Parquet files.

Upvotes: 9

Views: 14709

Answers (3)

JustTry

Reputation: 192

I had a similar problem with a small twist: my data is a ";"-separated value, so I first have to convert it to an array and then use explode on it:

// requires: import static org.apache.spark.sql.functions.split;
//           import static org.apache.spark.sql.functions.explode;
Dataset<Row> modelNum = csv.select("ROWNUM", "MODELNUM");
// convert the ";"-separated string to an array, then explode it into one row per element
modelNum = modelNum.withColumn("MODEL", split(modelNum.col("MODELNUM"), ";"));
modelNum = modelNum.select(modelNum.col("ROWNUM"), explode(modelNum.col("MODEL")).alias("MODEL_"));

The data looks like this before and after the explode:

+------+---------------+
|ROWNUM|       MODELNUM|
+------+---------------+
|     1|1060008;1261290|
|     2|1060008;1261290|
|     3|1060008;1261290|
+------+---------------+

+------+-------+
|ROWNUM| MODEL_|
+------+-------+
|     1|1060008|
|     1|1261290|
|     2|1060008|
|     2|1261290|
|     3|1060008|
|     3|1261290|
+------+-------+

Upvotes: 0

marilena.oita

Reputation: 994

I solved it in this manner: say you have an array column named "positions" that contains the job descriptions for each person identified by "fullName".

Then you go from the initial schema:

root
 |-- fullName: string (nullable = true)
 |-- positions: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- companyName: string (nullable = true)
 |    |    |-- title: string (nullable = true)
...

to schema:

root
 |-- personName: string (nullable = true)
 |-- companyName: string (nullable = true)
 |-- positionTitle: string (nullable = true)

by doing:

DataFrame personPositions = persons.select(
    persons.col("fullName").as("personName"),
    org.apache.spark.sql.functions.explode(persons.col("positions")).as("pos"));

DataFrame test = personPositions.select(
    personPositions.col("personName"),
    personPositions.col("pos").getField("companyName").as("companyName"),
    personPositions.col("pos").getField("title").as("positionTitle"));

Upvotes: 20

JiriS

Reputation: 7540

It turns out it is possible to combine org.apache.spark.sql.functions.explode(Column col) with DataFrame.withColumn(String colName, Column col) to replace a column with its exploded version.
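A minimal sketch of that idea, written against the Spark 2.x Dataset<Row> API (the class name, column names, and sample data below are all made up for illustration; the original question used the older DataFrame type, for which the same calls apply):

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import static org.apache.spark.sql.functions.explode;

public class ExplodeInPlace {

    // Build a tiny DataFrame with an array column, then replace that
    // column with its exploded version using withColumn + explode.
    static Dataset<Row> buildExploded(SparkSession spark) {
        StructType schema = new StructType(new StructField[]{
            new StructField("id", DataTypes.IntegerType, false, Metadata.empty()),
            new StructField("values",
                DataTypes.createArrayType(DataTypes.IntegerType), true,
                Metadata.empty())
        });
        List<Row> rows = Arrays.asList(
            RowFactory.create(1, new Integer[]{10, 20}),
            RowFactory.create(2, new Integer[]{30}));
        Dataset<Row> df = spark.createDataFrame(rows, schema);

        // Reusing the same column name replaces "values" in place:
        // each array element becomes its own row, "id" is duplicated.
        return df.withColumn("values", explode(df.col("values")));
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .master("local[1]")
            .appName("explode-demo")
            .getOrCreate();
        buildExploded(spark).show();
        spark.stop();
    }
}
```

The only subtlety is that passing an existing column name to withColumn overwrites that column, so no extra select or drop is needed afterwards.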

Upvotes: 6
