Reputation: 7540
Is there an easy way how use explode
on array column on SparkSQL DataFrame
? It's relatively simple in Scala, but this function seems to be unavailable (as mentioned in javadoc) in Java.
An option is to use SQLContext.sql(...)
and explode
function inside the query, but I'm looking for a bit better and especially cleaner way. DataFrame
s are loaded from parquet files.
Upvotes: 9
Views: 14709
Reputation: 192
I have the similar problem with a little twist i have the data is ";" separated value , so I have to convert it in array then use explode on it
Dataset<Row> modelNum = csv.select("ROWNUM", "MODELNUM");
//converting string data type to array then using explode function to decompose
modelNum = modelNum.withColumn("MODEL", split(modelNum.col("MODELNUM"), ";"));
modelNum = modelNum.select(modelNum.col("ROWNUM"), explode(modelNum.col("MODEL")).alias("MODEL_"));
data looks like below before and after explode
+------+-------------+ |ROWNUM|MODEL_|
|ROWNUM| MODELNUM| +------+------+
+------+-------------+ | 1|1060008|
| 1|1060008;1261290| | 1|1261290|
| 2|1060008;1261290| | 2|1060008|
| 3|1060008;1261290| | 2|1261290|
| 3|1060008|
| 3|1261290|
Upvotes: 0
Reputation: 994
I solved it in this manner: say that you have an array column containing job descriptions named "positions", for each person with "fullName".
Then you get from initial schema :
root
|-- fullName: string (nullable = true)
|-- positions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- companyName: string (nullable = true)
| | |-- title: string (nullable = true)
...
to schema:
root
|-- personName: string (nullable = true)
|-- companyName: string (nullable = true)
|-- positionTitle: string (nullable = true)
by doing:
DataFrame personPositions = persons.select(persons.col("fullName").as("personName"),
org.apache.spark.sql.functions.explode(persons.col("positions")).as("pos"));
DataFrame test = personPositions.select(personPositions.col("personName"),
personPositions.col("pos").getField("companyName").as("companyName"), personPositions.col("pos").getField("title").as("positionTitle"));
Upvotes: 20
Reputation: 7540
It seems it is possible to use a combination of org.apache.spark.sql.functions.explode(Column col)
and DataFrame.withColumn(String colName, Column col)
to replace the column with the exploded version of it.
Upvotes: 6