Reputation: 502
I grouped by a few columns and am getting WrappedArray out of those columns, as you can see in the schema below. How do I get rid of the arrays so I can proceed to the next step and do an orderBy?
val sqlDF = spark.sql("SELECT * FROM parquet.`parquet/20171009121227/rels/*.parquet`")
That gives a DataFrame. After grouping, I select the collected columns:
val final_df = groupedBy_DF.select(
groupedBy_DF("collect_list(relev)").as("rel"),
groupedBy_DF("collect_list(relev2)").as("rel2"))
Then printing the schema with final_df.printSchema gives:
|-- rel: array (nullable = true)
| |-- element: double (containsNull = true)
|-- rel2: array (nullable = true)
| |-- element: double (containsNull = true)
Sample current output (screenshot omitted; the columns show up as WrappedArray values).
I am trying to convert to this:
|-- rel: double (nullable = true)
|-- rel2: double (nullable = true)
Desired example output:
-1.0,0.0
-1.0,0.0
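For reference, a minimal sketch that reproduces the array schema described above, using made-up data ("key", and the sample column names relev/relev2, stand in for the real parquet columns):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.collect_list

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical sample data standing in for the parquet rows
val df = Seq(("a", -1.0, 0.0), ("b", -1.0, 0.0)).toDF("key", "relev", "relev2")

val groupedBy_DF = df.groupBy("key")
  .agg(collect_list("relev"), collect_list("relev2"))

groupedBy_DF.printSchema()
// collect_list(relev) and collect_list(relev2) come back as array<double>,
// which is why the rows render as WrappedArray(...)
```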
Upvotes: 2
Views: 3955
Reputation: 1
Try split (note that split expects a string column, so the array column has to be in string form first):
import org.apache.spark.sql.functions._
val final_df = groupedBy_DF.select(
  groupedBy_DF("collect_list(relev)").as("rel"),
  groupedBy_DF("collect_list(relev2)").as("rel2"))
  .withColumn("rel", split(col("rel"), ","))  // split takes a Column, not a String
Upvotes: 0
Reputation: 28322
In the case where collect_list will always return only one value, use first instead. Then there is no need to handle an Array at all. Note that this should be done during the groupBy step.
val spark = SparkSession.builder.getOrCreate()
import spark.implicits._
import org.apache.spark.sql.functions.first
val final_df = df.groupBy(...)
.agg(first($"relev").as("rel"),
first($"relev2").as("rel2"))
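A runnable sketch of this approach, filling the elided groupBy with a hypothetical key column and made-up data:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.first

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical data and grouping key
val df = Seq(("a", -1.0, 0.0), ("b", -1.0, 0.0)).toDF("key", "relev", "relev2")

val final_df = df.groupBy("key")
  .agg(first($"relev").as("rel"), first($"relev2").as("rel2"))

final_df.printSchema()           // rel and rel2 are now plain doubles
final_df.orderBy($"rel").show()  // so orderBy works directly
```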
Upvotes: 3
Reputation: 1953
Try col(x).getItem:
import org.apache.spark.sql.functions.col

groupedBy_DF.select(
  groupedBy_DF("collect_list(relev)").as("rel"),
  groupedBy_DF("collect_list(relev2)").as("rel2")
).withColumn("rel_0", col("rel").getItem(0))
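Assuming the select above is bound to a name such as final_df and each array has at least one element, both columns can be flattened the same way:

```scala
import org.apache.spark.sql.functions.col

// getItem(0) pulls the first element out of each array as a plain double;
// it yields null if an array is empty.
val flat_df = final_df
  .withColumn("rel_0", col("rel").getItem(0))
  .withColumn("rel2_0", col("rel2").getItem(0))
```

The flattened rel_0/rel2_0 columns can then be used in an orderBy.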
Upvotes: 1