dedpo

Reputation: 502

How to get first value of WrappedArray in Spark?

I grouped by a few columns and am getting a WrappedArray out of these cols, as you can see in the schema. How do I get rid of the arrays so I can proceed to the next step and do an orderBy?

val sqlDF = spark.sql(
  "SELECT * FROM parquet.`parquet/20171009121227/rels/*.parquet`")

After the groupBy, I select from groupedBy_DF to get a DataFrame:

val final_df = groupedBy_DF.select(
  groupedBy_DF("collect_list(relev)").as("rel"),
  groupedBy_DF("collect_list(relev2)").as("rel2"))

Then final_df.printSchema gives:

|-- rel: array (nullable = true)
|    |-- element: double (containsNull = true)
|-- rel2: array (nullable = true)
|    |-- element: double (containsNull = true)

Sample current output (shown as a screenshot in the original post): rows of WrappedArray values in the rel and rel2 columns.

I am trying to convert to this:

 |-- rel: double (nullable = true)
 |-- rel2: double (nullable = true)

Desired example output (from the picture above):

-1.0,0.0
-1.0,0.0

Upvotes: 2

Views: 3955

Answers (3)

mochapuff

Reputation: 1

Try split:

import org.apache.spark.sql.functions._

val final_df = groupedBy_DF.select(
    groupedBy_DF("collect_list(relev)").as("rel"),
    groupedBy_DF("collect_list(relev2)").as("rel2"))
  .withColumn("rel", split(col("rel"), ","))

Upvotes: 0

Shaido

Reputation: 28322

In the case where collect_list will always only return one value, use first instead. Then there is no need to handle the issue of having an Array. Note that this should be done during the groupBy step.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

val final_df = df.groupBy(...)
  .agg(first($"relev").as("rel"), 
       first($"relev2").as("rel2"))
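
Since the question's end goal is an orderBy, the aggregated frame can then be sorted directly, because rel and rel2 are plain doubles at this point. A minimal sketch (column names are taken from the question; the sort direction is an assumption):

// rel and rel2 are scalar columns after the agg, so orderBy works as-is
val sorted_df = final_df.orderBy($"rel".asc, $"rel2".asc)
sorted_df.show()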

Upvotes: 3

ayplam

Reputation: 1953

Try col(x).getItem:

import org.apache.spark.sql.functions.col

groupedBy_DF.select(
    groupedBy_DF("collect_list(relev)").as("rel"),
    groupedBy_DF("collect_list(relev2)").as("rel2")
).withColumn("rel_0", col("rel").getItem(0))
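
A possible follow-up sketch applying the same getItem(0) idea to both array columns and then sorting, to match the desired schema in the question (names follow the question; final_df is assumed to be the frame built above):

import org.apache.spark.sql.functions.col

// Replace each array column with its first element, then order the rows
val flat_df = final_df
  .withColumn("rel", col("rel").getItem(0))
  .withColumn("rel2", col("rel2").getItem(0))
  .orderBy("rel")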

Upvotes: 1
