Reputation: 39
I have the following DataFrame in Spark (I'm using Scala):
[[1003014, 0.95266926], [15, 0.9484202], [754, 0.94236785], [1029530, 0.880922], [3066, 0.7085166], [1066440, 0.69400793], [1045811, 0.663178], [1020059, 0.6274495], [1233982, 0.6112905], [1007801, 0.60937023], [1239278, 0.60044676], [1000088, 0.5789191], [1056268, 0.5747936], [1307569, 0.5676605], [10334513, 0.56592846], [930, 0.5446228], [1170206, 0.52525467], [300, 0.52473146], [2105178, 0.4972785], [1088572, 0.4815367]]
I want to get a DataFrame containing only the first Int of each sub-array, something like:
[1003014, 15, 754, 1029530, 3066, 1066440, ...]
i.e. keeping only the x[0] of each sub-array x of the array listed above.
I'm new to Scala and couldn't find the right anonymous map function. Thanks in advance for any help.
Upvotes: 2
Views: 27458
Reputation: 32660
For Spark >= 2.4, you can use the higher-order function transform with a lambda function to extract the first element of each sub-array.
scala> df.show(false)
+----------------------------------------------------------------------------------------+
|arrays |
+----------------------------------------------------------------------------------------+
|[[1003014.0, 0.95266926], [15.0, 0.9484202], [754.0, 0.94236785], [1029530.0, 0.880922]]|
+----------------------------------------------------------------------------------------+
scala> df.select(expr("transform(arrays, x -> x[0])").alias("first_array_elements")).show(false)
+-----------------------------------+
|first_array_elements |
+-----------------------------------+
|[1003014.0, 15.0, 754.0, 1029530.0]|
+-----------------------------------+
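If you are on Spark 3.0 or later, transform is also exposed as a typed Scala function in org.apache.spark.sql.functions, so you can pass a regular Scala lambda instead of a SQL expression string. A minimal sketch, assuming the same arrays column as above (this variant is not available in 2.4):

import org.apache.spark.sql.functions.{col, transform}

// Scala lambda instead of a SQL string; x is a Column, x(0) takes the first element.
df.select(transform(col("arrays"), x => x(0)).alias("first_array_elements"))
  .show(false)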
Spark < 2.4
Explode the initial array and then aggregate with collect_list
to collect the first element of each sub-array:
df.withColumn("exploded_array", explode(col("arrays")))
  .agg(collect_list(col("exploded_array")(0)))
  .show(false)
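Note that without a grouping key this collapses the whole DataFrame into a single aggregated row. If your DataFrame has several rows, group by a row identifier first. A minimal sketch, assuming a hypothetical id column that identifies each original row:

import org.apache.spark.sql.functions.{col, collect_list, explode}

// "id" is a hypothetical per-row key, not a column from the original question.
df.withColumn("exploded_array", explode(col("arrays")))
  .groupBy(col("id"))
  .agg(collect_list(col("exploded_array")(0)).alias("first_array_elements"))
  .show(false)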
EDIT:
In case the array contains structs rather than sub-arrays, just change the access method and use dot notation for the struct fields:
val transform_expr = "transform(arrays, x -> x.canonical_id)"
df.select(expr(transform_expr).alias("first_array_elements")).show(false)
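For illustration, here is a minimal sketch of that case on a small struct-typed column; the field names canonical_id and score are assumptions, not taken from the question's data:

import org.apache.spark.sql.functions.{array, expr, lit, struct}

// Build a tiny DataFrame whose "arrays" column holds structs instead of sub-arrays.
val structDf = spark.range(1).select(
  array(
    struct(lit(1003014L).as("canonical_id"), lit(0.95266926).as("score")),
    struct(lit(15L).as("canonical_id"), lit(0.9484202).as("score"))
  ).as("arrays")
)

// Dot notation pulls the struct field out of every element: [1003014, 15]
structDf.select(expr("transform(arrays, x -> x.canonical_id)").alias("first_array_elements"))
  .show(false)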
Upvotes: 5
Reputation: 27373
Using Spark 2.4:
val df = Seq(
  Seq(Seq(1.0, 2.0), Seq(3.0, 4.0))
).toDF("arrs")
df.show()
+--------------------+
| arrs|
+--------------------+
|[[1.0, 2.0], [3.0...|
+--------------------+
df
  .select(expr("transform(arrs, x -> x[0])").as("arr_first"))
  .show()
+----------+
| arr_first|
+----------+
|[1.0, 3.0]|
+----------+
Upvotes: 2