Reputation: 3782
I have the following DataFrame in Spark 2.2.0 and Scala 2.11.8.
+----+-------------------------------+
|item|                    other_items|
+----+-------------------------------+
| 111|[[444,1.0],[333,0.5],[666,0.4]]|
| 222|          [[444,1.0],[333,0.5]]|
| 333|                             []|
| 444|[[111,2.0],[555,0.5],[777,0.2]]|
+----+-------------------------------+
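For reference, an equivalent DataFrame can be built like this (a sketch assuming a spark-shell session, where the SparkSession implicits are already in scope):
val df = Seq(
  ("111", Seq(("444", 1.0), ("333", 0.5), ("666", 0.4))),
  ("222", Seq(("444", 1.0), ("333", 0.5))),
  ("333", Seq.empty[(String, Double)]),
  ("444", Seq(("111", 2.0), ("555", 0.5), ("777", 0.2)))
).toDF("item", "other_items")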
I want to get the following DataFrame:
+----+-----------+
|item|other_items|
+----+-----------+
| 111|        444|
| 222|        444|
| 444|        111|
+----+-----------+
So, basically, I need to extract the first item from other_items for each row. Also, I need to ignore the rows that have an empty array [] in other_items.
How can I do it?
I tried this approach, but it does not give me the expected result; it returns the whole struct (e.g. [444,1.0]) rather than just the first field.
val result = df.withColumn("other_items", $"other_items"(0))
printSchema gives the following output:
|-- item: string (nullable = true)
|-- other_items: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: string (nullable = true)
| | |-- _2: double (nullable = true)
Upvotes: 4
Views: 413
Reputation: 11
Like this:
import spark.implicits._ // assuming a SparkSession named spark; pre-imported in spark-shell

val df = Seq(
  ("111", Seq(("111", 1.0), ("333", 0.5), ("666", 0.4))),
  ("333", Seq())
).toDF("item", "other_items")

df.select($"item", $"other_items"(0)("_1").alias("other_items"))
  .na.drop(Seq("other_items"))
  .show
Here the first apply ($"other_items"(0)) selects the first element of the array, the second apply (_("_1")) selects the _1 field, and na.drop removes the nulls introduced by the empty array.
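Spelled out step by step, the same chain could be written like this (the intermediate val names are just for illustration):
// Column expressions can be built before they are used in select:
val firstStruct = $"other_items"(0) // first element of the array column; null for an empty array
val firstId = firstStruct("_1")     // the _1 field of that struct; null if the struct is null

df.select($"item", firstId.alias("other_items"))
  .na.drop(Seq("other_items"))      // drops the rows where the array was empty
  .show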
+----+-----------+
|item|other_items|
+----+-----------+
| 111| 111|
+----+-----------+
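If you would rather filter out the empty-array rows up front instead of dropping nulls afterwards, an equivalent sketch uses size from org.apache.spark.sql.functions:
import org.apache.spark.sql.functions.size

// Keep only rows whose array has at least one element,
// then pull the _1 field out of the first struct.
df.filter(size($"other_items") > 0)
  .select($"item", $"other_items"(0)("_1").alias("other_items"))
  .show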
Upvotes: 1