Srinivas
Srinivas

Reputation: 2098

Converting a dataframe to an array of struct of column names and values

Suppose I have a dataframe like this

val customer = Seq(
    ("C1", "Jackie Chan", 50, "Dayton", "M"),
    ("C2", "Harry Smith", 30, "Beavercreek", "M"),
    ("C3", "Ellen Smith", 28, "Beavercreek", "F"),
    ("C4", "John Chan", 26, "Dayton","M")
  ).toDF("cid","name","age","city","sex")

How can i get cid values in one column and get the rest of the values in an array < struct < column_name, column_value > > in spark

Upvotes: 0

Views: 1128

Answers (3)

Topde
Topde

Reputation: 581

Map columns might be a better way to deal with the overall problem. You can keep different value types in the same map, without having to cast it to string.

df.select('cid',
    create_map(lit("name"), col("name"), lit("age"), col("age"),
               lit("city"), col("city"), lit("sex"),col("sex")
               ).alias('map_col')
  )

or wrap the map col in an array if you want it

This way you can still do numerical or string transformations on the relevant key or value. For example:

df.select('cid',
    create_map(lit("name"), col("name"), lit("age"), col("age"),
               lit("city"), col("city"), lit("sex"),col("sex")
               ).alias('map_col')
  )
df.select('*', 
      map_concat( col('cid'), create_map(lit('u_age'),when(col('map_col')['age'] < 18, True)))
)

Hope that makes sense, typed this straight in here so forgive if there's a bracket missing somewhere

Upvotes: 1

Oli
Oli

Reputation: 10406

The only difficulty is that arrays must contain elements of the same type. Therefore, you need to cast all the columns to strings before putting them in an array (age is an int in your case). Here is how it goes:

val cols = customer.columns.tail
val result = customer.select('cid,
    array(cols.map(c => struct(lit(c) as "name", col(c) cast "string" as "value")) : _*) as "array")

result.show(false)

+---+-----------------------------------------------------------+
|cid|array                                                      |
+---+-----------------------------------------------------------+
|C1 |[[name,Jackie Chan], [age,50], [city,Dayton], [sex,M]]     |
|C2 |[[name,Harry Smith], [age,30], [city,Beavercreek], [sex,M]]|
|C3 |[[name,Ellen Smith], [age,28], [city,Beavercreek], [sex,F]]|
|C4 |[[name,John Chan], [age,26], [city,Dayton], [sex,M]]       |
+---+-----------------------------------------------------------+

result.printSchema()

root
 |-- cid: string (nullable = true)
 |-- array: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- name: string (nullable = false)
 |    |    |-- value: string (nullable = true)

Upvotes: 5

Michel Lemay
Michel Lemay

Reputation: 2094

You can do it using array and struct functions:

customer.select($"cid", array(struct(lit("name") as "column_name", $"name" as "column_value"), struct(lit("age") as "column_name", $"age" as "column_value") ))

will make:

 |-- cid: string (nullable = true)
 |-- array(named_struct(column_name, name AS `column_name`, NamePlaceholder(), name AS `column_value`), named_struct(column_name, age AS `column_name`, NamePlaceholder(), age AS `column_value`)): array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- column_name: string (nullable = false)
 |    |    |-- column_value: string (nullable = true)

Upvotes: 2

Related Questions