Reputation: 2098
Suppose I have a dataframe like this
val customer = Seq(
("C1", "Jackie Chan", 50, "Dayton", "M"),
("C2", "Harry Smith", 30, "Beavercreek", "M"),
("C3", "Ellen Smith", 28, "Beavercreek", "F"),
("C4", "John Chan", 26, "Dayton","M")
).toDF("cid","name","age","city","sex")
How can i get cid values in one column and get the rest of the values in an array < struct < column_name, column_value > >
in spark
Upvotes: 0
Views: 1128
Reputation: 581
Map columns might be a better way to deal with the overall problem. You can keep different value types in the same map, without having to cast it to string.
df.select('cid',
create_map(lit("name"), col("name"), lit("age"), col("age"),
lit("city"), col("city"), lit("sex"),col("sex")
).alias('map_col')
)
or wrap the map col in an array if you want it
This way you can still do numerical or string transformations on the relevant key or value. For example:
df.select('cid',
create_map(lit("name"), col("name"), lit("age"), col("age"),
lit("city"), col("city"), lit("sex"),col("sex")
).alias('map_col')
)
df.select('*',
map_concat( col('cid'), create_map(lit('u_age'),when(col('map_col')['age'] < 18, True)))
)
Hope that makes sense, typed this straight in here so forgive if there's a bracket missing somewhere
Upvotes: 1
Reputation: 10406
The only difficulty is that arrays must contain elements of the same type. Therefore, you need to cast all the columns to strings before putting them in an array (age
is an int in your case). Here is how it goes:
val cols = customer.columns.tail
val result = customer.select('cid,
array(cols.map(c => struct(lit(c) as "name", col(c) cast "string" as "value")) : _*) as "array")
result.show(false)
+---+-----------------------------------------------------------+
|cid|array |
+---+-----------------------------------------------------------+
|C1 |[[name,Jackie Chan], [age,50], [city,Dayton], [sex,M]] |
|C2 |[[name,Harry Smith], [age,30], [city,Beavercreek], [sex,M]]|
|C3 |[[name,Ellen Smith], [age,28], [city,Beavercreek], [sex,F]]|
|C4 |[[name,John Chan], [age,26], [city,Dayton], [sex,M]] |
+---+-----------------------------------------------------------+
result.printSchema()
root
|-- cid: string (nullable = true)
|-- array: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- name: string (nullable = false)
| | |-- value: string (nullable = true)
Upvotes: 5
Reputation: 2094
You can do it using array and struct functions:
customer.select($"cid", array(struct(lit("name") as "column_name", $"name" as "column_value"), struct(lit("age") as "column_name", $"age" as "column_value") ))
will make:
|-- cid: string (nullable = true)
|-- array(named_struct(column_name, name AS `column_name`, NamePlaceholder(), name AS `column_value`), named_struct(column_name, age AS `column_name`, NamePlaceholder(), age AS `column_value`)): array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- column_name: string (nullable = false)
| | |-- column_value: string (nullable = true)
Upvotes: 2