Reputation: 105
I'm quite new to PySpark, and I have a dataframe that currently looks like this:
+---------------------------------+-------------------+
| col1                            | col2              |
+---------------------------------+-------------------+
| [(a, 0)], [(b,0)].....[(z,1)]   | [0, 0, ... 1]     |
| [(b, 0)], [(b,1)].....[(z,0)]   | [0, 1, ... 0]     |
| [(a, 0)], [(c, 1)].....[(z,0)]  | [0, 1, ... 0]     |
+---------------------------------+-------------------+
I extracted the values of col1.QueryNum into col2, and when I print the schema, col2 is an array containing the numbers from col1.QueryNum.
Ultimately, my goal is to convert the list values in col2 into a struct format in PySpark (refer to the desired schema).
Current Schema
|-- col1: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- types: string (nullable = true)
| | |-- QueryNum: integer (nullable = true)
|-- col2: array (nullable = true)
| |-- element: integer (containsNull = true)
Desired Schema
|-- col2: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- val1: integer (nullable = true)
| | |-- val2: integer (nullable = true)
...
| | |-- val80: integer (nullable = true)
I tried using from_json, but it's not working.
Upvotes: 4
Views: 5961
Reputation: 32660
If you have a fixed array size, you can create the struct using a list comprehension:
from pyspark.sql import functions as F

df1 = df.withColumn(
    "col2",
    F.array(
        F.struct(*[
            F.col("col1")[i]["QueryNum"].alias(f"val{i+1}") for i in range(2)
        ])
    )
)

df1.show()
#+----------------+--------+
#|col1 |col2 |
#+----------------+--------+
#|[[0, a], [0, b]]|[[0, 0]]|
#|[[0, b], [1, b]]|[[0, 1]]|
#|[[0, a], [1, c]]|[[0, 1]]|
#+----------------+--------+
df1.printSchema()
#root
#|-- col1: array (nullable = true)
#| |-- element: struct (containsNull = true)
#| | |-- QueryNum: long (nullable = true)
#| | |-- types: string (nullable = true)
#|-- col2: array (nullable = false)
#| |-- element: struct (containsNull = false)
#| | |-- val1: long (nullable = true)
#| | |-- val2: long (nullable = true)
Note, however, that there is no need to use an array in this case, since you'll always have a single struct in that array. Just use a simple struct:
df1 = df.withColumn(
    "col2",
    F.struct(*[
        F.col("col1")[i]["QueryNum"].alias(f"val{i+1}") for i in range(2)
    ])
)
Or if you prefer a map type:
df1 = df.withColumn(
    "col2",
    F.map_from_entries(
        F.expr("transform(col1, (x, i) -> struct('val' || (i+1) as name, x.QueryNum as value))")
    )
)
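To see what the transform + map_from_entries expression computes per row, here's a plain-Python sketch (not Spark code; query_num_map is a hypothetical helper name, and it assumes each col1 element is a dict with a QueryNum key):

```python
# Plain-Python sketch of the transform + map_from_entries logic for one row.
def query_num_map(col1_row):
    # transform(col1, (x, i) -> struct('val' || (i+1), x.QueryNum)) builds
    # (name, value) pairs; map_from_entries then turns them into a map.
    return {f"val{i + 1}": x["QueryNum"] for i, x in enumerate(col1_row)}

row = [{"types": "a", "QueryNum": 0}, {"types": "b", "QueryNum": 1}]
print(query_num_map(row))  # {'val1': 0, 'val2': 1}
```

Unlike the struct version, this scales to any array length per row, since the map keys are generated from the element index.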
Upvotes: 4