Ashley O
Ashley O

Reputation: 1190

How to create column as array from other columns (got stuck with incompatible types)?

Say I have some subscription data that looks like this:

user_id      subscription_id       expires_at
0238423            113              12/1/18
0238423            938              11/1/18
0238423            901              10/1/18

Now I want to create a new column that is a callable array of the subscription_id and expires_at columns:

user_id          subscription_id    expires_at         Array_Col
0238423            113              12/1/18          [113, 12/1/18]
0238423            938              11/1/18          [938, 11/1/18]
0238423            901              10/1/18          [901, 10/1/18]

The problem is I'm not able to get the struct type correct. I've cast them as strings, but then I can't iterate over it properly. I also want to eventually collect_set on this column to produce an array of arrays per each user_id. What is wrong with my struct?

Here is my code:

def create_struct(subscription_id, expires_at):
    x = [subscription_id, expires_at]
    return x  

create_struct = udf(create_struct, ArrayType(StructType([
    StructField("sub_id", StringType(), False),
    StructField("expiration", TimestampType(), True)])))

df = df.withColumn('expiration_dict', create_struct(df.subscription_id, df.expires_at))

Upvotes: 1

Views: 43

Answers (2)

Mohd Avais
Mohd Avais

Reputation: 247

Alternatively, if you want array of columns, use array() from spark.sql.functions

df.withColumn("Array_Col", array($"subscription_id", $"expires_at"))

Upvotes: 0

akuiper
akuiper

Reputation: 214927

I think you need a struct type column instead of array type. It's not a good idea to put two columns of different types into an array. To create a struct type column, simply call the struct function:

from pyspark.sql.functions import struct
df.withColumn('Struct_Col', struct(df.subscription_id, df.expires_at)).show()

#+-------+---------------+----------+--------------+
#|user_id|subscription_id|expires_at|    Struct_Col|
#+-------+---------------+----------+--------------+
#| 238423|            113|   12/1/18|[113, 12/1/18]|
#| 238423|            938|   11/1/18|[938, 11/1/18]|
#| 238423|            901|   10/1/18|[901, 10/1/18]|
#+-------+---------------+----------+--------------+

Upvotes: 3

Related Questions