chaitra k

Reputation: 421

Concatenate two string column values to array of integer using pyspark

I have a dataframe with different event types (10, 11, 12), and the data varies based on the event type. My requirement is: when eventtype=11, I need to concatenate the data of two columns (sensor, field1) into an array ([1,0]) in the sensor column and shift the remaining data one column to the left. I've given my desired output below. Is there a way I can achieve this using PySpark?

I was able to concatenate using the Spark SQL concat() function, but I'm looking for more generic code that would concatenate the two columns, shift the remaining column values one to the left, and assign the last column the value NULL.

eventtype  sensor  field1  field2  eventcategory
11         1       0       SE_500  500


Expected output:

eventtype  sensor  field1  field2  eventcategory
11         [1,0]   SE_500  500     NULL

Upvotes: 1

Views: 615

Answers (1)

mck

Reputation: 42422

You can use when to construct an appropriate selection. Note that a column can only have one type, so when eventtype != 11 the sensor column also has to be converted into an array (of one element).

import pyspark.sql.functions as F

df2 = df.select(
    'eventtype',
    # when eventtype == 11, combine sensor and field1 into one array;
    # otherwise wrap sensor in a single-element array so the column type matches
    F.when(
        F.col('eventtype') == 11,
        F.array('sensor', 'field1')
    ).otherwise(F.array('sensor')).alias('sensor'),
    # for each remaining column, shift its value one column to the left
    # when eventtype == 11 (the last column gets NULL);
    # otherwise keep the original value
    *[
        F.when(
            F.col('eventtype') == 11,
            F.col(df.columns[i+3]) if i+3 < len(df.columns) else F.lit(None)
        ).otherwise(F.col(c)).alias(c)
        for (i, c) in enumerate(df.columns[2:])
    ]
)

df2.show()
+---------+------+------+------+-------------+
|eventtype|sensor|field1|field2|eventcategory|
+---------+------+------+------+-------------+
|       11|[1, 0]|SE_500|   500|         null|
|       10|   [1]|     0|SE_500|          500|
+---------+------+------+------+-------------+
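The index arithmetic in the list comprehension amounts to a plain left shift: each column from field1 onward takes the value of the column after it, and the last column becomes NULL. As a pure-Python sketch of that logic (shift_left is a hypothetical helper for illustration, not part of the PySpark code):

```python
def shift_left(row, cols, start=2):
    # Mirrors the list comprehension above: each column from index
    # `start` onward takes the value of the next column; the last
    # column, having no successor, becomes None.
    shifted = dict(row)
    for i, c in enumerate(cols[start:], start=start):
        shifted[c] = row[cols[i + 1]] if i + 1 < len(cols) else None
    return shifted

cols = ["eventtype", "sensor", "field1", "field2", "eventcategory"]
row = {"eventtype": 11, "sensor": "1", "field1": "0",
       "field2": "SE_500", "eventcategory": "500"}
print(shift_left(row, cols))
# field1 -> 'SE_500', field2 -> '500', eventcategory -> None
```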

Upvotes: 1
