joao pedro imamura
joao pedro imamura

Reputation: 23

Concat two JSON string columns with PySpark

Suppose I have two columns (StringType):

JSON 1 JSON 2
{"key1":"value1"} {"key2":"value2"}
{"key3":"value3"} {"key4":"value4"}

I need to concat these columns to have a LIST of JSON in the third column (also string), something like this:

JSON 1 JSON 2 JSON 3
{"key1":"value1"} {"key2":"value2"} [{"key1":"value1"},{"key2":"value2"}]
{"key3":"value3"} {"key4":"value4"} [{"key3":"value3"},{"key4":"value4"}]

Actually I solved the problem, but I don't think it's the best approach:

from pyspark.sql import DataFrame
from pyspark.sql.functions import col, concat, concat_ws, lit

def concat_jsons(df: DataFrame, columns: list):
    df = df.withColumn(
        'JSON 3', concat_ws(',', *columns)
    )

    return df.withColumn(
        'JSON 3',
        concat(lit('['), col('JSON 3'), lit(']'))
    )

Anyone has a better idea?

Upvotes: 2

Views: 106

Answers (1)

Sreejith k
Sreejith k

Reputation: 56

data =[["""{"key1":"value1"}""", """{"key2":"value2"}"""], ["""{"key3":"value3"}""","""{"key4":"value4"}"""] ]

df = spark.createDataFrame(data).toDF('JSON 1','JSON 2')

df.show()
+-----------------+-----------------+
|           JSON 1|           JSON 2|
+-----------------------------------+
|{"key1":"value1"}|{"key2":"value2"}|
|{"key3":"value3"}|{"key4":"value4"}|
+-----------------+-----------------+

def concat_jsons(df: DataFrame, columns: list):
...   return df.withColumn('JSON 3',array(*columns))

concat_jsons(df, ['JSON 1','JSON 2']).show()

+-----------------+-----------------+--------------------+
|           JSON 1|           JSON 2|              JSON 3|
+-----------------+-----------------+--------------------+
|{"key1":"value1"}|{"key2":"value2"}|[{"key1":"value1"...|
|{"key3":"value3"}|{"key4":"value4"}|[{"key3":"value3"...|
+-----------------+-----------------+--------------------+

if you want "JSON 3" to be for StringType, cast it to string

def concat_jsons(df: DataFrame, columns: list):
...   return df.withColumn('JSON 3',array(*columns).cast('string'))

Upvotes: 2

Related Questions