I have a DataFrame in PySpark with a string column whose value is [{"AppId":"APACON","ExtId":"141730"}]
(the string is exactly like that in my column; it is a string, not an array).
I want to convert this to an array of structs.
Can I do that with native Spark functions alone, or do I have to parse the string or use a UDF?
sqlContext.createDataFrame(
    [(1, '[{"AppId":"APACON","ExtId":"141730"}]'),
     (2, '[{"AppId":"APACON","ExtId":"141793"}]'),
    ],
    ['idx', 'txt']
).show()
+---+--------------------+
|idx| txt|
+---+--------------------+
| 1|[{"AppId":"APACON...|
| 2|[{"AppId":"APACON...|
+---+--------------------+
Upvotes: 0
Views: 2682
With Spark 2.1 or above
You have the following data:
import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, StringType

df = sqlContext.createDataFrame(
    [(1, '[{"AppId":"APACON","ExtId":"141730"}]'),
     (2, '[{"AppId":"APACON","ExtId":"141793"}]'),
    ],
    ['idx', 'txt']
)
you can indeed use pyspark.sql.functions.from_json as follows:
schema = StructType([StructField("AppId", StringType()),
                     StructField("ExtId", StringType())])

# parse the JSON string against the struct schema
df = df.withColumn('array', F.from_json(F.col('txt'), schema))
df.show()
+---+--------------------+---------------+
|idx| txt| array|
+---+--------------------+---------------+
| 1|[{"AppId":"APACON...|[APACON,141730]|
| 2|[{"AppId":"APACON...|[APACON,141793]|
+---+--------------------+---------------+
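Note that despite the column name, the result above is a single struct, not an array. On Spark 2.2 or later, from_json also accepts an ArrayType schema, which yields a true array of structs that you can explode. A minimal sketch, reusing the schema defined above (the parsed and item names are just illustrative):
from pyspark.sql.types import ArrayType

# wrap the struct schema in ArrayType so the whole JSON array is parsed
array_schema = ArrayType(schema)

# one output row per array element, with the struct fields flattened
(df.withColumn('parsed', F.from_json(F.col('txt'), array_schema))
   .select('idx', F.explode('parsed').alias('item'))
   .select('idx', 'item.AppId', 'item.ExtId')
   .show())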
With Spark < 2.1
One way to work around the issue is to first slightly modify your input string:
# Use regexp_extract to strip the square brackets
df = df.withColumn('txt_parsed', F.regexp_extract(F.col('txt'), '[^\\[\\]]+', 0))
df.show(truncate=False)
+---+-------------------------------------+-----------------------------------+
|idx|txt |txt_parsed |
+---+-------------------------------------+-----------------------------------+
|1 |[{"AppId":"APACON","ExtId":"141730"}]|{"AppId":"APACON","ExtId":"141730"}|
|2 |[{"AppId":"APACON","ExtId":"141793"}]|{"AppId":"APACON","ExtId":"141793"}|
+---+-------------------------------------+-----------------------------------+
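Equivalently, you could delete just the leading and trailing brackets with regexp_replace; a sketch of the same idea:
# alternative: strip the opening '[' and closing ']' instead of extracting the body
df = df.withColumn('txt_parsed', F.regexp_replace(F.col('txt'), '^\\[|\\]$', ''))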
Then you could use pyspark.sql.functions.get_json_object to parse the txt_parsed column:
df = df.withColumn('AppId', F.get_json_object(df.txt_parsed, '$.AppId'))
df = df.withColumn('ExtId', F.get_json_object(df.txt_parsed, '$.ExtId'))
df.select('idx', 'txt_parsed', 'AppId', 'ExtId').show()
+---+--------------------+------+------+
|idx|          txt_parsed| AppId| ExtId|
+---+--------------------+------+------+
|  1|{"AppId":"APACON"...|APACON|141730|
|  2|{"AppId":"APACON"...|APACON|141793|
+---+--------------------+------+------+
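If you ultimately need the array-of-structs shape from the question, you can reassemble it from the extracted columns; a sketch (the parsed name is just illustrative):
# rebuild a single-element array of structs from the extracted fields
df = df.withColumn('parsed', F.array(F.struct(F.col('AppId'), F.col('ExtId'))))
df.select('idx', 'parsed').show(truncate=False)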
Upvotes: 1