practicalGuy

Reputation: 1328

TypeError: Column is not iterable

s = ["abcd:{'name':'john'}","defasdf:{'num':123}"]
df = spark.createDataFrame(s, "string").toDF("request")
df.show()
+--------------------+
|             request|
+--------------------+
|abcd:{'name':'john'}|
| defasdf:{'num':123}|
+--------------------+

I would like to get the following:

+--------------------+---------------+
|             request|            sub|
+--------------------+---------------+
|abcd:{'name':'john'}|{'name':'john'}|
| defasdf:{'num':123}|    {'num':123}|
+--------------------+---------------+

I wrote the code below, but it throws an error:

TypeError: Column is not iterable

df = df.withColumn("sub",substring(col('request'),locate('{',col('request')),length(col('request'))-locate('{',col('request'))))
df.show()

Can someone please help me?

Upvotes: 1

Views: 4928

Answers (1)

blackbishop

Reputation: 32660

You need to use the substring function inside a SQL expression in order to pass columns for the position and length arguments. Note also that you need to add 1 to the length to get the correct result:

import pyspark.sql.functions as F

df = df.withColumn(
    "json",
    F.expr("substring(request, locate('{',request), length(request) - locate('{', request) + 1)")
)

df.show()
#+--------------------+---------------+
#|             request|           json|
#+--------------------+---------------+
#|abcd:{'name':'john'}|{'name':'john'}|
#| defasdf:{'num':123}|    {'num':123}|
#+--------------------+---------------+
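For context on why the original version fails: in these PySpark versions, F.substring only accepts plain Python integers for its pos and len arguments, so passing Column values raises the TypeError, while F.expr hands the whole call to Spark's SQL parser, where column arguments are fine. As a sketch, the same expression can also be written with selectExpr (using the column name sub from the question):

df2 = df.selectExpr(
    "request",
    "substring(request, locate('{', request), length(request) - locate('{', request) + 1) AS sub"
)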

You could also consider using the regexp_extract function instead of substring, like this:

df = df.withColumn(
    "json",
    # capture the brace-delimited {...} part that follows the prefix
    F.regexp_extract("request", "^.*:(\\{.*\\})$", 1)
)
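With the sample data this yields the same column; note that regexp_extract returns an empty string for rows where the pattern does not match. If the prefix before the JSON never contains a colon of its own (true for the sample data, an assumption otherwise), a minimal alternative sketch is to split on the first colon only; split's limit argument requires Spark 3.0+:

df = df.withColumn(
    "json",
    # limit=2 splits at the first ':' only, keeping later colons intact
    F.split("request", ":", 2).getItem(1)
)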

Upvotes: 2
