Reputation: 1328
s = ["abcd:{'name':'john'}","defasdf:{'num':123}"]
df = spark.createDataFrame(s, "string").toDF("request")
display(df)
+--------------------+
| request|
+--------------------+
|abcd:{'name':'john'}|
| defasdf:{'num':123}|
+--------------------+
I would like to get this as output:
+--------------------+---------------+
| request| sub|
+--------------------+---------------+
|abcd:{'name':'john'}|{'name':'john'}|
| defasdf:{'num':123}| {'num':123}|
+--------------------+---------------+
I wrote the code below, but it throws this error:
TypeError: Column is not iterable
from pyspark.sql.functions import col, length, locate, substring

df = df.withColumn(
    "sub",
    substring(
        col('request'),
        locate('{', col('request')),
        length(col('request')) - locate('{', col('request'))
    )
)
df.show()
Can someone please help me?
Upvotes: 1
Views: 4928
Reputation: 32660
You need to use the substring
function in a SQL expression (via expr
) in order to pass columns for the position
and length
arguments. Note also that you need to add + 1
to the length to get the correct result:
import pyspark.sql.functions as F
df = df.withColumn(
    "json",
    F.expr("substring(request, locate('{', request), length(request) - locate('{', request) + 1)")
)
df.show()
#+--------------------+---------------+
#| request| json|
#+--------------------+---------------+
#|abcd:{'name':'john'}|{'name':'john'}|
#| defasdf:{'num':123}| {'num':123}|
#+--------------------+---------------+
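The off-by-one can be checked with plain Python string arithmetic, mirroring Spark's 1-based locate and substring semantics (the sample string is taken from the question; the variable names are just illustrative):

```python
# Plain-Python check of the 1-based index arithmetic used above.
# Spark's locate() is 1-based, and substring(str, pos, len) takes the
# start position and the number of characters to keep.
s = "abcd:{'name':'john'}"

pos = s.index('{') + 1      # emulate 1-based locate('{', request)
length = len(s) - pos + 1   # the "+ 1" keeps the final character

# emulate substring(s, pos, length) with 0-based Python slicing
print(s[pos - 1:pos - 1 + length])  # {'name':'john'}

# without the + 1, the closing brace would be dropped
print(s[pos - 1:pos - 1 + (len(s) - pos)])  # {'name':'john'
```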
You could also consider using the regexp_extract
function instead of substring, like this:
df = df.withColumn(
    "json",
    F.regexp_extract("request", "^.*:(\\{.*\\})$", 1)
)
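The greedy ^.* in the pattern backtracks to the last colon that is followed by a {...} block running to the end of the string, which is why the colons inside the JSON don't break it. Spark's regexp_extract uses Java regex, but this particular pattern behaves the same under Python's re, so it can be sanity-checked locally:

```python
import re

# Sanity check of the regexp_extract pattern with Python's re module.
# The greedy ^.* backtracks past the colons inside the JSON body and
# anchors on the last ':' preceding a '{...}' that ends the string.
pattern = r"^.*:(\{.*\})$"

for s in ["abcd:{'name':'john'}", "defasdf:{'num':123}"]:
    m = re.match(pattern, s)
    print(m.group(1) if m else "")
# {'name':'john'}
# {'num':123}
```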
Upvotes: 2