codebot

Reputation: 2646

pyspark: substring a string using dynamic index

filtered_df = filtered_df.withColumn('POINT', substring('POINT', instr(filtered_df.POINT, "#"), 30))

I need to get the first index of the # in the string and then pass that index as the substring starting position as above. What would be the way to do that?

This gives me TypeError: Column is not iterable.

Upvotes: 2

Views: 2411

Answers (1)

ScootCork

Reputation: 3676

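The substring function from pyspark.sql.functions expects a fixed starting position and length (plain integers), which is why passing it a Column fails here. However, your approach will work if you write it as a SQL expression, where the starting position can be computed per row. Both instr and substring use 1-based positions, so the result of instr can be passed straight through.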

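import pyspark.sql.functions as F

# Two sample rows with the '#' at different positions.
# `spark` is assumed to be an existing SparkSession.
d = [{'POINT': 'The quick # brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog'},
    {'POINT': 'The quick brown fox jumps over the lazy dog.# The quick brown fox jumps over the lazy dog.'}]
df = spark.createDataFrame(d)

# Inside expr, instr(POINT, '#') is evaluated per row and used as the start position.
df.withColumn('POINT', F.expr("substring(POINT, instr(POINT, '#'), 30)")).show(2, False)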

+------------------------------+
|POINT                         |
+------------------------------+
|# brown fox jumps over the laz|
|# The quick brown fox jumps ov|
+------------------------------+

Upvotes: 4
