OdiumPura

Reputation: 631

Get a substring from pyspark DF

I have the following DF

name
Shane
Judith
Rick Grimes

I want to generate the following one

name           substr
Shane          hane
Judith         udith
Rick Grimes    ick Grimes

I tried:

F.substring(F.col('name'), 1)
F.substring(F.col('name'), 1, None)
F.substring(F.col('name'), 1, F.length(F.col('name')))

But all of these attempts throw an error.

How can I get my desired output?

Upvotes: 0

Views: 2615

Answers (3)

yogesh

Reputation: 148

The Spark documentation for substring states that the position is 1-based, not 0-based:

The position is not zero based, but 1 based index.

pyspark.sql.functions.substring

df.withColumn('sub_string', F.expr("substring(name, 2, length(name)-1)"))
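
To see why the call above uses position 2 and length(name)-1, here is a minimal plain-Python sketch of the 1-based semantics. The helper name sql_substring is hypothetical (it is not part of PySpark), and it only models positions >= 1; it is meant as an illustration of the indexing, not as Spark's implementation.

```python
def sql_substring(s: str, pos: int, length: int) -> str:
    """Plain-Python model of SQL substring(s, pos, length) for pos >= 1.

    Hypothetical helper for illustration only; not a PySpark function.
    """
    if pos < 1:
        raise ValueError("this sketch only models positive, 1-based positions")
    start = pos - 1  # convert the 1-based position to a 0-based slice index
    # Python slicing stops at the end of the string if the length overshoots
    return s[start:start + max(length, 0)]


names = ["Shane", "Judith", "Rick Grimes"]

# Position 2 skips the first character; len(n) - 1 characters remain after it
result = [sql_substring(n, 2, len(n) - 1) for n in names]
print(result)  # ['hane', 'udith', 'ick Grimes']
```

Position 1 would return the whole string, which is why the attempts starting at 1 would not drop the first character even if they ran.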

Upvotes: 2

Shreyas B

Reputation: 505

You can use expr to get the desired output:

from pyspark.sql.functions import expr
df.withColumn('substr', expr("substring(name, 2, length(name)-1)"))

Upvotes: 2

Emma

Reputation: 9308

F.substring takes plain integers for the position and length, so it only works if you pass integers.

F.substring('name', 2, 5)

# This doesn't work: substring doesn't accept a Column such as F.length('name').
F.substring('name', 2, F.length('name'))

If you would like to pass a dynamic value, you can use either SQL's substring (through F.expr) or Col.substr.

SQL

F.expr('substring(name, 2, length(name))')

Col.substr(startPos, length)

This accepts either a Column (many PySpark functions return a Column, including F.length) or an int. However, startPos and length have to be of the same type. For example, if you need to pass a Column for length, wrap startPos in lit.

F.col('name').substr(F.lit(2), F.length('name'))

# If you pass integers for both:
# F.col('name').substr(2, 5)

Upvotes: 1
