Reputation: 351
I am currently working with PySpark on Databricks and I was looking for a way to truncate a string, just like the Excel RIGHT function does.
For example, for an ID column in a DataFrame I would like to change 8841673_3 into 8841673.
Does anybody know how I should proceed?
Upvotes: 3
Views: 7771
Reputation: 13823
You can use the pyspark.sql.Column.substr method:
import pyspark.sql.functions as F

def left(x, n):
    # substr is 1-based: take the first n characters
    return x.substr(1, n)

def right(x, n):
    # start n characters from the end and take n characters;
    # startPos and length must be the same type, hence F.lit(n)
    x_len = F.length(x)
    return x.substr(x_len - n + 1, F.lit(n))
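For example, applied to the id column from the question (a minimal sketch, assuming a SparkSession named spark and the left/right helpers above; the widths 7 and 1 are illustrative):
df = spark.createDataFrame([("8841673_3",)], ("id",))
df.select(left(F.col("id"), 7)).show()   # first seven characters, like Excel LEFT -> 8841673
df.select(right(F.col("id"), 1)).show()  # last character, like Excel RIGHT -> 3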
Upvotes: 3
Reputation: 35219
Regular expressions with regexp_extract:
from pyspark.sql.functions import regexp_extract
df = spark.createDataFrame([("8841673_3", )], ("id", ))
df.select(regexp_extract("id", r"^(\d+)_.*", 1)).show()
# +--------------------------------+
# |regexp_extract(id, ^(\d+)_.*, 1)|
# +--------------------------------+
# | 8841673|
# +--------------------------------+
or regexp_replace:
from pyspark.sql.functions import regexp_replace
df.select(regexp_replace("id", "_.*$", "")).show()
# +--------------------------+
# |regexp_replace(id, _.*$, )|
# +--------------------------+
# | 8841673|
# +--------------------------+
or just split:
from pyspark.sql.functions import split
df.select(split("id", "_")[0]).show()
# +---------------+
# |split(id, _)[0]|
# +---------------+
# | 8841673|
# +---------------+
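The auto-generated column names above (e.g. regexp_extract(id, ^(\d+)_.*, 1)) are unwieldy; as a small optional sketch, any of these expressions can be renamed with alias (the name "id" here is just illustrative):
df.select(split("id", "_")[0].alias("id")).show()
# same value, but the result column is now named "id" instead of split(id, _)[0]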
Upvotes: 4