Reputation: 5294
Pyspark n00b... How do I replace a column with a substring of itself? I'm trying to remove a set number of characters from the start and end of a string.
from pyspark.sql.functions import substring
import pandas as pd
pdf = pd.DataFrame({'COLUMN_NAME':['_string_','_another string_']})
# this is what i'm looking for...
pdf['COLUMN_NAME_fix']=pdf['COLUMN_NAME'].str[1:-1]
df = sqlContext.createDataFrame(pdf)
# following not working... COLUMN_NAME_fix is blank
df.withColumn('COLUMN_NAME_fix', substring('COLUMN_NAME', 1, -1)).show()
This is pretty close, but slightly different: Spark Dataframe column with last character of other column. And then there is this: LEFT and RIGHT function in PySpark SQL.
Upvotes: 20
Views: 108847
Reputation: 2135
SQL alternative
df.createOrReplaceTempView("your_table")  # register the DataFrame so SQL can reference it
df = spark.sql("""
    SELECT
        COLUMN_NAME,
        LENGTH(COLUMN_NAME) AS length,
        SUBSTRING(COLUMN_NAME, 2, LENGTH(COLUMN_NAME) - 2) AS fixed_in_sql
    FROM
        your_table
""")
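For reference, SQL's SUBSTRING takes a 1-based start position, which is why the query starts at position 2 and subtracts 2 from the length. A small pure-Python sketch (the helper name sql_substring is made up here) mimics what that expression computes:

```python
def sql_substring(s: str, pos: int, length: int) -> str:
    """Mimic SQL SUBSTRING semantics: pos is 1-based."""
    return s[pos - 1:pos - 1 + length]

# trim one character from each end, as in the query above
print(sql_substring('_another string_', 2, len('_another string_') - 2))
# prints: another string
```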
Upvotes: 0
Reputation: 1099
The accepted answer uses a udf (user defined function), which is usually (much) slower than native Spark code. Grant Shannon's answer does use native Spark code, but as noted in the comments by citynorman, it is not 100% clear how this works for variable string lengths.
Answer with native Spark code (no udf) and variable string length
From the documentation of substr in pyspark, we can see that the arguments startPos and length can be either int or Column types (both must be the same type). So we just need to create a column that contains the string length and use that as the argument.
import pyspark.sql.functions as F
result = (
df
.withColumn('length', F.length('COLUMN_NAME'))
.withColumn('fixed_in_spark', F.col('COLUMN_NAME').substr(F.lit(2), F.col('length') - F.lit(2)))
)
# result:
+----------------+---------------+------+--------------+
|     COLUMN_NAME|COLUMN_NAME_fix|length|fixed_in_spark|
+----------------+---------------+------+--------------+
|        _string_|         string|     8|        string|
|_another string_| another string|    16|another string|
+----------------+---------------+------+--------------+
Note: we use F.lit because we cannot add (or subtract) a number to a Column object. We need to first convert that number into a Column.
Upvotes: 13
Reputation: 767
If the goal is to remove '_' from the column names, then I would use a list comprehension instead:
from pyspark.sql.functions import col

df.select(
    [col(c).alias(c.replace('_', '')) for c in df.columns]
)
Upvotes: 0
Reputation: 5870
pyspark.sql.functions.substring(str, pos, len)
Substring starts at pos and is of length len when str is String type, or returns the slice of the byte array that starts at pos (in bytes) and is of length len when str is Binary type.
In your code,
df.withColumn('COLUMN_NAME_fix', substring('COLUMN_NAME', 1, -1))
Here 1 is pos and -1 becomes len; length can't be -1, so it returns null.
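For comparison, the slice in the question's pandas version works because Python's negative indices count back from the end of the string, which Spark's substring does not support:

```python
# Python slicing: a negative end index counts back from the end of the string
s = '_string_'
trimmed = s[1:-1]  # drop the first and last character
print(trimmed)  # prints: string
```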
Try this, (with fixed syntax)
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
udf1 = udf(lambda x: x[1:-1], StringType())
df.withColumn('COLUMN_NAME_fix', udf1('COLUMN_NAME')).show()
Upvotes: 28
Reputation: 5055
try:
df.withColumn('COLUMN_NAME_fix', df['COLUMN_NAME'].substr(1, 10)).show()
where 1 = the start position in the string (1-based) and 10 = the number of characters to include from that start position (inclusive)
Upvotes: 20