Reputation: 348
I am reading data from CSV files that have about 50 columns; a few of the columns (4 to 5) contain text data with non-ASCII and special characters.
df = spark.read.csv(path, header=True, schema=availSchema)
I am trying to remove all the non-ASCII and special characters and keep only English characters, and I tried to do it as below:
df = df['textcolumn'].str.encode('ascii', 'ignore').str.decode('ascii')
There are no spaces in my column name. I receive this error:
TypeError: 'Column' object is not callable
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<command-1486957561378215> in <module>
----> 1 InvFilteredDF = InvFilteredDF['SearchResultDescription'].str.encode('ascii', 'ignore').str.decode('ascii')
TypeError: 'Column' object is not callable
Is there an alternative way to accomplish this? I would appreciate any help with this.
Upvotes: 6
Views: 21341
Reputation: 1441
Both answers are really useful, but I couldn't help noticing that we could just apply udf as a decorator and be more Pythonic:
from pyspark.sql.functions import udf

@udf
def ascii_ignore(x):
    # NULL values arrive in the UDF as None, so guard before encoding
    return x.encode('ascii', 'ignore').decode('ascii') if x else None

df.withColumn("foo", ascii_ignore('words')).limit(5).show()
Upvotes: 2
Reputation: 1618
This answer worked well for me, but it doesn't handle NULL values. I added a small modification:
def ascii_ignore(x):
    if x:
        return x.encode('ascii', 'ignore').decode('ascii')
    else:
        return None
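For context, the guard is needed because Spark passes NULL column values into a Python UDF as None, and calling .encode on None raises an AttributeError. A minimal sketch to see the behaviour, with a made-up df_null DataFrame for illustration:

from pyspark.sql.functions import udf

# Hypothetical DataFrame with a NULL row to exercise the guard
df_null = spark.createDataFrame([(0, "This is Spark"), (1, None)], ["id", "words"])
df_null.withColumn("foo", udf(ascii_ignore)("words")).show()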
Upvotes: 5
Reputation: 2663
This should work.
First, create a temporary example DataFrame:
df = spark.createDataFrame([
    (0, "This is Spark"),
    (1, "I wish Java could use case classes"),
    (2, "Data science is cool"),
    (3, "This is aSA")
], ["id", "words"])
df.show()
Output
+---+--------------------+
| id| words|
+---+--------------------+
| 0| This is Spark|
| 1|I wish Java could...|
| 2|Data science is ...|
| 3| This is aSA|
+---+--------------------+
Now write a UDF, because string functions like these cannot be applied directly to a Column type; that is why you get the 'Column' object is not callable error.
Solution
from pyspark.sql.functions import udf

def ascii_ignore(x):
    return x.encode('ascii', 'ignore').decode('ascii')

ascii_udf = udf(ascii_ignore)

df.withColumn("foo", ascii_udf('words')).show()
Output
+---+--------------------+--------------------+
| id| words| foo|
+---+--------------------+--------------------+
| 0| This is Spark| This is Spark|
| 1|I wish Java could...|I wish Java could...|
| 2|Data science is ...|Data science is ...|
| 3| This is aSA| This is aSA|
+---+--------------------+--------------------+
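As an aside, if you prefer to avoid a Python UDF entirely, the built-in regexp_replace function should achieve the same result on the JVM side; it also returns NULL for NULL input, so no None guard is needed. A sketch (the foo column name is just for the example):

from pyspark.sql.functions import regexp_replace

# Strip every character outside the ASCII range \x00-\x7F
df.withColumn("foo", regexp_replace("words", r'[^\x00-\x7F]', '')).show()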
Upvotes: 11