Reputation: 1599
I would like to remove duplicated words from a column of a PySpark DataFrame.
My attempt is based on the question "Remove duplicates from PySpark array column".
I am using Spark 2.4.5 with Python 3.
My code:
import pyspark.sql.functions as F

test_df = spark.createDataFrame([("I like this Book and this book be DOWNLOADED on line",)], ["text"])
# I have to convert the column to an array because the column in the original large df is of array type.
t3 = test_df.withColumn("text", F.array("text"))
t4 = t3.withColumn('text', F.expr("transform(text, x -> lower(x))"))
t5 = t4.withColumn('text', F.array_distinct("text"))
t5.show(1, 120)
But I got:
+------------------------------------------------------+
|                                                  text|
+------------------------------------------------------+
|[i like this book and this book be downloaded on line]|
+------------------------------------------------------+
I need to remove the duplicated words, i.e. the second "this" and the second "book".
It seems that array_distinct cannot filter them out?
Thanks.
Upvotes: 4
Views: 2350
Reputation: 20445
You can combine the Spark SQL functions lcase, split, array_distinct and array_join inside F.expr.
For example: F.expr("array_join(array_distinct(split(lcase(text), ' ')), ' ')")
Here is working code:
import pyspark.sql.functions as F

# df is a DataFrame with a string column named "text"
(df
 .withColumn("text_new",
             F.expr("array_join(array_distinct(split(lcase(text), ' ')), ' ')"))
 .show(truncate=False))
Explanation:
The expression first converts everything to lower case with lcase(text),
then splits the string on spaces with split(text, ' '),
which produces
[i, like, this, book, and, this, book, be, downloaded, on, line]
This array is then passed to array_distinct, which removes the duplicates
and produces
[i, like, this, book, and, be, downloaded, on, line]
Finally, array_join joins the words back together with spaces:
i like this book and be downloaded on line
Upvotes: 3