How to remove words that have fewer than three letters in PySpark?

Question

I have a 'text' column that stores arrays of tokens. How can I filter these arrays so that only tokens at least three letters long remain?

from pyspark.sql.functions import regexp_replace, col
from pyspark.sql.session import SparkSession

spark = SparkSession.builder.getOrCreate()

columns = ['id', 'text']
vals = [
    (1, ['I', 'am', 'good']),
    (2, ['You', 'are', 'ok']),
]

df = spark.createDataFrame(vals, columns)
df.show()

# I tried this, but it raises TypeError: Column is not iterable
# df_clean = df.select('id', regexp_replace('text', [len(word) >= 3 for word 
# in col('text')], ''))
# df_clean.show()

I expect to see:

id  |  text  
1   |  [good]
2   |  [You, are]

vndywarhol · Accepted Answer

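A UDF can filter each array, keeping only the tokens that are at least three characters long. The snippet is adapted to the question's df and 'text' column:

from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, StringType

filter_length_udf = udf(lambda row: [x for x in row if len(x) >= 3], ArrayType(StringType()))
df_clean = df.withColumn('text', filter_length_udf(col('text')))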
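
As an alternative to the UDF above, the same filtering can probably be done with Spark's built-in higher-order function filter, which generally avoids the cost of passing every row through Python. This is only a sketch, assuming Spark 2.4 or later (where filter is available in Spark SQL) and a simple column name that needs no backtick quoting. The helper names min_length_filter_sql and drop_short_tokens are hypothetical and chosen for illustration.

```python
def min_length_filter_sql(column: str, min_len: int = 3) -> str:
    """Build a Spark SQL expression that keeps array elements
    with at least `min_len` characters.

    Assumes `column` is a plain identifier (no spaces or special characters).
    """
    return f"filter({column}, x -> length(x) >= {min_len})"


def drop_short_tokens(df, column: str = "text", min_len: int = 3):
    """Return `df` with short tokens removed from the array column `column`.

    Hypothetical helper: it overwrites `column` with the filtered array.
    """
    # Imported lazily so the expression builder above works without Spark installed.
    from pyspark.sql.functions import expr

    return df.withColumn(column, expr(min_length_filter_sql(column, min_len)))
```

With the DataFrame from the question, drop_short_tokens(df).show() should leave [good] for id 1 and [You, are] for id 2.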
