Remove rows in a PySpark DataFrame that contain fewer than n words

Question

I have a **PySpark DataFrame** with about 6 million rows. The dataset has the following structure:

root
 |-- text: string (nullable = true)
 |-- score: string (nullable = true)

+--------------------+-----+
|               text |score|
+--------------------+-----+
|word word hello d...|    5|
|hi man how are yo...|    5|
|come on guys let ...|    5|
|do you like some ...|    1|
|accept              |    1|
+--------------------+-----+

Is there a way to remove all rows whose text contains fewer than 4 words, so that only rows with at least 4 words remain?
I did it this way, but it takes a long time:

pandasDF = df.toPandas()
cnt = 0
ind = []
for index, row in pandasDF.iterrows():
  txt = row["text"]
  spl = txt.split()
  if((len(spl)) < 4):
    ind.append(index)
    cnt += 1

pandasDF = pandasDF.drop(labels=ind, axis=0)

Is there a faster way to do this, without converting my PySpark DataFrame into a pandas DataFrame?

werner · Accepted Answer

Each text can be split into words with F.split, and the length of the resulting array can then be counted with F.size:

from pyspark.sql import functions as F

df.filter(F.size(F.split('text', ' ')) >= 4).show()

This statement keeps only the rows that contain at least 4 words.
