Reputation: 57
I have a **Pyspark dataframe** consisting of about 6 Million lines. The dataset has the following structure:
root
|-- content: string (nullable = true)
|-- score: string (nullable = true)
+--------------------+-----+
| text |score|
+--------------------+-----+
|word word hello d...| 5|
|hi man how are yo...| 5|
|come on guys let ...| 5|
|do you like some ...| 1|
|accept | 1|
+--------------------+-----+
Is there a way to remove all lines that contain only sentences of at least 4 words in length? In order to delete all the lines with a few words.
I did it this way, but it takes a long time:
pandasDF = df.toPandas()
cnt = 0
ind = []
for index, row in pandasDF.iterrows():
txt = row["text"]
spl = txt.split()
if((len(spl)) < 4):
ind.append(index)
cnt += 1
pandasDF = pandasDF.drop(labels=ind, axis=0)
Is there a way to do this faster and without turning my Pyspark dataframe into a Pandas data frame?
Upvotes: 1
Views: 717