Mohammad Saifullah

Reputation: 1143

PySpark: filter contents of array inside row

In PySpark, one can filter an RDD of lines using the following code:

lines.filter(lambda line: "some" in line)

But I have read data from a json file and tokenized it. Now it has the following form:

df = [Row(text=u"i have some text", words=[u'I', u'have', u'some', u'text'])]

How can I filter out "some" from the words array?

Upvotes: 2

Views: 2941

Answers (1)

eliasah

Reputation: 40380

You can use array_contains; it has been available since Spark 1.4:

from pyspark.sql import Row
from pyspark.sql import functions as F
df = sqlContext.createDataFrame([Row(text=u"i have some text", words=[u'I', u'have', u'some', u'text'])])

df.withColumn("keep", F.array_contains(df.words,"some")) \
  .filter(F.col("keep")==True).show()
# +----------------+--------------------+----+
# |            text|               words|keep|
# +----------------+--------------------+----+
# |i have some text|[I, have, some, t...|true|
# +----------------+--------------------+----+
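For readers without a Spark cluster at hand, the same membership check can be sketched in plain Python on rows represented as dicts (the second row is made up here for illustration; it is not part of the original question):

```python
# Plain-Python sketch of the filter above: keep only rows whose
# tokenized "words" list contains "some". Not Spark code.
rows = [
    {"text": "i have some text", "words": ["I", "have", "some", "text"]},
    {"text": "nothing matches here", "words": ["nothing", "matches", "here"]},
]

# Equivalent of F.array_contains(df.words, "some") followed by the filter.
kept = [row for row in rows if "some" in row["words"]]
```

Only the first row survives, matching the `show()` output above.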

If you want to remove 'some' from the array itself, as I said in the comment, you can use the StopWordsRemover API:

from pyspark.ml.feature import StopWordsRemover

StopWordsRemover(inputCol="words", stopWords=["some"]).transform(df)
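What StopWordsRemover does to the words column can be sketched in plain Python. Note that StopWordsRemover matches stop words case-insensitively by default (its caseSensitive param is False), which the sketch mimics with lower():

```python
# Plain-Python sketch of StopWordsRemover(stopWords=["some"]) applied to
# the words array from the question. Illustration only, not the Spark API.
stop_words = {"some"}
words = ["I", "have", "some", "text"]

# Drop any token that (case-insensitively) matches a stop word.
filtered = [w for w in words if w.lower() not in stop_words]
```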

Upvotes: 4
