Eli

Reputation: 3

Using Python with Spark DataFrames, how do you filter an array with the value of a column?

I am trying to filter based on the value of one column by searching for it in another column. If I have a value in one column, I want to verify that the value is also present in the array in a different column.

I tried the following:

df = sc.parallelize([('v1', ['v1', 'v2', 'v3']), ('v4', ['v1', 'v2', 'v4'])]).toDF()
df.filter(pyspark.sql.functions.array_contains(df._2, df._1)).show()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Cellar/apache-spark/2.1.1/libexec/python/pyspark/sql/functions.py", line 1648, in array_contains
    return Column(sc._jvm.functions.array_contains(_to_java_column(col), value))
  File "/usr/local/Cellar/apache-spark/2.1.1/libexec/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1124, in __call__
  File "/usr/local/Cellar/apache-spark/2.1.1/libexec/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1088, in _build_args
  File "/usr/local/Cellar/apache-spark/2.1.1/libexec/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1075, in _get_args
  File "/usr/local/Cellar/apache-spark/2.1.1/libexec/python/lib/py4j-0.10.4-src.zip/py4j/java_collections.py", line 512, in convert

TypeError: 'Column' object is not callable

What I am looking for is something similar to

df.filter(pyspark.sql.functions.array_contains(df._2, 'v4'))

but I don't want to use a static value. I want to use the value from column _1.

Upvotes: 0

Views: 337

Answers (1)

Alper t. Turker

Reputation: 35219

You have to use an expression for this:

df.filter("array_contains(_2, _1)").show()
+---+------------+
| _1|          _2|
+---+------------+
| v1|[v1, v2, v3]|
| v4|[v1, v2, v4]|
+---+------------+
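If you prefer to keep the condition as a Column object, so you can combine it with other Column predicates, the same SQL fragment can be wrapped in expr. A minimal self-contained sketch, assuming a local SparkSession and the same _1/_2 column names:

from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Same data as above; createDataFrame avoids the sc.parallelize(...).toDF() step
df = spark.createDataFrame(
    [('v1', ['v1', 'v2', 'v3']), ('v4', ['v1', 'v2', 'v4'])],
    ['_1', '_2'])

# expr() parses the SQL fragment into a Column, so it can be combined
# with other predicates using & and |
df.filter(expr("array_contains(_2, _1)")).show()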

Upvotes: 1
