Reputation: 3
I am trying to find a way to filter based on the value of one column by searching for it in another column. If I have a value in one column, I want to verify that the value is also present in the array held in a different column.
I tried the following:
df = sc.parallelize([('v1', ['v1', 'v2', 'v3']), ('v4', ['v1', 'v2', 'v4'])]).toDF()
df.filter(pyspark.sql.functions.array_contains(df._2, df._1)).show()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/Cellar/apache-spark/2.1.1/libexec/python/pyspark/sql/functions.py", line 1648, in array_contains
return Column(sc._jvm.functions.array_contains(_to_java_column(col), value))
File "/usr/local/Cellar/apache-spark/2.1.1/libexec/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1124, in __call__
File "/usr/local/Cellar/apache-spark/2.1.1/libexec/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1088, in _build_args
File "/usr/local/Cellar/apache-spark/2.1.1/libexec/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1075, in _get_args
File "/usr/local/Cellar/apache-spark/2.1.1/libexec/python/lib/py4j-0.10.4-src.zip/py4j/java_collections.py", line 512, in convert
TypeError: 'Column' object is not callable
What I am looking for is something similar to
df.filter(pyspark.sql.functions.array_contains(df._2, 'v4'))
but I don't want to use a static value. I want to use the value from column _1.
Upvotes: 0
Views: 337
Reputation: 35219
You have to use an expression for this:
df.filter("array_contains(_2, _1)").show()
+---+------------+
| _1| _2|
+---+------------+
| v1|[v1, v2, v3]|
| v4|[v1, v2, v4]|
+---+------------+
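If you prefer to stay in the Column API, the same SQL fragment can be wrapped with pyspark.sql.functions.expr, which parses it into a Column that you can combine with other Column expressions. A minimal sketch, assuming a running pyspark session (the spark variable below is the usual SparkSession from the shell, not part of the original post):

from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

# assumed entry point, as in any pyspark shell session
spark = SparkSession.builder.getOrCreate()

df = spark.sparkContext.parallelize(
    [('v1', ['v1', 'v2', 'v3']), ('v4', ['v1', 'v2', 'v4'])]
).toDF()

# expr() parses the SQL fragment into a Column, so the result can be
# combined with other Column predicates (e.g. with & or |) if needed
df.filter(expr("array_contains(_2, _1)")).show()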
Upvotes: 1