Reputation: 905
I'm a newbie in pyspark and I'm looking for finding a column1 item in column2 list of items. Let's have an example.
Let's have an easy example with google search - 3 columns - query
- string, hrefs
- list and clicked_url
- string and I need to check if clicked_url
is in hrefs
.
There are 3 possible scenarios that can occur:
If clicked_url is in hrefs list THEN keep the row
If clicked_url is not in hrefs list THEN filter out that row -remove. (you can imagine that as a click on advertisement)
If clicked_url is empty (there was no clicked_url) THEN keep the row but change the value of clicked_url to 0.
query | hrefs | clicked_url
------------------------------------
car |[url1,...url10] | url1
monkey |[url11,...url20]| url11
mouse |[url21,...url30]| url11
donkey |[url31,...url40]| "" - empty string
ball |[url41,...url50]| url45
monkey |[url11,...url20]| url1
Desired output:
query | hrefs | clicked_url
------------------------------------
car |[url1,...url10] | url1
monkey |[url11,...url20]| url11
donkey |[url31,...url40]| 0
ball |[url41,...url50]| url45
Thanks in advance!
Upvotes: 2
Views: 1839
Reputation: 8410
Pyspark way to do this:
For array_contains
you just have to use an expression F.expr
to be able to send the value
part as a column.
from pyspark.sql import functions as F
df.withColumn("clicked_url", F.when(F.col("clicked_url")=="", F.lit(0)).otherwise(F.col("clicked_url")))\
.withColumn("boolean", F.expr("""array_contains(hrefs,clicked_url)"""))\
.filter("boolean=true or clicked_url=0").drop("boolean").show()
+------+--------------------+-----------+
| query| hrefs|clicked_url|
+------+--------------------+-----------+
| car| [url1, url10]| url1|
|monkey| [url11, url20]| url11|
|donkey| [url31, url40]| 0|
| ball|[url41, url45, ur...| url45|
+------+--------------------+-----------+
As .filter
can also accept expression
, you can input the array_contains
in there only.
from pyspark.sql import functions as F
df.withColumn("clicked_url", F.when(F.col("clicked_url")=="", F.lit(0))\
.otherwise(F.col("clicked_url")))\
.filter("array_contains(hrefs,clicked_url)=true or clicked_url=0").show()
+------+--------------------+-----------+
| query| hrefs|clicked_url|
+------+--------------------+-----------+
| car| [url1, url10]| url1|
|monkey| [url11, url20]| url11|
|donkey| [url31, url40]| 0|
| ball|[url41, url45, ur...| url45|
+------+--------------------+-----------+
Upvotes: 2
Reputation: 9504
Try this:
df.createTempView('data')
new_df = spark.sqlContext.sql("""
select query
, hrefs
, case when clicked_url='' then '0' else clicked_url end as clicked_url
from data
where array_contains(hrefs, clicked_url) or clicked_url=''
""")
new_df.show()
Upvotes: 2