Leemosh

Reputation: 905

Pyspark - column item in another column list

I'm new to PySpark and I'm trying to check whether a column1 value appears in a column2 list of items. Here's an example.

Take a simple Google-search example with 3 columns: query (string), hrefs (list), and clicked_url (string). I need to check whether clicked_url is in hrefs.

There are 3 possible scenarios:

If clicked_url is in the hrefs list, keep the row.

If clicked_url is not in the hrefs list, filter that row out (you can imagine it as a click on an advertisement).

If clicked_url is empty (there was no click), keep the row but change the value of clicked_url to 0.

query  |    hrefs       | clicked_url 
------------------------------------
car    |[url1,...url10] | url1
monkey |[url11,...url20]| url11
mouse  |[url21,...url30]| url11
donkey |[url31,...url40]| "" - empty string
ball   |[url41,...url50]| url45
monkey |[url11,...url20]| url1

Desired output:

query  |    hrefs       | clicked_url 
------------------------------------
car    |[url1,...url10] | url1
monkey |[url11,...url20]| url11
donkey |[url31,...url40]| 0
ball   |[url41,...url50]| url45
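
For reference, a minimal snippet to reproduce this example DataFrame (the url values are placeholders standing in for the abbreviated lists above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data mirroring the tables above; url values are placeholders.
df = spark.createDataFrame(
    [
        ("car", ["url1", "url10"], "url1"),
        ("monkey", ["url11", "url20"], "url11"),
        ("mouse", ["url21", "url30"], "url11"),
        ("donkey", ["url31", "url40"], ""),
        ("ball", ["url41", "url45", "url50"], "url45"),
        ("monkey", ["url11", "url20"], "url1"),
    ],
    ["query", "hrefs", "clicked_url"],
)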

Thanks in advance!

Upvotes: 2

Views: 1839

Answers (2)

murtihash

Reputation: 8410

PySpark way to do this:

For array_contains, use F.expr so you can pass the value argument as a column instead of a literal.

from pyspark.sql import functions as F

# Replace empty clicked_url with 0, flag rows where clicked_url appears
# in hrefs, then keep matching rows or rows with the 0 placeholder.
df.withColumn("clicked_url", F.when(F.col("clicked_url")=="", F.lit(0)).otherwise(F.col("clicked_url")))\
  .withColumn("boolean", F.expr("""array_contains(hrefs,clicked_url)"""))\
  .filter("boolean=true or clicked_url=0").drop("boolean").show()

+------+--------------------+-----------+
| query|               hrefs|clicked_url|
+------+--------------------+-----------+
|   car|       [url1, url10]|       url1|
|monkey|      [url11, url20]|      url11|
|donkey|      [url31, url40]|          0|
|  ball|[url41, url45, ur...|      url45|
+------+--------------------+-----------+

Since .filter can also accept an expression, you can put the array_contains directly in the filter:

from pyspark.sql import functions as F
df.withColumn("clicked_url", F.when(F.col("clicked_url")=="", F.lit(0))\
              .otherwise(F.col("clicked_url")))\
  .filter("array_contains(hrefs,clicked_url)=true or clicked_url=0").show()

+------+--------------------+-----------+
| query|               hrefs|clicked_url|
+------+--------------------+-----------+
|   car|       [url1, url10]|       url1|
|monkey|      [url11, url20]|      url11|
|donkey|      [url31, url40]|          0|
|  ball|[url41, url45, ur...|      url45|
+------+--------------------+-----------+
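
If the missing clicks can also arrive as NULL rather than an empty string, a slightly more defensive variant of the same approach (a sketch, not part of the original answer) is:

from pyspark.sql import functions as F

# Treat NULL and "" the same way, then apply the same array_contains filter.
df.withColumn(
    "clicked_url",
    F.when(F.col("clicked_url").isNull() | (F.col("clicked_url") == ""), F.lit("0"))
     .otherwise(F.col("clicked_url")),
).filter(
    F.expr("array_contains(hrefs, clicked_url)") | (F.col("clicked_url") == "0")
).show()

Using the string '0' rather than the integer 0 keeps the column type explicit; with mixed when/otherwise branches Spark casts them to string anyway.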

Upvotes: 2

Gabio

Reputation: 9504

Try this:

df.createTempView('data')
new_df = spark.sql("""
select query
       , hrefs
       , case when clicked_url='' then '0' else clicked_url end as clicked_url
from data
where array_contains(hrefs, clicked_url) or clicked_url=''
""")
new_df.show()
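
One small caveat (my note, not from the answer): createTempView fails if a view named data already exists, so createOrReplaceTempView('data') is safer for repeated runs. With the sample data above, this should return the same four rows as the DataFrame-API answer.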

Upvotes: 2
