erinnnn
erinnnn

Reputation: 11

How to check if a column is the key of the other dictionary column in pyspark?

i have a pyspark dataframe with two columns interval and dataMap

>> df.show()
 interval       dataMap
    0           {0 -> 10, 1 -> 6}
    1           {}
    2           {2 -> 8, 3 -> 7}
    3           {3 -> 9}

Now i want to add 1 more column to show whether column interval is a key in column dataMap column. The expected result like this:

>> df.show()
 interval       dataMap                 exist
    0           {0 -> 10, 1 -> 6}       true
    1           {}                      false
    2           {2 -> 8, 3 -> 7}        true
    3           {3 -> 9}                true

How should we achieve that? Do we have any function like:

df = (
       df.withColumn("exist", 
         when(col("interval").isin(col("dataMap")),False)
         .otherwise(True)  
       )
)

thanks

Upvotes: 1

Views: 555

Answers (2)

Luiz Viola
Luiz Viola

Reputation: 2436

df  = spark.createDataFrame(
  [
('0','{0 -> 10, 1 -> 6}'),
('1','{}'),
('2','{2 -> 8, 3 -> 7}'),
('3','{3 -> 9}')
  ], ['interval','dataMap']
)


from pyspark.sql import functions as F, types as T
from pyspark.sql.functions import udf

def valida(x, dic):
    return x in dic

valida_udf = udf(valida, T.BooleanType())

df\
    .withColumn('dic', F.regexp_replace('dataMap', '->', ':'))\
    .withColumn('exist', valida_udf('interval', 'dic'))\
    .drop('dic')\
    .show(truncate=False)
+--------+-----------------+-----+
|interval|dataMap          |exist|
+--------+-----------------+-----+
|0       |{0 -> 10, 1 -> 6}|true |
|1       |{}               |false|
|2       |{2 -> 8, 3 -> 7} |true |
|3       |{3 -> 9}         |true |
+--------+-----------------+-----+

Upvotes: 0

ScootCork
ScootCork

Reputation: 3686

You can use array_contains and map_keys to do so.

from pyspark.sql.functions import map_keys, col, array_contains

df = spark.sql(
    '''
    SELECT 1 as interval, map(1, 'a', 2, 'b') as dataMap
    UNION ALL
    SELECT 3 as interval, map(1, 'a', 2, 'b') as dataMap
    ''')

df = df.withColumn("exist", array_contains(map_keys(col('dataMap')), col("interval")))

df.show()

+--------+----------------+-----+
|interval|         dataMap|exist|
+--------+----------------+-----+
|       1|{1 -> a, 2 -> b}| true|
|       3|{1 -> a, 2 -> b}|false|
+--------+----------------+-----+

Upvotes: 3

Related Questions