Filter dataframe by key in a list pyspark

Question

I have a DataFrame:

d1 = [({'the town': 1, 'County Council s': 2, 'email':5},2),
      ({'Mayor': 2, 'Indiana': 2}, 4),
      ({'Congress': 2, 'Justice': 2,'country': 2, 'veterans':1},6)
]
df1 = spark.createDataFrame(d1, ['dct', 'count'])
df1.show()

ignore_lst = ['County Council s', 'emal','Indiana']
filter_lst = ['Congress','town','Mayor', 'Indiana']

I want to write two functions: the first keeps only the keys of the dct column that are not in ignore_lst, and the second keeps only the keys that are in filter_lst.

The result should be two new columns containing dictionaries whose keys have been filtered by ignore_lst and filter_lst, respectively.

Bartosz Gajda · Accepted Answer

These two UDFs should be sufficient for your case:

from pyspark.sql.functions import col, udf

d1 = [({'the town': 1, 'County Council s': 2, 'email':5},2),
      ({'Mayor': 2, 'Indiana': 2}, 4),
      ({'Congress': 2, 'Justice': 2,'country': 2, 'veterans':1},6)
]
ignore_lst = ['County Council s', 'emal','Indiana']
filter_lst = ['Congress','town','Mayor', 'Indiana']

df1 = spark.createDataFrame(d1, ['dct', 'count'])

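# Keep only the entries whose key is not in ignore_lst.
# No return type is given, so @udf defaults to a string column.
@udf
def apply_ignore_lst(dct):
    return {k: v for k, v in dct.items() if k not in ignore_lst}

# Keep only the entries whose key is in filter_lst (exact match).
@udf
def apply_filter_lst(dct):
    return {k: v for k, v in dct.items() if k in filter_lst}

# The second UDF is applied to the output of the first.
(df1
 .withColumn("apply_ignore_lst", apply_ignore_lst(col("dct")))
 .withColumn("apply_filter_lst", apply_filter_lst(col("apply_ignore_lst")))
 .show(truncate=False))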

+----------------------------------------------------------+-----+----------------------------------------------+----------------+
|dct                                                       |count|apply_ignore_lst                              |apply_filter_lst|
+----------------------------------------------------------+-----+----------------------------------------------+----------------+
|{the town -> 1, County Council s -> 2, email -> 5}        |2    |{the town=1, email=5}                         |{}              |
|{Indiana -> 2, Mayor -> 2}                                |4    |{Mayor=2}                                     |{Mayor=2}       |
|{Justice -> 2, Congress -> 2, country -> 2, veterans -> 1}|6    |{Congress=2, Justice=2, country=2, veterans=1}|{Congress=2}    |
+----------------------------------------------------------+-----+----------------------------------------------+----------------+
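
The first row ends up empty in apply_filter_lst because the comparison is an exact string match: 'town' in filter_lst does not match the key 'the town', just as 'emal' in ignore_lst does not match 'email'. A minimal plain-Python sketch of the same two filters, runnable without a Spark session, makes this easy to check. The helper names drop_ignored and keep_filtered and the use of sets are illustrative assumptions, not part of the answer above.

```python
# Plain-Python sketch of the two filters; no Spark session required.
rows = [
    {'the town': 1, 'County Council s': 2, 'email': 5},
    {'Mayor': 2, 'Indiana': 2},
    {'Congress': 2, 'Justice': 2, 'country': 2, 'veterans': 1},
]
ignore_lst = ['County Council s', 'emal', 'Indiana']
filter_lst = ['Congress', 'town', 'Mayor', 'Indiana']

# Sets give constant-time membership checks (an optional tweak,
# not something the original answer does).
ignore_set = set(ignore_lst)
filter_set = set(filter_lst)


def drop_ignored(dct):
    """Keep entries whose key is NOT in ignore_lst (hypothetical helper)."""
    return {k: v for k, v in dct.items() if k not in ignore_set}


def keep_filtered(dct):
    """Keep entries whose key IS in filter_lst, by exact string match."""
    return {k: v for k, v in dct.items() if k in filter_set}


# Mirror the chained withColumn calls: filter the already-ignored dicts.
ignored = [drop_ignored(r) for r in rows]
filtered = [keep_filtered(d) for d in ignored]

for before, after in zip(ignored, filtered):
    print(before, '->', after)
```

If partial matches such as 'town' inside 'the town' are wanted, the membership test would have to be replaced with a substring check; that is a change in behavior, not something the UDFs above do.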
