Rory
Rory

Reputation: 383

Filter dataframe by key in a list pyspark

I have dataframe:

 d1 = [({'the town': 1, 'County Council s': 2, 'email':5},2),
      ({'Mayor': 2, 'Indiana': 2}, 4),
      ({'Congress': 2, 'Justice': 2,'country': 2, 'veterans':1},6)
]
df1 = spark.createDataFrame(d1, ['dct', 'count'])
df1.show()

 ignore_lst = ['County Council s', 'emal','Indiana']
filter_lst = ['Congress','town','Mayor', 'Indiana']

I want to write two functions: first function filters keys for the dct column that are not in the ignore_list and the second function filters if the keys are in filter_lst

Thus there will be two columns that contain dictionaries with keys filtered by ignore_list and filter_lst

Upvotes: 0

Views: 514

Answers (2)

Azhar Khan
Azhar Khan

Reputation: 4118

It can be done in one-liner using map_filter:

df1 \
  .withColumn("ignored", F.map_filter("dct", lambda k, _: ~k.isin(ignore_lst))) \
  .withColumn("filtered", F.map_filter("dct", lambda k, _: k.isin(filter_lst)))

Full example:

d1 = [({'the town': 1, 'County Council s': 2, 'email':5},2),
      ({'Mayor': 2, 'Indiana': 2}, 4),
      ({'Congress': 2, 'Justice': 2,'country': 2, 'veterans':1},6)
]
df1 = spark.createDataFrame(d1, ['dct', 'count'])

ignore_lst = ['County Council s', 'emal', 'Indiana']
filter_lst = ['Congress', 'town', 'Mayor', 'Indiana']

df1 = df1 \
        .withColumn("ignored", F.map_filter("dct", lambda k, _: ~k.isin(ignore_lst))) \
        .withColumn("filtered", F.map_filter("dct", lambda k, _: k.isin(filter_lst)))

[Out]:
+----------------------------------------------------------+--------------------------+
|ignored                                                   |filtered                  |
+----------------------------------------------------------+--------------------------+
|{the town -> 1, email -> 5}                               |{}                        |
|{Mayor -> 2}                                              |{Indiana -> 2, Mayor -> 2}|
|{Justice -> 2, Congress -> 2, country -> 2, veterans -> 1}|{Congress -> 2}           |
+----------------------------------------------------------+--------------------------+

Upvotes: 1

Bartosz Gajda
Bartosz Gajda

Reputation: 1167

These two UDFs should be sufficient for your case:

from pyspark.sql.functions import col

d1 = [({'the town': 1, 'County Council s': 2, 'email':5},2),
      ({'Mayor': 2, 'Indiana': 2}, 4),
      ({'Congress': 2, 'Justice': 2,'country': 2, 'veterans':1},6)
]
ignore_lst = ['County Council s', 'emal','Indiana']
filter_lst = ['Congress','town','Mayor', 'Indiana']

df1 = spark.createDataFrame(d1, ['dct', 'count'])

@udf
def apply_ignore_lst(dct):
    return {k:v for k, v in dct.items() if k not in ignore_lst}

@udf
def apply_filter_lst(dct):
    return {k:v for k, v in dct.items() if k in filter_lst}

df1.withColumn("apply_ignore_lst", apply_ignore_lst(col("dct"))).withColumn("apply_filter_lst", apply_filter_lst(col("apply_ignore_lst"))).show(truncate=False)

+----------------------------------------------------------+-----+----------------------------------------------+----------------+
|dct                                                       |count|apply_ignore_lst                              |apply_filter_lst|
+----------------------------------------------------------+-----+----------------------------------------------+----------------+
|{the town -> 1, County Council s -> 2, email -> 5}        |2    |{the town=1, email=5}                         |{}              |
|{Indiana -> 2, Mayor -> 2}                                |4    |{Mayor=2}                                     |{Mayor=2}       |
|{Justice -> 2, Congress -> 2, country -> 2, veterans -> 1}|6    |{Congress=2, Justice=2, country=2, veterans=1}|{Congress=2}    |
+----------------------------------------------------------+-----+----------------------------------------------+----------------+

Upvotes: 1

Related Questions