Reputation: 1976
Let's say I have the following Spark dataframe, called df:
                         <------Time-resolution------>
+------------+----------+---------+---------+---------+
|    Name    |   date   |  00-24  |  00-12  |  12-24  |
+------------+----------+---------+---------+---------+
|     X1     |2020-10-20|   137   |    68   |    69   |
|     X2     |2020-10-22|   132   |    66   |    66   |
|     X3     |2020-10-24|   132   |    64   |    68   |
|     X4     |2020-10-25|   587   |   292   |   295   |
|     X5     |2020-10-29|   134   |    67   |    67   |
+------------+----------+---------+---------+---------+
I want to create 4 widgets at the top of my notebook in Databricks, using PySpark in the form of dbutils.widgets.dropdown(), populated from the available data:

1. DATE_FROM
2. DATE_TO
3. Time resolution of interest (choices: 00-24 | 00-12 | 12-24)
4. Name of interest

What I have tried, based on this answer & that answer: I could manage to do this for the 1st and 2nd items as below:
dbutils.widgets.removeAll()

from pyspark.sql import functions as F

# compute the minimum and maximum dates available in the data
max_date = df.select(F.max('date')).first()['max(date)']
min_date = df.select(F.min('date')).first()['min(date)']
print(min_date)
print(max_date)

# note: widget choices and defaultValue must be strings, and defaultValue must be
# one of the choices; ['date'] here is just a literal placeholder, not real dates
dbutils.widgets.dropdown(name = "DATE_FROM", defaultValue = min_date, choices = ['date'])
dbutils.widgets.dropdown(name = "DATE_TO", defaultValue = max_date, choices = ['date'])
#dbutils.widgets.text(name = "DATE_FROM", defaultValue = min_date)
#dbutils.widgets.text(name = "DATE_TO", defaultValue = max_date)
For the 3rd item I only have a naive, hard-coded idea:
channel = ['00-24', '00-12', '12-24']
dbutils.widgets.dropdown(name = "Time_Resolution_Of_Interest", defaultValue = "00-24" , choices = [str(x) for x in channel] + ["None"])
For the last item I want to build the list of names of interest, but I couldn't manage to map the rows to String and pass the list the way the Scala version does:
#Get interested Time resolution from widget
dropdownColumn = dbutils.widgets.get("Time_Resolution_Of_Interest")

# compute the maximum value in the selected time-resolution column;
# note: this is the max *value* of that column, not a name, and the result
# column is named 'max(<column>)', so index positionally
max_Top_Name = df.select(F.max(dropdownColumn)).first()[0]

NUM_OF_NAMES_FOR_DROPDOWN = 5

#Scala version works
#val Name_list = df.select("Name").take(NUM_OF_NAMES_FOR_DROPDOWN).map(i=>i.getAs[String]("Name"))
#dbutils.widgets.dropdown("Name", "X1", Name_list.toSeq, "Username Of Interest")

#PySpark version doesn't work: take() already returns a Python list of Rows,
#so there is no .rdd attribute on it and this line raises AttributeError
Name_list = df.select("Name").take(NUM_OF_NAMES_FOR_DROPDOWN).rdd.flatMap(lambda x: x).collect()
dbutils.widgets.dropdown(name = "Name", defaultValue = max_Top_Name, choices = [str(x) for x in Name_list] + ["None"])
In the end I want to filter the records for that specific Name and the selected time resolution over the given date range, and update the frame, according to this answer, as below:
selected_widgets = ['DATE_FROM', 'DATE_TO', 'Time_Resolution_Of_Interest', 'Name_Of_Interest']
# this is where I am stuck: getArgument() expects a single widget name, not a list,
# and DataFrame has no isin() method (isin is a Column method)
myList = getArgument(selected_widgets).split(",")
display(df.filter(df.isin(myList)))
I expect to reach the following table, e.g. for the widget values Name: X1 and Time-resolution: 00-24, over the date range from 2020-10-20 till 2020-11-20:
+------------+----------+---------+
|    Name    |   date   |  00-24  |
+------------+----------+---------+
|     X1     |2020-10-20|   137   |
|     X1     |2020-10-21|   111   |
|     X1     |2020-10-22|    99   |
|     X1     |2020-10-23|   123   |
|     X1     |2020-10-24|   101   |
|    ...     |    ...   |   ...   |
+------------+----------+---------+
Upvotes: 0
Views: 1693
Reputation: 27467
What you could do is first build the widgets as you are doing, then get the individual values from each widget and filter on them to get the end result. See the sample code below; it may not match your requirement 1-to-1, but it should guide you to what you want.
Create date widgets:
from pyspark.sql.functions import min, max
dbutils.widgets.removeAll()
# collect all available dates for the dropdown choices
date = [row[0] for row in data.select("date").collect()]
max_min_date = data.select(max('date'),min('date')).first()
min_date = max_min_date['min(date)']
max_date = max_min_date['max(date)']
print(date)
print(min_date)
print(max_date)
dbutils.widgets.dropdown(name = "DATE_FROM", defaultValue = min_date , choices = date)
dbutils.widgets.dropdown(name = "DATE_TO", defaultValue = max_date, choices = date)
Create the Time Resolution widget using the schema; this allows you to build the list of time columns dynamically:
channel = [f.name for f in data.schema.fields if f.name not in ['Name', 'date']]  # case-sensitive match against the actual column names
print(channel)
dbutils.widgets.dropdown(name = "Time_Resolution_Of_Interest", defaultValue = "00-24" , choices = [str(x) for x in channel] + ["None"])
Create Name widget:
from pyspark.sql.functions import col
dropdownColumn = dbutils.widgets.get("Time_Resolution_Of_Interest")
NUM_OF_NAMES_FOR_DROPDOWN = 5
# sort by the selected time column descending and take the top 5 names
name_limit = [name[0] for name in
              data.select("Name").orderBy(col(dropdownColumn), ascending=False).take(NUM_OF_NAMES_FOR_DROPDOWN)]
dbutils.widgets.dropdown(name = "Name", defaultValue = 'X1' , choices = [str(x) for x in name_limit] + ["None"])
Finally, filter data based on widget values:
date_from_val = dbutils.widgets.get("DATE_FROM")
date_to_val = dbutils.widgets.get("DATE_TO")
time_val = dbutils.widgets.get("Time_Resolution_Of_Interest")
name_val = dbutils.widgets.get("Name")
# filter first, then project, so the 'date' column is still visible to the
# predicate and can be kept in the output
result = data.where(f"Name = '{name_val}' and date between '{date_from_val}' and '{date_to_val}'").select("Name", "date", time_val)
display(result)
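As a side note, the same filter can be written with Column expressions instead of an interpolated SQL string, which avoids quoting issues if a name value ever contains a quote (equivalent under the same column names):

from pyspark.sql.functions import col

# same predicate as the SQL string above, expressed as Column operations
result = (data
          .where((col("Name") == name_val) &
                 col("date").between(date_from_val, date_to_val))
          .select("Name", "date", time_val))
display(result)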
Upvotes: 0