Reputation: 1976
Let's say I have the following Spark dataframe, called df:
                         <------Time-resolution------>
+------------+----------+---------+---------+---------+
|    Name    |   date   |  00-24  |  00-12  |  12-24  |
+------------+----------+---------+---------+---------+
|     X1     |2020-10-20|   137   |    68   |    69   |
|     X2     |2020-10-22|   132   |    66   |    66   |
|     X3     |2020-10-24|   132   |    64   |    68   |
|     X4     |2020-10-25|   587   |   292   |   295   |
|     X5     |2020-10-29|   134   |    67   |    67   |
+------------+----------+---------+---------+---------+
I want to create 4 widgets at the top of my notebook in Databricks, using PySpark in the form of dbutils.widgets.dropdown(), populated from the available data:

1. DATE_FROM
2. DATE_TO
3. Time resolution of interest (choices: 00-24 | 00-12 | 12-24)
4. Name of interest

What I have tried, based on this answer & that answer: I could manage to do this for the 1st and 2nd items as below:
dbutils.widgets.removeAll()

from pyspark.sql import functions as F

# compute the minimum and maximum dates available in the data
max_date = df.select(F.max('date')).first()['max(date)']
min_date = df.select(F.min('date')).first()['min(date)']
print(min_date)
print(max_date)

# note: widget choices and defaultValue must be strings, and defaultValue must be
# one of the choices; ['date'] here is just a literal placeholder, not real dates
dbutils.widgets.dropdown(name = "DATE_FROM", defaultValue = min_date, choices = ['date'])
dbutils.widgets.dropdown(name = "DATE_TO", defaultValue = max_date, choices = ['date'])
#dbutils.widgets.text(name = "DATE_FROM", defaultValue = min_date)
#dbutils.widgets.text(name = "DATE_TO", defaultValue = max_date)
For the 3rd item I only have a naive, hard-coded idea:
channel = ['00-24', '00-12', '12-24']
dbutils.widgets.dropdown(name = "Time_Resolution_Of_Interest", defaultValue = "00-24" , choices = [str(x) for x in channel] + ["None"])
For the last item I want to build the list of names of interest, but I couldn't manage to map the rows to String and pass the list the way the Scala version does:
#Get interested Time resolution from widget
dropdownColumn = dbutils.widgets.get("Time_Resolution_Of_Interest")

# compute the maximum value in the selected time-resolution column;
# note: this is the max *value* of that column, not a name, and the result
# column is named 'max(<column>)', so index positionally
max_Top_Name = df.select(F.max(dropdownColumn)).first()[0]

NUM_OF_NAMES_FOR_DROPDOWN = 5

#Scala version works
#val Name_list = df.select("Name").take(NUM_OF_NAMES_FOR_DROPDOWN).map(i=>i.getAs[String]("Name"))
#dbutils.widgets.dropdown("Name", "X1", Name_list.toSeq, "Username Of Interest")

#PySpark version doesn't work: take() already returns a Python list of Rows,
#so there is no .rdd attribute on it and this line raises AttributeError
Name_list = df.select("Name").take(NUM_OF_NAMES_FOR_DROPDOWN).rdd.flatMap(lambda x: x).collect()
dbutils.widgets.dropdown(name = "Name", defaultValue = max_Top_Name, choices = [str(x) for x in Name_list] + ["None"])
In the end I want to filter the records for that specific Name and the selected time resolution over the given date range, and update the frame, according to this answer, as below:
selected_widgets = ['DATE_FROM', 'DATE_TO', 'Time_Resolution_Of_Interest', 'Name_Of_Interest']
# this is where I am stuck: getArgument() expects a single widget name, not a list,
# and DataFrame has no isin() method (isin is a Column method)
myList = getArgument(selected_widgets).split(",")
display(df.filter(df.isin(myList)))
I expect to reach the following table, e.g. for the widget values Name: X1 and Time-resolution: 00-24, over the date range from 2020-10-20 till 2020-11-20:
+------------+----------+---------+
|    Name    |   date   |  00-24  |
+------------+----------+---------+
|     X1     |2020-10-20|   137   |
|     X1     |2020-10-21|   111   |
|     X1     |2020-10-22|    99   |
|     X1     |2020-10-23|   123   |
|     X1     |2020-10-24|   101   |
|    ...     |    ...   |   ...   |
+------------+----------+---------+
Upvotes: 0
Views: 1693
Reputation: 27467
What you could do is first build the widgets as you are doing, then get the individual values from each widget and filter on them to get the end result. See the sample code below; it may not match your requirement 1-to-1, but it should guide you to what you want.
Create date widgets:
from pyspark.sql.functions import min, max
dbutils.widgets.removeAll()
# collect all available dates for the dropdown choices
date = [row[0] for row in data.select("date").collect()]
max_min_date = data.select(max('date'),min('date')).first()
min_date = max_min_date['min(date)']
max_date = max_min_date['max(date)']
print(date)
print(min_date)
print(max_date)
dbutils.widgets.dropdown(name = "DATE_FROM", defaultValue = min_date , choices = date)
dbutils.widgets.dropdown(name = "DATE_TO", defaultValue = max_date, choices = date)
Create the Time Resolution widget using the schema; this allows you to build the list of time columns dynamically:
channel = [f.name for f in data.schema.fields if f.name not in ['Name', 'date']]  # case-sensitive match against the actual column names
print(channel)
dbutils.widgets.dropdown(name = "Time_Resolution_Of_Interest", defaultValue = "00-24" , choices = [str(x) for x in channel] + ["None"])
Create Name widget:
from pyspark.sql.functions import col
dropdownColumn = dbutils.widgets.get("Time_Resolution_Of_Interest")
NUM_OF_NAMES_FOR_DROPDOWN = 5
# sort by the selected time column descending and take the top 5 names
name_limit = [name[0] for name in
              data.select("Name").orderBy(col(dropdownColumn), ascending=False).take(NUM_OF_NAMES_FOR_DROPDOWN)]
dbutils.widgets.dropdown(name = "Name", defaultValue = 'X1' , choices = [str(x) for x in name_limit] + ["None"])
Finally, filter data based on widget values:
date_from_val = dbutils.widgets.get("DATE_FROM")
date_to_val = dbutils.widgets.get("DATE_TO")
time_val = dbutils.widgets.get("Time_Resolution_Of_Interest")
name_val = dbutils.widgets.get("Name")
# filter first, then project, so the 'date' column is still visible to the
# predicate and can be kept in the output
result = data.where(f"Name = '{name_val}' and date between '{date_from_val}' and '{date_to_val}'").select("Name", "date", time_val)
display(result)
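As a side note, the same filter can be written with Column expressions instead of an interpolated SQL string, which avoids quoting issues if a name value ever contains a quote (equivalent under the same column names):

from pyspark.sql.functions import col

# same predicate as the SQL string above, expressed as Column operations
result = (data
          .where((col("Name") == name_val) &
                 col("date").between(date_from_val, date_to_val))
          .select("Name", "date", time_val))
display(result)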
Upvotes: 0