Reputation: 195
I am trying to get the min and max date from each table in a database in databricks.
I already started with the following:
results =[]
tables_list = spark.sql('show tables in table')
for row in tables_list.collect():
tablename = row.tableName
min_update = spark.sql(f"SELECT MIN(date) FROM table.{tablename}").collect()[0][0]
max_update = spark.sql(f"SELECT MAX(date) FROM table.{tablename}").collect()[0][0]
results.append((tablename,min_update,max_update))
df_update = spark.createDataFrame(results, schema = ['tablename', 'min_update', 'max_update'])
but when I df_update.display()
I only get the last table where I expect to have 60 tables each with the max and min dates for each row of table.
Upvotes: 0
Views: 504
Reputation: 6114
Since you want to get the max and min date from all the tables, you need to include the append operation inside the for loop.
Look at the following where I have modified the same code and got the required results. The following is a dataframe after executing show tables:
results =[]
tables_list = spark.sql('show tables in default')
tables_list.show()
for row in tables_list.collect():
tablename = row.tableName
#default is the name of my database and dt is date column
min_update = spark.sql(f"SELECT MIN(dt) FROM default.{tablename}").collect()[0][0]
max_update = spark.sql(f"SELECT MAX(dt) FROM default.{tablename}").collect()[0][0]
results.append((tablename,min_update,max_update))
df_update = spark.createDataFrame(results, schema = ['tablename', 'min_update', 'max_update'])
df_update.display()
Upvotes: 1