Reputation: 53
I need to get all the table columns present in a particular database in Azure Databricks in one go. I know how to do this in SQL Server using the following query; I need the same kind of implementation in Databricks.
select schema_name(tab.schema_id) as schema_name,
tab.name as table_name,
col.column_id,
col.name as column_name,
t.name as data_type,
col.max_length,
col.precision
from sys.tables as tab
inner join sys.columns as col
on tab.object_id = col.object_id
left join sys.types as t
on col.user_type_id = t.user_type_id
order by schema_name,
table_name,
column_id;
Upvotes: 0
Views: 6739
Reputation: 321
There are many ways to achieve this; one is to use some basic Python with DataFrames, a list, and a loop.
First, retrieve all the table names, then for each table retrieve its description, which contains every column name with its data type. We use two Spark SQL queries:
1: Table_name = spark.sql("SHOW TABLES FROM default")
(default is the default database name in Databricks)
Result:
+--------+--------------+-----------+
|database|     tableName|isTemporary|
+--------+--------------+-----------+
| default|convertedtable|      false|
| default|   delta_table|      false|
| default|       display|      false|
| default|        events|      false|
| default|    export_csv|      false|
| default|       flights|      false|
| default|    play_store|      false|
+--------+--------------+-----------+
2: table_describe = spark.sql("DESCRIBE default.play_store")
Result:
+--------------+---------+-------+
|      col_name|data_type|comment|
+--------------+---------+-------+
|           App|   string|   null|
|      Category|   string|   null|
|        Rating|   string|   null|
|       Reviews|   string|   null|
|          Size|   string|   null|
|      Installs|   string|   null|
|          Type|   string|   null|
|         Price|   string|   null|
|Content Rating|   string|   null|
|        Genres|   string|   null|
|  Last Updated|   string|   null|
|   Current Ver|   string|   null|
|   Android Ver|   string|   null|
+--------------+---------+-------+
Now the main part
from pyspark.sql.functions import lit

# get every table name in the default database
Table_name = spark.sql("SHOW TABLES FROM default")
mvv_array = [row.tableName for row in Table_name.select('tableName').collect()]

# describe each table, tag its rows with the table name, and union everything
temp_df = None
for i in mvv_array:
    tmp = spark.sql('DESCRIBE default.' + i).withColumn('table_name', lit(i))
    if temp_df is None:
        temp_df = tmp
    else:
        temp_df = temp_df.unionAll(tmp)

temp_df.show()
And here is the result:
+-----------------+---------+-------+--------------+
|         col_name|data_type|comment|    table_name|
+-----------------+---------+-------+--------------+
|             Year|      int|   null|convertedtable|
|            Month|      int|   null|convertedtable|
|       DayofMonth|      int|   null|convertedtable|
|        DayOfWeek|      int|   null|convertedtable|
|          DepTime|   string|   null|convertedtable|
|       CRSDepTime|      int|   null|convertedtable|
|          ArrTime|   string|   null|convertedtable|
|       CRSArrTime|      int|   null|convertedtable|
|    UniqueCarrier|   string|   null|convertedtable|
|        FlightNum|      int|   null|convertedtable|
|          TailNum|   string|   null|convertedtable|
|ActualElapsedTime|   string|   null|convertedtable|
|   CRSElapsedTime|   string|   null|convertedtable|
|          AirTime|   string|   null|convertedtable|
|         ArrDelay|   string|   null|convertedtable|
|         DepDelay|   string|   null|convertedtable|
|           Origin|   string|   null|convertedtable|
|             Dest|   string|   null|convertedtable|
|         Distance|      int|   null|convertedtable|
|           TaxiIn|   string|   null|convertedtable|
+-----------------+---------+-------+--------------+
only showing top 20 rows
Take whatever you need from the new DataFrame, for example as in the sketch below.
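A minimal sketch, assuming the temp_df built above and reusing table and column names from the example output:

from pyspark.sql.functions import col

# columns of one specific table
temp_df.filter(col('table_name') == 'convertedtable') \
       .select('col_name', 'data_type') \
       .show()

# every table that contains a column named 'FlightNum'
temp_df.filter(col('col_name') == 'FlightNum') \
       .select('table_name') \
       .distinct() \
       .show()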
I hope this fulfills what you want.
If it solves your problem, please mark it as the answer and upvote.
Upvotes: 3
Reputation: 322
The answer by Sohel Reza is good, but it doesn't cover nested columns (StructType): it only shows the top-level column names.
If you want a simple view which lists all the databases, tables, and columns, I've written about one such solution here: https://medium.com/helmes-people/how-to-view-all-databases-tables-and-columns-in-databricks-9683b12fee10
It creates a view with database, table name and column name.
You can also copy/paste the full code from below:
from pyspark.sql.types import StructType

# get field name from schema (recursive for getting nested values)
def get_schema_field_name(field, parent=None):
    if type(field.dataType) == StructType:
        if parent == None:
            prt = field.name
        else:
            prt = parent + "." + field.name  # using dot notation
        res = []
        for i in field.dataType.fields:
            res.append(get_schema_field_name(i, prt))
        return res
    else:
        if parent == None:
            res = field.name
        else:
            res = parent + "." + field.name
        return res

# flatten list, from https://stackoverflow.com/a/12472564/4920394
def flatten(S):
    if S == []:
        return S
    if isinstance(S[0], list):
        return flatten(S[0]) + flatten(S[1:])
    return S[:1] + flatten(S[1:])

# list of databases
db_list = [x[0] for x in spark.sql("SHOW DATABASES").rdd.collect()]
for i in db_list:
    spark.sql("SHOW TABLES IN {}".format(i)).createOrReplaceTempView(str(i) + "TablesList")

# create a query for fetching all tables from all databases
union_string = "SELECT database, tableName FROM "
for idx, item in enumerate(db_list):
    if idx == 0:
        union_string += str(item) + "TablesList WHERE isTemporary = 'false'"
    else:
        union_string += " UNION ALL SELECT database, tableName FROM {}".format(str(item) + "TablesList WHERE isTemporary = 'false'")
spark.sql(union_string).createOrReplaceTempView("allTables")

# full list = schema, table, column
full_list = []
for i in spark.sql("SELECT * FROM allTables").collect():
    table_name = i[0] + "." + i[1]
    table_schema = spark.sql("SELECT * FROM {}".format(table_name))
    column_list = []
    for j in table_schema.schema:
        column_list.append(get_schema_field_name(j))
    column_list = flatten(column_list)
    for k in column_list:
        full_list.append([i[0], i[1], k])
spark.createDataFrame(full_list, schema=['database', 'tableName', 'columnName']).createOrReplaceTempView("allColumns")
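Once the allColumns view exists, you can query it with ordinary Spark SQL. A minimal sketch (the database filter value is just an illustration):

# list every column of every table in the default database
spark.sql("""
    SELECT database, tableName, columnName
    FROM allColumns
    WHERE database = 'default'
    ORDER BY tableName, columnName
""").show(truncate=False)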
Upvotes: 2