Keithx

Reputation: 3158

Showing tables from specific database with Pyspark and Hive

I have several databases in a Hive instance, each containing some tables. I'd like to show the tables for one specific database (let's say 3_db).

+------------------+--+
|  database_name   |
+------------------+--+
| 1_db             |
| 2_db             |
| 3_db             |
+------------------+--+

If I enter beeline from bash, nothing complex is needed; I just do the following:

show databases;
show tables from 3_db;

When I'm using pyspark via an IPython notebook, however, the same trick doesn't work, and the second line (show tables from 3_db) gives me an error:

sqlContext.sql('show databases').show()
sqlContext.sql('show tables from 3_db').show()

What is wrong, and why does the same code work in one place but not in the other?

Upvotes: 29

Views: 73127

Answers (3)

Giorgos Myrianthous

Reputation: 39930

There are two possible ways to achieve this, but they differ a lot in terms of efficiency.


Using SQL

This is the most efficient approach:

from pyspark.sql import SparkSession

spark_session = SparkSession.builder.getOrCreate()
spark_session.sql("show tables in db_name").show()

Using catalog.listTables()

The following is less efficient than the previous approach, since it also loads the tables' metadata:

from pyspark.sql import SparkSession

spark_session = SparkSession.builder.getOrCreate()
spark_session.catalog.listTables("db_name")

Upvotes: 7

aelesbao

Reputation: 513

Another possibility is to use the Catalog methods:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.catalog.listTables("3_db")

Just be aware that in PySpark this method returns a list and in Scala, it returns a DataFrame.

Upvotes: 18

David דודו Markovitz

Reputation: 44981

sqlContext.sql("show tables in 3_db").show()

Upvotes: 36
