Vin

Reputation: 177

Getting HDFS Location of Hive Table in Spark

I am trying to parse out the Location from a Hive partitioned table in Spark using this query:

val dsc_table = spark.sql("DESCRIBE FORMATTED data_db.part_table")

I was not able to find any query or any other way in Spark to specifically select the Location column from this output.

Upvotes: 6

Views: 9231

Answers (4)

Rafa

Reputation: 527

Didn't find a PySpark answer here, so here is one:

from pyspark.sql import functions as F

table_location = (spark.sql("describe formatted DB.TableName")
    .filter(F.col('col_name') == 'Location')   # keep only the Location row
    .select("data_type")                       # the value sits in the data_type column
    .toPandas().astype(str)['data_type'].values[0])

Upvotes: 2

notNull

Reputation: 31540

You can also call the .toDF method on the desc formatted table output and then filter the resulting dataframe.

DataFrame API:

scala> :paste
spark.sql("desc formatted data_db.part_table")
.toDF //convert to a dataframe with 3 columns: col_name, data_type, comment
.filter('col_name === "Location") //keep only the Location row
.collect()(0)(1) //first row, second column (data_type)
.toString

Result:

String = hdfs://nn:8020/location/part_table
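The same filter can also return a typed value directly, avoiding positional indexing into the collected array. A minimal sketch, assuming the usual 3-column output of desc formatted and the spark-shell implicits:

scala> :paste
spark.sql("desc formatted data_db.part_table")
.filter($"col_name" === "Location") //keep only the Location row
.select("data_type") //the value sits in the data_type column
.first() //throws if no Location row exists
.getString(0)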

(or)

RDD API:

scala> :paste
spark.sql("desc formatted data_db.part_table")
.collect()
.filter(r => r(0).equals("Location")) //filter on the col_name value
.map(r => r(1)) //keep only the location
.mkString //convert to a string
.split("8020")(1) //adjust the split for your namenode host and port, etc.

Result:

String = /location/part_table
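Splitting on the port number is brittle. A sketch of an alternative, assuming the Location value is a well-formed URI, that lets java.net.URI strip the scheme and authority instead:

scala> :paste
val loc = spark.sql("desc formatted data_db.part_table")
.collect()
.filter(r => r(0).equals("Location"))
.map(r => r(1).toString)
.head
new java.net.URI(loc).getPath //e.g. /location/part_table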

Upvotes: 3

SantoshK

Reputation: 149

The df.inputFiles method in the DataFrame API returns the file paths backing the table. It is a best-effort snapshot of the files that compose this DataFrame.

spark.read.table("DB.TableName").inputFiles
Array[String] = Array(hdfs://test/warehouse/tablename)
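Note that inputFiles lists the files that back the table, not necessarily the table's root directory. A minimal sketch, assuming at least one entry is returned, for recovering the enclosing directory of the first entry:

val files = spark.read.table("DB.TableName").inputFiles
val dir = files.headOption.map(f => f.substring(0, f.lastIndexOf('/'))) //parent directory of the first file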

Upvotes: 3

knowone

Reputation: 840

You can use Spark's table-reading utility:

spark.read.table("myDB.myTable").select(input_file_name).take(1)

This will produce output like:

19/06/18 09:59:55 WARN util.Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
res1: Array[org.apache.spark.sql.Row] = Array([hdfs://nameservice1/my/path/to/table/store/part-00000-d439163c-9fc4-4768-8b0b-c963f5f7d3d2.snappy.parquet])

I used take(1) only to print a single row here; drop it if you want all the locations. From this result you can parse out just the location part of the string if that is all you need.
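For a partitioned table the part files live under several partition directories. A sketch for collecting the distinct parent directories instead of the individual files, assuming input_file_name is imported from org.apache.spark.sql.functions:

import org.apache.spark.sql.functions.input_file_name

spark.read.table("myDB.myTable")
.select(input_file_name)
.collect()
.map(_.getString(0))
.map(f => f.substring(0, f.lastIndexOf('/'))) //drop the part-file name
.distinct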

Upvotes: 2
