Reputation: 177
I am trying to parse out the Location from Hive partitioned table in Spark using this query:
val dsc_table = spark.sql("DESCRIBE FORMATTED data_db.part_table")
I was not able to find any query or any other way in Spark to specifically select Location column from this query.
Upvotes: 6
Views: 9231
Reputation: 527
Didnt find an answer in Pyspark
table_location = spark.sql("describe formatted DB.TableName").filter((F.col('col_name')=='Location')).select("data_type").toPandas().astype(str)['data_type'].values[0]
Upvotes: 2
Reputation: 31540
You can also use .toDF
method on desc formatted table
then filter from dataframe.
DataframeAPI:
scala> :paste
spark.sql("desc formatted data_db.part_table")
.toDF //convert to dataframe will have 3 columns col_name,data_type,comment
.filter('col_name === "Location") //filter on colname
.collect()(0)(1)
.toString
Result:
String = hdfs://nn:8020/location/part_table
(or)
RDD Api:
scala> :paste
spark.sql("desc formatted data_db.part_table")
.collect()
.filter(r => r(0).equals("Location")) //filter on r(0) value
.map(r => r(1)) //get only the location
.mkString //convert as string
.split("8020")(1) //change the split based on your namenode port..etc
Result:
String = /location/part_table
Upvotes: 3
Reputation: 149
df.inputFiles method in dataframe API will print file path. It returns a best-effort snapshot of the files that compose this DataFrame.
spark.read.table("DB.TableName").inputFiles
Array[String]: = Array(hdfs://test/warehouse/tablename)
Upvotes: 3
Reputation: 840
You can use spark's utility of table reading:
spark.read.table("myDB.myTable").select(input_file_name).take(1)
Will result in a string like:
19/06/18 09:59:55 WARN util.Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
res1: Array[org.apache.spark.sql.Row] = Array([hdfs://nameservice1/my/path/to/table/store/part-00000-d439163c-9fc4-4768-8b0b-c963f5f7d3d2.snappy.parquet])
I used take(1)
only to print one row to show the result here. You may not want to use it if you want all the locations.
From this result you can parse the string accordingly in case you want only the location part.
Upvotes: 2