user2949241

Reputation: 87

Unable to load data from Parquet files into a Hive external table

I have written the Scala code below to create a Parquet file:

scala> case class Person(name:String,age:Int,sex:String)
defined class Person

scala> val data = Seq(Person("jack",25,"m"),Person("john",26,"m"),Person("anu",27,"f"))
data: Seq[Person] = List(Person(jack,25,m), Person(john,26,m), Person(anu,27,f))

scala> import  sqlContext.implicits._
import sqlContext.implicits._

scala> import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.SaveMode

scala> val df = data.toDF()
df: org.apache.spark.sql.DataFrame = [name: string, age: int, sex: string]

scala> df.select("name","age","sex").write.format("parquet").mode("overwrite").save("sparksqloutput/person")

HDFS status:

[cloudera@quickstart ~]$ hadoop fs -ls sparksqloutput/person
Found 4 items
-rw-r--r--   1 cloudera cloudera          0 2017-08-14 23:03 sparksqloutput/person/_SUCCESS
-rw-r--r--   1 cloudera cloudera        394 2017-08-14 23:03 sparksqloutput/person/_common_metadata
-rw-r--r--   1 cloudera cloudera        721 2017-08-14 23:03 sparksqloutput/person/_metadata
-rw-r--r--   1 cloudera cloudera        773 2017-08-14 23:03 sparksqloutput/person/part-r-00000-2dd2f334-1985-42d6-9dbf-16b0a51e53a8.gz.parquet

Then I created an external Hive table using the command below:

hive> CREATE EXTERNAL TABLE person (name STRING,age INT,sex STRING) STORED AS PARQUET LOCATION '/sparksqlouput/person/';
OK
Time taken: 0.174 seconds
hive> select * from person
    > ;
OK
Time taken: 0.125 seconds

But when I run the above select query, no rows are returned. Could someone please help with this?

Upvotes: 0

Views: 1328

Answers (1)

Shubhangi

Reputation: 2254

In general, the Hive statement 'select * from <table>' simply locates the table's directory (its LOCATION) and reads the data files stored in that HDFS directory.

In your case, 'select *' returns nothing, which means the table's location is not correct.

Please note that the last statement in your Scala code contains save("sparksqloutput/person"), where "sparksqloutput/person" is a relative path. HDFS resolves it against your home directory, so it expands to "/user/<logged in username>/sparksqloutput/person" (i.e. "/user/cloudera/sparksqloutput/person").
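This resolution behavior can be illustrated with plain local paths (an analogy, not HDFS itself); "/user/cloudera" here is an assumption matching the quickstart VM user:

```scala
import java.nio.file.Paths

// A relative path is resolved against a base directory.
// In HDFS that base is the user's home, /user/<logged in username>.
val home = Paths.get("/user/cloudera")               // assumed HDFS home directory
val resolved = home.resolve("sparksqloutput/person") // relative path from the save() call
println(resolved)                                    // /user/cloudera/sparksqloutput/person
```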

Hence, when creating the Hive table you should use "/user/cloudera/sparksqloutput/person" instead of "/sparksqloutput/person". The path "/sparksqloutput/person" does not exist (note also the typo in your CREATE TABLE statement: '/sparksqlouput/person/' is missing a "t"), which is why 'select * from person' returned no rows.
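Concretely, assuming the Parquet files really landed under the cloudera user's HDFS home directory as shown in your hadoop fs -ls output, the corrected table definition would be (the LOCATION path is the only change):

```sql
CREATE EXTERNAL TABLE person (name STRING, age INT, sex STRING)
STORED AS PARQUET
LOCATION '/user/cloudera/sparksqloutput/person/';
```

After recreating the table with this location, 'select * from person' should return the three rows written by Spark.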

Upvotes: 1
