Reputation: 4701
I have a Hive table that is partitioned by country. I want to load the data for specific partitions into my dataframe, as shown below:
df=spark.read.orc("/apps/hive/warehouse/emp.db/partition_load_table").where('country="NCL"' && 'county="RUS"')
It gives me an error, though I was able to load data for a single partition.
Below is my directory structure in HDFS:
/apps/hive/warehouse/emp.db/partition_load_table/country=NCL
df=spark.read.orc("/apps/hive/warehouse/emp.db/partition_load_table").where('country="NCL"')
Upvotes: 0
Views: 1348
Reputation: 425
Not sure why you don't just query the Hive table directly with Spark SQL (HiveContext):
spark.sql("select * from partition_load_table where country in ('NCL', 'RUS')")
If for some reason that is not available, you can union the underlying Hive partitions. First read them in as separate dataframes, then union them. Something like:
rus = spark.read.orc("/apps/hive/warehouse/emp.db/partition_load_table/country=rus")
ncl = spark.read.orc("/apps/hive/warehouse/emp.db/partition_load_table/country=ncl")
df = rus.union(ncl)
Upvotes: 1