vikrant rana

Reputation: 4701

Filtering a Hive partitioned table in PySpark

I have a Hive table partitioned by country. I want to load the data for specific partitions into my DataFrame, as shown below:

df=spark.read.orc("/apps/hive/warehouse/emp.db/partition_load_table").where('country="NCL"' && 'country="RUS"')

It gives me an error, though I was able to load a single partition (shown below).

Below is my directory structure in HDFS:

/apps/hive/warehouse/emp.db/partition_load_table/country=NCL

df=spark.read.orc("/apps/hive/warehouse/emp.db/partition_load_table").where('country="NCL"')

Upvotes: 0

Views: 1348

Answers (1)

Tim

Reputation: 425

Not sure why you don't just query the Hive table directly using Spark SQL (via HiveContext or a Hive-enabled SparkSession):

spark.sql("select * from partition_load_table where country in ('NCL', 'RUS')")

If for some reason that is not available, you can union the underlying Hive partitions: first read them in as separate DataFrames, then union them. Something like:

rus = spark.read.orc("/apps/hive/warehouse/emp.db/partition_load_table/country=RUS")
ncl = spark.read.orc("/apps/hive/warehouse/emp.db/partition_load_table/country=NCL")
df = rus.union(ncl)
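One caveat with reading partition directories directly: the country column is lost, because it exists only in the directory names. A hedged alternative, assuming a Spark version whose ORC reader accepts a list of paths, is to set the basePath option so partition discovery still runs:

# basePath tells Spark these paths are partitions of one table, so
# "country" is reconstructed as a column in the resulting DataFrame.
df = (spark.read
      .option("basePath", "/apps/hive/warehouse/emp.db/partition_load_table")
      .orc(["/apps/hive/warehouse/emp.db/partition_load_table/country=NCL",
            "/apps/hive/warehouse/emp.db/partition_load_table/country=RUS"]))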

Upvotes: 1
