hummingBird

Reputation: 2555

Hive Partitioned, Bucketed and Sorted table - multiple inserts

Hello and apologies for a long text,

We use Hive 1.2.0 and have a table partitioned by date, bucketed by customer ID, and sorted by customer ID as well. (I know it shouldn't be sorted by the same column it's bucketed on, but it made sense to do it.)
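For concreteness, a layout like the one described could be declared as follows (the table and column names are my assumptions, not from the original setup):

```sql
-- Hypothetical DDL matching the described layout: partitioned by date,
-- bucketed and sorted by customer id, 128 buckets
CREATE TABLE customer_events (
  customer_id   BIGINT,
  event_payload STRING
)
PARTITIONED BY (event_date STRING)
CLUSTERED BY (customer_id) SORTED BY (customer_id) INTO 128 BUCKETS
STORED AS ORC;
```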

When we tested performance, we used an already existing table, so each partition had 128 files (the table has 128 buckets). The performance impact was great.

I noticed, however, that in the real-time environment every insert creates an additional 128 files. I was expecting Hive to insert into the already existing files. If we fill the table hourly, we can then expect 24 x 128 files per partition, all sorted - but the data is no longer globally sorted, only sorted per file. I'm sure that still counts for something, but performance is going to take a hit. The table has ~100M daily rows, and due to subsequent / late entries, the actual number of inserts into one partition can be larger than 24 - it can be 30-40.

We're currently looking at running weekly compaction jobs, but it's really depressing :). Initially this was a Spark pipeline; it then required Hive inserts from a "Spark staging table", since Spark cannot write to such a table, and now we're looking at the possibility of adding a compaction job on top...
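For reference, one way such a compaction job could look is an `INSERT OVERWRITE` of a partition from itself, which rewrites the many small per-insert files back into 128 sorted bucket files (table and column names are assumptions; depending on the Hive version, reading from and overwriting the same partition may not be allowed, in which case the data has to be staged in a copy of the table first):

```sql
-- Make Hive enforce the declared bucketing/sorting on write (needed on Hive 1.2)
SET hive.enforce.bucketing = true;
SET hive.enforce.sorting = true;

-- Rewrite one day's partition: collapses N x 128 files back to 128 sorted buckets
INSERT OVERWRITE TABLE customer_events PARTITION (event_date = '2019-01-01')
SELECT customer_id, event_payload
FROM customer_events
WHERE event_date = '2019-01-01';
```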

Any comments / suggestions?

Thanks in advance!

Upvotes: 1

Views: 474

Answers (1)

Strick

Reputation: 1642

I did not quite get what you are exactly trying to ask, but going by your question:

  • It is a good idea that you created sorted buckets on customer ID; it will certainly give performance optimizations, and mainly in joins such as sort-merge-bucket (SMB) joins and bucketed map joins you will see great improvement.
  • It is known behaviour that if you insert into a bucketed table, Hive will create new bucket files (with a `_copy` suffix in their names); it will not edit or update the existing bucket files, but instead creates new files with the new values.

    If you are still at the design phase of your application, I would suggest you create one more partition clause based on a technical partition key, which can be your run ID. For example, if you run your process hourly, you can create the new partition based on the hour, or on any unique ID you can generate.

So your directory structure will be like

table_name/<date>/<hour>/bucketes_files
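The extra technical partition could be declared like this (a sketch; the table, column, and partition names are my assumptions):

```sql
-- Hypothetical DDL with an additional technical partition column for the run hour
CREATE TABLE customer_events (
  customer_id   BIGINT,
  event_payload STRING
)
PARTITIONED BY (event_date STRING, run_hour STRING)
CLUSTERED BY (customer_id) SORTED BY (customer_id) INTO 128 BUCKETS
STORED AS ORC;

-- Each hourly load then targets its own partition, e.g.:
-- INSERT OVERWRITE TABLE customer_events
--   PARTITION (event_date = '2019-01-01', run_hour = '05')
-- SELECT ... ;
```

This way each hourly run writes exactly 128 fully sorted files into its own partition, instead of appending `_copy` files to an existing one.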

There is no way you can edit the existing buckets.

The other workaround is to create, alongside your main table, a temporary table that holds your data for one day or so. Combine this table with the new data and INSERT OVERWRITE the main table. That way your buckets will hold all the customer_ids fully sorted, and there will be no copy files, since we are insert-overwriting the data into the existing table.
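Reading that workaround as "union the existing day's data with the new batch and rewrite the partition", a sketch could look like this (all table and column names are assumptions for illustration):

```sql
SET hive.enforce.bucketing = true;
SET hive.enforce.sorting = true;

-- Rewrite the day's partition from the old data plus the new hourly batch,
-- so the result is again 128 fully sorted bucket files with no _copy files
INSERT OVERWRITE TABLE main_table PARTITION (event_date = '2019-01-01')
SELECT customer_id, event_payload
FROM (
  SELECT customer_id, event_payload
  FROM daily_temp_table
  WHERE event_date = '2019-01-01'
  UNION ALL
  SELECT customer_id, event_payload
  FROM new_batch_table
) combined;
```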

Upvotes: 1
