PasLeChoix

Reputation: 311

Hive storage in Hadoop: an interesting finding that I don't understand

Here is a finding on Hive/Hadoop. I have a table called titles, which I split into two parts: titles20000 and titles20000more. The row counts look good, but the data sizes look different. See this screenshot from the NameNode web UI, opened by typing "host address:50070" in a browser: [screenshot]

Look at the Block Size column: the original table titles has 4 blocks, while the two split tables have only 1 each.

I also checked the data size another way, by executing show property in Hive: [screenshot]

I did a quick calculation on row counts:

n = titles: 443309
n1 = titles20000: 14781
n2 = titles20000more: 428528
n = n1 + n2 = 443309
% of n1 =  3%
% of n2 = 97%

This is correct.

I then did another quick calculation on totalSize:

n = titles: 19934943
n1 = titles20000 (where emp_no < 20000): 624642
n2 = titles20000more (where emp_no >= 20000): 18423685
n1 + n2 = 19048327 < n
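The two calculations above can be double-checked with a short script. This is only a sketch: the figures are copied from the question, totalSize is assumed to be in bytes, and the variable names are made up for illustration.

```python
# Figures copied from the question: row counts and Hive totalSize (assumed bytes).
tables = {
    "titles": {"rows": 443309, "total_size": 19934943},
    "titles20000": {"rows": 14781, "total_size": 624642},
    "titles20000more": {"rows": 428528, "total_size": 18423685},
}

parts = ["titles20000", "titles20000more"]

# Sum the two split tables and compare them with the original table.
rows_sum = sum(tables[p]["rows"] for p in parts)
size_sum = sum(tables[p]["total_size"] for p in parts)
size_gap = tables["titles"]["total_size"] - size_sum

print("rows match:", rows_sum == tables["titles"]["rows"])
print("bytes missing after the split:", size_gap)

# Average bytes per row, to see whether the split tables are stored more compactly.
for name, t in tables.items():
    print(name, "bytes per row:", round(t["total_size"] / t["rows"], 2))
```

The row counts add up exactly, while the sizes of the two split tables fall a little short of the original.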

This matches the previous observation. My questions are:

The original table titles uses 4 blocks of 128 MB.
The second split table, titles20000more, contains 97% of the rows and yet uses only 1 block of 128 MB.

In the first screenshot, what's the meaning of Size (the 4th column)?

How can this happen?

Upvotes: 0

Views: 31

Answers (1)

Ben Watson

Reputation: 5521

Size is the actual size of the data.

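Block Size is the configured maximum size of each HDFS block the data is stored in. A block only takes up as much disk space as the data it actually holds, so a small file in a 128 MB block does not consume 128 MB.

Your original table uses four blocks because its data was created by a map-only job using four mappers, each of which wrote its own file. When the data was copied into the other tables, it appears to have been merged into a single file, which fits in a single block.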
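To make the block arithmetic concrete, here is a minimal sketch. It assumes the 128 MB block size from the question and relies on the fact that HDFS blocks are never shared between files. The four-way split of the original table is hypothetical (the real file sizes are not given), and the function names are made up for illustration.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the block size mentioned in the question


def blocks_for_file(size_bytes, block_size=BLOCK_SIZE):
    """Number of HDFS blocks a single file occupies (ceiling division)."""
    return -(-size_bytes // block_size)


def blocks_for_table(file_sizes, block_size=BLOCK_SIZE):
    """Total blocks for a table directory; each file gets its own blocks."""
    return sum(blocks_for_file(size, block_size) for size in file_sizes)


# titles20000more: assumed to be stored as one file of 18423685 bytes.
single_file_blocks = blocks_for_table([18423685])

# titles: hypothetical even split of 19934943 bytes across four mapper output files.
total = 19934943
four_files = [total // 4] * 3 + [total - 3 * (total // 4)]
four_file_blocks = blocks_for_table(four_files)

print("titles20000more blocks:", single_file_blocks)
print("titles blocks:", four_file_blocks)
```

Under these assumptions the block count follows the number of files rather than the amount of data, since every file here is far smaller than one block.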

Upvotes: 1
