Hive storage in Hadoop, interesting finding but don't understand

Question

Here is a finding on Hive/Hadoop. I have a table called titles, which I split into two tables: titles20000 and titles20000more. The row counts look good, but the data sizes look different. See the screenshot here, taken from the NameNode web UI by opening "host address:50070" in a browser:

Look at the Block Size: the original table titles has 4 blocks, while the two split tables have only 1 block each.

I also checked the data size another way, by showing the table properties in Hive:

I did a quick calculation on row counts:

n = titles: 443309
n1 = titles20000: 14781
n2 = titles20000more: 428528
n1 + n2 = 443309 = n
n1 / n = 3%
n2 / n = 97%

This is correct.

I then did another quick calculation on totalSize:

n = titles: 19934943
n1 = titles20000 (where emp_no < 20000): 624642
n2 = titles20000more (where emp_no >= 20000): 18423685
n1 + n2 = 19048327 < n
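
The two calculations can be re-checked with a short script. This is only a sketch using the figures quoted in this post; the variable names are made up for illustration, and the bytes-per-row value is an estimate derived here, not something Hive reports.

```python
# Row counts and totalSize values (assumed to be bytes) as quoted in this post.
rows = {"titles": 443309, "titles20000": 14781, "titles20000more": 428528}
size = {"titles": 19934943, "titles20000": 624642, "titles20000more": 18423685}

parts = ("titles20000", "titles20000more")
rows_sum = sum(rows[t] for t in parts)
size_sum = sum(size[t] for t in parts)
size_gap = size["titles"] - size_sum

# The row counts should add up exactly; the sizes do not.
print("rows add up:", rows_sum == rows["titles"])
print("size gap in bytes:", size_gap)

# Share of rows and average stored bytes per row for each table.
for name in rows:
    share = 100 * rows[name] / rows["titles"]
    per_row = size[name] / rows[name]
    print(f"{name}: {share:.1f}% of rows, {per_row:.2f} bytes/row")
```

The gap between the original table and the sum of the two split tables comes out to 886616 bytes, so the split tables store the same rows in slightly less space.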

This appears to match the earlier observation that the sizes differ. My questions are:

The original table titles uses four 128 MB blocks, yet the second split table titles20000more, which contains 97% of the rows, uses only one 128 MB block.

In the first screenshot, what's the meaning of Size (the 4th column)?

How could this happen?
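
For reference, here is a rough sketch of the block arithmetic. It assumes that totalSize is in bytes, that the block size is 128 MB (134217728 bytes), and that HDFS divides each file independently into blocks of at most that size. The helper name blocks_for_file is hypothetical, chosen just for this example.

```python
import math

# Assumption: the 128 MB block size mentioned in this post.
BLOCK_SIZE = 128 * 1024 * 1024


def blocks_for_file(size_bytes, block_size=BLOCK_SIZE):
    """Return how many blocks a single file of size_bytes would occupy."""
    if size_bytes == 0:
        return 0
    return math.ceil(size_bytes / block_size)


# totalSize values quoted in this post, treated as if each table were one file.
total_size = {
    "titles": 19934943,
    "titles20000": 624642,
    "titles20000more": 18423685,
}

block_counts = {name: blocks_for_file(s) for name, s in total_size.items()}
for name, count in block_counts.items():
    print(f"{name}: {count} block(s) if stored as a single file")
```

Under these assumptions each table is far smaller than one block, so a single file would need only 1 block in every case. A count of 4 for titles would then be consistent with the table directory holding several files, each in its own block, but that is only a guess from the numbers; listing the files in the table directory would confirm or refute it.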
