How does Hive determine the size of input data?

Question

I'm trying to understand Hive internals. Which class or method does Hive use to determine the size of a dataset in S3?

Roberto Congiu · Accepted Answer

Hive is built on top of Hadoop and uses Hadoop's FileSystem API for input and output. More precisely, each table has an InputFormat and an OutputFormat, configurable when you create the table, that read and write data through a FileSystem object (https://hadoop.apache.org/docs/r2.6.1/api/org/apache/hadoop/fs/FileSystem.html). The FileSystem object abstracts most aspects of file management, so Hive does not have to care whether a file is on S3 or HDFS; the Hadoop FileSystem layer takes care of that. Each file is identified by a path that is a URI (for instance, hdfs:///dir/file or s3://bucket/path). The Path class resolves the matching filesystem through its getFileSystem method, which would be S3FileSystem for an s3:// URL (s3n:// and s3a:// URLs map to NativeS3FileSystem and S3AFileSystem respectively). From the FileSystem object, Hive can obtain a FileStatus for each file and read its size in bytes with the getLen method.
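The Path, FileSystem, FileStatus chain described above can be sketched as follows. This is a minimal illustration, not Hive's own code: the class name InputSizeSketch, the helper inputSize, and the default location file:///tmp are hypothetical, and it assumes hadoop-common (2.x or later) is on the classpath.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper class; it only illustrates the calls named in the answer.
public class InputSizeSketch {

    // Returns the length in bytes of a single file, or the sum of the lengths
    // of the files directly under a directory (not recursive).
    static long inputSize(String location, Configuration conf) throws IOException {
        Path path = new Path(location);

        // The URI scheme (file, hdfs, s3a, ...) selects the FileSystem implementation.
        FileSystem fs = path.getFileSystem(conf);

        long total = 0L;
        // For a plain file, listStatus returns one entry describing that file.
        for (FileStatus status : fs.listStatus(path)) {
            if (status.isFile()) {
                total += status.getLen(); // size in bytes, as reported by the filesystem
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Assumed default location; pass a real URI as the first argument.
        String location = args.length > 0 ? args[0] : "file:///tmp";
        System.out.println(inputSize(location, new Configuration()));
    }
}
```

The same method works unchanged for an hdfs:// or S3 location, provided the matching filesystem implementation and credentials are configured. For a recursive total, Hadoop's FileSystem also offers getContentSummary(path).getLength().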

If you want to see where in the Hive source this is done, it is usually in org.apache.hadoop.hive.ql.io.CombineHiveInputFormat, which is the default setting for hive.input.format.
