Reputation: 1653
I'm trying to understand Hive internals. Which class/method does Hive use to determine the size of a dataset in S3?
Upvotes: 0
Views: 93
Reputation: 5213
Hive is built on top of Hadoop and uses Hadoop's FileSystem API for input/output.
More precisely, each table has an InputFormat and an OutputFormat, configurable when you create the table, that read and write data through a FileSystem object (https://hadoop.apache.org/docs/r2.6.1/api/org/apache/hadoop/fs/FileSystem.html).
The FileSystem object abstracts most aspects of file management, so Hive does not have to care whether a file is on S3 or HDFS; the Hadoop FileSystem layer takes care of that.
When dealing with files, each file has a path that is a URL (for instance, hdfs:///dir/file or s3://bucket/path).
The Path class resolves the filesystem with its getFileSystem method; for an S3 URL this would be S3FileSystem (or another S3-backed implementation, depending on the URL scheme and the configuration).
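To make the dispatch idea concrete, here is a toy model of picking a backend from the URL scheme. It is only a sketch: SchemeDispatchSketch, SizeSource and the fixed sizes are hypothetical names and values made up for illustration, not Hadoop's actual classes or behaviour.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

public class SchemeDispatchSketch {

    /** Hypothetical stand-in for a filesystem backend; not a Hadoop type. */
    public interface SizeSource {
        long lengthOf(URI file);
    }

    // Registry from URL scheme ("hdfs", "s3", ...) to the backend that serves it.
    private final Map<String, SizeSource> backends = new HashMap<>();

    public void register(String scheme, SizeSource backend) {
        backends.put(scheme, backend);
    }

    /** Picks the backend from the URL scheme, then asks it for the file length. */
    public long lengthOf(String location) {
        URI uri = URI.create(location);
        SizeSource backend = backends.get(uri.getScheme());
        if (backend == null) {
            throw new IllegalArgumentException("No backend for scheme: " + uri.getScheme());
        }
        return backend.lengthOf(uri);
    }

    public static void main(String[] args) {
        SchemeDispatchSketch fs = new SchemeDispatchSketch();
        // Fake backends returning fixed sizes, purely for illustration.
        fs.register("hdfs", uri -> 1024L);
        fs.register("s3", uri -> 2048L);
        System.out.println(fs.lengthOf("hdfs:///dir/file"));
        System.out.println(fs.lengthOf("s3://bucket/path"));
    }
}
```

The caller never names a backend; only the URL decides which one answers, which is why the same Hive code can work against HDFS or S3.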
From the FileSystem object, Hive can get a FileStatus for the file and read its size with the getLen method.
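As a rough sketch of that call chain with the Hadoop API (not a copy of what Hive does internally): the class name DatasetSize and both helper methods are made up for this example, and it assumes hadoop-common plus the connector for your S3 scheme are on the classpath. getContentSummary is shown as one way to total a whole directory, such as a table location.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DatasetSize {

    /** Size in bytes of a single file, read from its FileStatus. */
    public static long fileLength(Path file, Configuration conf) throws IOException {
        // The FileSystem implementation is chosen from the URL scheme of the path.
        FileSystem fs = file.getFileSystem(conf);
        FileStatus status = fs.getFileStatus(file);
        return status.getLen();
    }

    /** Total bytes of all files under a directory, for example a table location. */
    public static long directoryLength(Path dir, Configuration conf) throws IOException {
        FileSystem fs = dir.getFileSystem(conf);
        ContentSummary summary = fs.getContentSummary(dir);
        return summary.getLength();
    }

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // args[0] is a location URL supplied by the caller.
        Path location = new Path(args[0]);
        System.out.println(directoryLength(location, conf) + " bytes");
    }
}
```

The same code works for a local, HDFS or S3 path, since only the URL passed in changes.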
If you want to see where in the Hive source this is done, it is usually in org.apache.hadoop.hive.ql.io.CombineHiveInputFormat, which is the default setting for hive.input.format.
Upvotes: 1