Reputation: 960
I created a Hive table using the following syntax, pointing to an S3 folder:
CREATE EXTERNAL TABLE IF NOT EXISTS daily_input_file (
  log_day STRING,
  resource STRING,
  request_type STRING,
  format STRING,
  mode STRING,
  count INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://my-bucket/my-folder';
When I execute a query, such as:
SELECT * FROM daily_input_file WHERE log_day IN ('20160508', '20160507');
I expect records to be returned, but I get none.
I have verified that this data is contained in the files in that folder. In fact, if I copy the file that contains this particular data into a new folder, create a table for that new folder and run the query, I get the results. I also get results from most of the other files within the original folder.
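The comparison described above can be sketched in HiveQL roughly as follows. The table name daily_input_file_copy and the folder s3://my-bucket/my-folder-copy are hypothetical names chosen for illustration. The last query is an additional check, not part of the original test: it assumes Hive's INPUT__FILE__NAME virtual column is available and uses it to show which files in the original folder actually contribute rows.

```sql
-- Hypothetical table over a folder holding a copy of the single problem file.
-- The folder name my-folder-copy is an assumption for illustration.
CREATE EXTERNAL TABLE IF NOT EXISTS daily_input_file_copy (
  log_day STRING,
  resource STRING,
  request_type STRING,
  format STRING,
  mode STRING,
  count INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://my-bucket/my-folder-copy';

-- Same filter as the original query, run against the copy.
SELECT * FROM daily_input_file_copy WHERE log_day IN ('20160508', '20160507');

-- Count rows per source file in the original table to see
-- which files Hive actually reads from the folder.
SELECT INPUT__FILE__NAME AS source_file, COUNT(*) AS row_count
FROM daily_input_file
GROUP BY INPUT__FILE__NAME;
```

A file that is present in the folder but absent from the last result would suggest that it is being skipped when the table is read.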
The contents of s3://my-bucket/my-folder are simple. There are no subdirectories within my folder. There are two varieties of file name (a and b); all are prefixed with the date they were created (YYYYMMDD_) and all have the extension .txt000.gz. Here are some examples:
So what might be going on? Is there a limit to the number of files within a single folder that can be processed from S3? Or is something else the culprit?
Here are the versions used:
Upvotes: 0
Views: 1238
Reputation: 2068
The behavior you are seeing with the S3 files is caused by an issue in EMR release 4.7.0; it is not a limitation of EMR itself.
Use EMR release 4.7.1 or later.
http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-whatsnew.html
Upvotes: 1