dennislloydjr

Reputation: 960

Hive on EMR not reading all files at S3 location

I created a hive table using the following syntax, pointed to an S3 folder:

CREATE EXTERNAL TABLE IF NOT EXISTS daily_input_file (
        log_day STRING,
        resource STRING,
        request_type STRING,
        format STRING,
        mode STRING,
        count INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://my-bucket/my-folder';

When I execute a query, such as:

SELECT * FROM daily_input_file WHERE log_day IN ('20160508', '20160507');

I expect records to be returned, but none are.

I have verified that this data is contained in the files in that folder. In fact, if I copy the file that contains this particular data into a new folder, create a table for that new folder and run the query, I get the results. I also get results from other files (in fact from most files) within the original folder.

The contents of s3://my-bucket/my-folder are simple. There are no subdirectories within the folder. There are two varieties of file names (a and b). All are prefixed with the date they were created (YYYYMMDD_), and all have the extension .txt000.gz. Here are some examples:

So what might be going on? Is there a limit to the number of files within a single folder that can be processed from S3? Or is something else the culprit?

Here are the versions used:

Upvotes: 0

Views: 1238

Answers (1)

ChristopherB

Reputation: 2068

The behavior you are seeing with the S3 files is an issue specific to EMR release 4.7.0, not a limitation of EMR itself.

Use EMR release 4.7.1 or later.

http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-whatsnew.html
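
To check whether Hive is picking up every file, before or after changing the release, one option is to count rows per source file using Hive's INPUT__FILE__NAME virtual column. This is only a minimal sketch, assuming the daily_input_file table from the question; the aliases source_file and row_count are made up for illustration:

```sql
-- List each S3 object Hive actually read for the two dates in question,
-- with the number of matching rows it contributed.
-- INPUT__FILE__NAME is a built-in Hive virtual column holding the input file's path.
SELECT INPUT__FILE__NAME AS source_file,
       COUNT(*)          AS row_count
FROM daily_input_file
WHERE log_day IN ('20160508', '20160507')
GROUP BY INPUT__FILE__NAME
ORDER BY source_file;
```

If a file you know contains those dates is missing from the output, Hive is not reading it. Dropping the WHERE clause gives the full list of files Hive sees, which you can compare against a listing of the S3 folder.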

Upvotes: 1
