Reputation: 808
I'm using Amazon EMR. I have some log data in S3, all in the same bucket but under different subdirectories, like:
I'm using these settings:
Set hive.mapred.supports.subdirectories=true;
Set mapred.input.dir.recursive=true;
When trying to load all data from "s3://bucketname/2014/08/":
CREATE EXTERNAL TABLE table1(id string, at string,
custom struct<param1:string, param2:string>)
LOCATION 's3://bucketname/2014/08/';
In return I get:
Time taken: 0.169 seconds
When trying to query the table:
SELECT * FROM table1 LIMIT 10;
I get:
Failed with exception Not a file: s3://bucketname/2014/08/01
Does anyone have an idea how to solve this?
Upvotes: 2
Views: 5826
Reputation: 1521
This works now (May 2018)
A global, EMR-wide fix is to set the following in /etc/spark/conf/spark-defaults.conf:
spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive true
hive.mapred.supports.subdirectories true
Or it can be fixed locally, as in the following PySpark code:
from pyspark.sql import SparkSession

# Enable recursive directory listing for this session only.
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .enableHiveSupport() \
    .config("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive", "true") \
    .config("hive.mapred.supports.subdirectories", "true") \
    .getOrCreate()  # the builder chain must end here to create the session
Upvotes: 4
Reputation: 808
It's an EMR-specific problem. Here is what I got from Amazon support:
Unfortunately Hadoop does not recursively check the subdirectories of Amazon S3 buckets. The input files must be directly in the input directory or Amazon S3 bucket that you specify, not in sub-directories. According to this document ("Are you trying to recursively traverse input directories?") Looks like EMR does not support recursive directory at the moment. We are sorry about the inconvenience.
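To make the difference concrete, here is a minimal sketch of flat versus recursive listing. It is only a local-filesystem analogy, not Hadoop or S3 code: the temporary directory, the "01" folder, and the events.log file are all hypothetical stand-ins for the day directories under s3://bucketname/2014/08/.

```python
import os
import tempfile

def list_flat(root):
    """Non-recursive listing: every immediate child of root, files or not."""
    return sorted(os.listdir(root))

def list_recursive(root):
    """Recursive listing: paths of regular files only, relative to root."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.append(os.path.relpath(os.path.join(dirpath, name), root))
    return sorted(found)

with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout standing in for s3://bucketname/2014/08/01/<file>
    os.makedirs(os.path.join(root, "01"))
    with open(os.path.join(root, "01", "events.log"), "w") as handle:
        handle.write("example line\n")

    flat = list_flat(root)
    # Entries a flat reader would be handed that are not actually files
    not_files = [n for n in flat if not os.path.isfile(os.path.join(root, n))]
    recursive = list_recursive(root)

print("flat listing:", flat)
print("entries that are not files:", not_files)
print("recursive listing:", recursive)
```

A reader that takes the flat listing as its input set is handed the directory "01" itself, which is the same situation the "Not a file" exception reports; the recursive listing descends and returns only the log file.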
Upvotes: 5
Reputation: 2305
The problem is the way you have specified the location.
The Hive external table expects files to be present at this location, but it contains folders.
Try specifying a path like
You need to provide the path all the way down to the directory that contains the files.
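One way to point Hive at each leaf directory without writing a table per day is to register every day folder as a partition. The sketch below only generates the HiveQL text, and it rests on assumptions not stated in the question: that the table is declared with PARTITIONED BY (day string), and that the log files sit directly inside each s3://bucketname/2014/08/DD/ folder. The helper names are hypothetical.

```python
from datetime import date, timedelta

def day_locations(bucket, year, month):
    """Yield (day, location) pairs, one per calendar day of the month."""
    current = date(year, month, 1)
    while current.month == month:
        yield current.day, f"s3://{bucket}/{current:%Y/%m/%d}/"
        current += timedelta(days=1)

def add_partition_statements(table, bucket, year, month):
    """Build one ALTER TABLE ... ADD PARTITION statement per day directory.

    Assumes the table was created with PARTITIONED BY (day string).
    """
    return [
        f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION (day='{day:02d}') "
        f"LOCATION '{location}';"
        for day, location in day_locations(bucket, year, month)
    ]

statements = add_partition_statements("table1", "bucketname", 2014, 8)
print("\n".join(statements))
```

The generated statements can be pasted into the Hive CLI. Days that have no folder in the bucket would simply yield empty partitions, so you may prefer to trim the list to the folders that actually exist.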
Upvotes: 1