Reputation: 7732
I'm building my company's new data lake and trying to find the best, most up-to-date option for it. I found what looks like a good solution: EMR + S3 + Athena + Glue.
The process I followed was:
1 - Ran an Apache Spark script to generate 30 million rows in S3, partitioned by date and stored as ORC.
2 - Ran an Athena query to create the external table.
3 - Checked the table from EMR connected to the Glue Data Catalog, and it worked perfectly. Both Spark and Hive were able to access it.
4 - Generated another 30 million rows in a different folder, also partitioned by date and in ORC format.
5 - Ran the Glue Crawler, which identified the new table and added it to the Data Catalog. Athena was able to query it, but Spark and Hive were not. See the exceptions below:
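For reference, the statement from step 2 looked roughly like the one this small Python sketch builds. The database, table, column types, partition column and bucket below are hypothetical placeholders, not the real ones; only `audit_id` appears in the errors further down, and its type here is an assumption.

```python
def build_orc_table_ddl(database, table, columns, partition_col, location):
    """Return a CREATE EXTERNAL TABLE statement for ORC data in S3."""
    cols = ",\n  ".join(f"`{name}` {ctype}" for name, ctype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"  {cols}\n"
        f")\n"
        f"PARTITIONED BY (`{partition_col}` string)\n"
        f"STORED AS ORC\n"
        f"LOCATION '{location}'\n"
        f"TBLPROPERTIES ('orc.compress'='SNAPPY')"
    )


# All names below are made up for illustration.
ddl = build_orc_table_ddl(
    "datalake",
    "audit_events",
    [("audit_id", "bigint"), ("payload", "string")],
    "dt",
    "s3://example-bucket/audit_events/",
)
print(ddl)
```

After a statement like this, the partitions still have to be registered (for example with MSCK REPAIR TABLE) before the data shows up in queries.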
Spark
Caused by: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.hive.ql.io.orc.OrcStruct
Hive
Error: java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: Error evaluating audit_id (state=,code=0)
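I checked whether there was a serialisation problem and found this. The first set of properties is from the table created through Athena, the second from the table created by the Glue Crawler: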
Input format org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
Output format org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Serde serialization lib org.apache.hadoop.hive.ql.io.orc.OrcSerde
orc.compress SNAPPY
Input format org.apache.hadoop.mapred.TextInputFormat
Output format org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Serde serialization lib org.apache.hadoop.hive.ql.io.orc.OrcSerde
So the new table can't be read from Hive or Spark, although it works in Athena. I already tried changing the configuration, but it had no effect in Hive or Spark.
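The comparison above can also be done programmatically. Here is a minimal sketch, assuming the table metadata is available as a dict shaped like the `StorageDescriptor` that the Glue API returns for a table (`InputFormat`, `OutputFormat`, `SerdeInfo.SerializationLibrary`). The function name and the two sample dicts are my own, filled in with the values listed above.

```python
ORC_INPUT = "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat"
ORC_OUTPUT = "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat"
ORC_SERDE = "org.apache.hadoop.hive.ql.io.orc.OrcSerde"


def find_orc_mismatches(storage_descriptor):
    """List the format fields that disagree with an ORC SerDe.

    Returns an empty list when the table does not use the ORC SerDe
    or when both formats are the ORC classes.
    """
    serde = storage_descriptor.get("SerdeInfo", {}).get("SerializationLibrary")
    if serde != ORC_SERDE:
        return []  # not an ORC table, nothing to compare
    expected = {"InputFormat": ORC_INPUT, "OutputFormat": ORC_OUTPUT}
    return [
        f"{key}: {storage_descriptor.get(key)}"
        for key, want in expected.items()
        if storage_descriptor.get(key) != want
    ]


# Sample descriptors built from the two property listings above.
consistent_table = {
    "InputFormat": ORC_INPUT,
    "OutputFormat": ORC_OUTPUT,
    "SerdeInfo": {"SerializationLibrary": ORC_SERDE},
}
mismatched_table = {
    "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
    "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
    "SerdeInfo": {"SerializationLibrary": ORC_SERDE},
}

print(find_orc_mismatches(consistent_table))
print(find_orc_mismatches(mismatched_table))
```

A text input format paired with the ORC SerDe would be consistent with the Spark error, where a `Text` record is handed to code expecting an `OrcStruct`.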
Has anyone faced this problem?
Upvotes: 15
Views: 2887
Reputation: 7732
Well, a few weeks after I posted this question, AWS fixed the problem. As I showed above, the problem was real: it was a bug in Glue.
Glue is a new product and still has some problems at times.
But this one was solved properly. See the properties of the table now:
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
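If someone hits a similar mismatch on a table that has not been corrected, one possible workaround is to rewrite the storage descriptor so it carries the three classes shown above. This is an untested sketch, not what AWS did: it assumes the descriptor is a dict shaped like the Glue API's `StorageDescriptor`, and the function name and sample location are hypothetical.

```python
import copy

ORC_INPUT = "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat"
ORC_OUTPUT = "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat"
ORC_SERDE = "org.apache.hadoop.hive.ql.io.orc.OrcSerde"


def with_orc_formats(storage_descriptor):
    """Return a copy of the descriptor with ORC input/output formats and SerDe.

    The original dict is left untouched; all other keys are preserved.
    """
    fixed = copy.deepcopy(storage_descriptor)
    fixed["InputFormat"] = ORC_INPUT
    fixed["OutputFormat"] = ORC_OUTPUT
    fixed.setdefault("SerdeInfo", {})["SerializationLibrary"] = ORC_SERDE
    return fixed


# Hypothetical descriptor with the mismatched formats from the question.
broken = {
    "Location": "s3://example-bucket/new_table/",
    "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
    "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
    "SerdeInfo": {"SerializationLibrary": ORC_SERDE},
}
fixed = with_orc_formats(broken)
print(fixed["InputFormat"])
print(fixed["OutputFormat"])
```

The patched descriptor would then have to be written back to the catalog, for example through the Glue `update_table` call. Partitions carry their own storage descriptors, so they may need the same treatment.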
Upvotes: 4