egens

Reputation: 41

Concatenating Hive table after adding columns breaks Spark read

Given the following table manipulation – create a table with 2 columns and 2 rows, add a 3rd column, then insert a third row with 3 values:

CREATE TABLE concat_test(
  one string,
  two string
  )
STORED AS ORC;
INSERT INTO TABLE concat_test VALUES (1,1), (2,2);
ALTER TABLE concat_test ADD COLUMNS (three string);
INSERT INTO TABLE concat_test VALUES (3,3,3);
ALTER TABLE concat_test CONCATENATE;

I get an exception Caused by: java.lang.ArrayIndexOutOfBoundsException: 3 when I try to read the table with Spark:

spark.sql("select * from concat_test").collect()

It is obviously connected with the number of columns. I'm still investigating the problem on the ORC side. I haven't found a quick fix for such partitions, nor this bug described anywhere else. Is there one?

Could anyone try this on the latest Hadoop versions? Does the bug still exist?

Hive 1.2.1, Spark 2.3.2

UPD: I fixed my tables via Hive myself. Hive queries still work after this manipulation, so I created copies of the tables and did a select-insert of the old data into them.
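For reference, a rough sketch of that fix, run in Hive rather than Spark (the copy-table name concat_test_fixed is just an example; afterwards you would swap or rename the tables as needed):

-- recreate the table with the full, current column list
CREATE TABLE concat_test_fixed(
  one string,
  two string,
  three string
  )
STORED AS ORC;
-- rewrite the old data; rows inserted before the ALTER get NULL for column three
INSERT INTO TABLE concat_test_fixed SELECT one, two, three FROM concat_test;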

Upvotes: 1

Views: 451

Answers (2)

Matt Andruff

Reputation: 5125

Here's a workaround if you know that the underlying ORC files aren't in the correct format and you want to correct them.

import org.apache.spark.sql.types._ // StructType, StructField, StringType, IntegerType

val s = Seq(("apple","apples"),("car","cars"))                 // create data
val t = Seq(("apple",12),("apples", 50),("car",5),("cars",40)) // create data
val df1 = sc.parallelize(t).toDF("Sub_Cat", "Count")
val df2 = sc.parallelize(s).toDF("Main_Cat","Sub_Cat")
df1.write.format("orc").save("category_count")
df2.write.format("orc").save("categories")
// the full schema you actually want the data to have
val schema = StructType( Array( StructField("Main_Cat", StringType, nullable = true), StructField("Sub_Cat", StringType, nullable = true), StructField("Count", IntegerType, nullable = true)) )
// read the ORC files with the explicit schema instead of the one embedded in the files
val CorrectedSchema = spark.read.schema(schema).orc("category_count")
CorrectedSchema.show()

This helps to correct the schema into the format you intend. If you trust the Hive schema, you can use this cheat to get the schema (and reduce the typing):

val schema = spark.sql("select * from concat_test limit 0").schema
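From there you could, for example, point that schema at the table's underlying ORC files. The warehouse path below is only an illustration (check DESCRIBE FORMATTED concat_test for the real location):

// read the table's ORC files directly, using the schema taken from the Hive metastore
val fixed = spark.read.schema(schema).orc("/user/hive/warehouse/concat_test")
fixed.show()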

Upvotes: 0

Matt Andruff

Reputation: 5125

I have totally run into this issue before!

This is a known issue.

Hive only does schema-on-read, so there is no reason for it to treat this as a problem; it will happily let you define any table definition you want, and the data underlying the table does NOT get updated when you change that definition. Generally I have fixed the issue by rewriting the underlying ORC files to match the Hive definition. As a workaround you could also read the ORC files directly, since that issue has been fixed now.
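A minimal sketch of that repair, assuming the table sits at the default warehouse location (the paths and the separate output directory are illustrative, not part of the original answer; keep a backup before replacing files):

// schema as Hive currently defines the table
val hiveSchema = spark.sql("select * from concat_test limit 0").schema
// read the old ORC files directly; columns missing from older files come back as null
val raw = spark.read.schema(hiveSchema).orc("/user/hive/warehouse/concat_test")
// rewrite the data so every file carries all three columns, then point the table at it
raw.write.mode("overwrite").format("orc").save("/user/hive/warehouse/concat_test_repaired")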

Upvotes: 0
