Reputation: 1271
I am using Spark 2.2.1, which has a useful option to specify how many records to save in each output file; this feature makes it possible to avoid a repartition before writing. However, it seems this option is honored only by the path-based writers (e.g. .orc(path)) and not by insertInto: written this way, the option is ignored
df.write.mode("overwrite")
.option("maxRecordsPerFile", 10000)
.insertInto(hive_table)
while written this way it works:
df.write.option("maxRecordsPerFile", 10000)
.mode("overwrite").orc(path_hive_table)
so I am writing the ORC files directly into the Hive warehouse folder of the specified table. The problem is that if I query the Hive table after the insertion, the data is not recognized by Hive. Do you know if there is a way to write partition files directly inside the Hive warehouse folder and also make them available through the Hive table?
Upvotes: 1
Views: 1400
Reputation: 1271
Adding to this, I also found out that the command MSCK REPAIR TABLE automatically discovers new partitions inside the Hive table folder.
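For reference, a minimal sketch of running it from Spark, assuming a SparkSession named spark built with Hive support enabled and a hypothetical table my_db.my_table:

# Hypothetical table name; requires a SparkSession created with enableHiveSupport()
spark.sql("MSCK REPAIR TABLE my_db.my_table")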
Upvotes: 0
Reputation: 1040
Debug steps:
1. Check the file format your Hive table expects:
SHOW CREATE TABLE table_name;
and look at "STORED AS". For better efficiency, save your output in Parquet and at the partition location (you can see it under "LOCATION" in the output of the query above). If the table is stored as some other specific format, write your files as that type.
2. If you are saving data into a partition by manually creating the partition folder, avoid that. Create the partition instead with (see the first sketch after this list):
ALTER TABLE {table_name} ADD PARTITION ({partition_column}={value});
3. After creating the output files in Spark, you can reload them and check for "_corrupt_record" (print the DataFrame and inspect it; see the second sketch after this list).
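For step 2, here is a minimal PySpark sketch of writing one partition's files and then registering the partition with the metastore, assuming a table my_db.my_table partitioned by a dt column and the warehouse path below (all names are hypothetical):

from pyspark.sql import SparkSession

# Hypothetical names throughout: my_db.my_table partitioned by dt,
# and the warehouse path below.
spark = (SparkSession.builder
         .appName("register-partition")
         .enableHiveSupport()
         .getOrCreate())

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

partition_path = "/user/hive/warehouse/my_db.db/my_table/dt=2018-01-01"

# Write one partition's files directly, honoring maxRecordsPerFile.
(df.write
    .option("maxRecordsPerFile", 10000)
    .mode("overwrite")
    .orc(partition_path))

# Register the partition in the metastore so Hive queries see the files,
# instead of relying on the folder alone.
spark.sql(
    "ALTER TABLE my_db.my_table ADD IF NOT EXISTS "
    "PARTITION (dt='2018-01-01') LOCATION '{0}'".format(partition_path))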
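And for step 3, building on the same session and path, a small sketch of reloading the written files; note that _corrupt_record is only exposed by sources such as JSON or CSV read in PERMISSIVE mode, so the column is checked for before filtering:

# Reload the files that were just written (hypothetical path as above).
reloaded = spark.read.orc(partition_path)

# _corrupt_record only exists for sources that expose it (e.g. JSON/CSV
# in PERMISSIVE mode), so check for the column before filtering on it.
if "_corrupt_record" in reloaded.columns:
    reloaded.filter(reloaded["_corrupt_record"].isNotNull()).show()
else:
    reloaded.show()  # eyeball the data when no corrupt-record column exists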
Upvotes: 1