Incrementally adding to a Hive table w/Scala + Spark 1.3

Question

Our cluster has Spark 1.3 and Hive There is a large Hive table that I need to add randomly selected rows to. There is a smaller table that I read and check a condition, if that condition is true, then I grab the variables I need to then query for the random rows to fill. What I did was do a query on that condition, table.where(value, then make it an array by using take(num rows). Then since all of these rows contain the information I need on which random rows are needed from the large hive table, I iterate through the array.



When I do the query I use ORDER BY RAND() in the query (using sqlContext).  I created a var  Hive table ( to be mutable) adding a column from the larger table. In the loop, I do a unionAll newHiveTable = newHiveTable.unionAll(random_rows)

I have tried many different ways to do this, but am not sure what is the best way to avoid CPU and temp disk use.  I know that Dataframes aren't intended for incremental adds.
One thing I have though now to try is to create a cvs file, write the random rows to that file incrementally in the loop, then when the loop is finished, load the cvs file as a table, and do one unionAll to get my final table.

Any feedback would be great. Thanks

user1314742 · Accepted Answer

I would recommend that you create an external table with hive, defining the location, and then let spark write the output as csv to that directory:

in Hive:

create external table test(key string, value string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'
LOCATION '/SOME/HDFS/LOCATION'

And then from spark with the aide of https://github.com/databricks/spark-csv , write the dataframe to csv files and appending to the existing ones:

df.write.format("com.databricks.spark.csv").save("/SOME/HDFS/LOCATION/", SaveMode.Append)

Incrementally adding to a Hive table w/Scala + Spark 1.3

Answers (1)

Related Questions