Reputation:
I have configured a Spark Streaming data pipeline. I would like to persist this data for a variety of goals:
exposing it to Tableau (this requires the Thrift Server, and the Thrift Server requires a HiveContext).
sometimes I would like to be able to update some of the data.
Where is the data kept by the HiveContext? In memory? On the local disk? Is it served by the Thrift Server?
Upvotes: 2
Views: 945
Reputation: 2924
You can persist your DataFrames from Spark to a Hive table by doing:
yourDataFrame.write.saveAsTable("YourTableName")
If you want to insert data into an existing table you can use:
yourDataFrame.write.mode(SaveMode.Append).saveAsTable("YourTableName")
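For the Spark Streaming part of the question, a minimal sketch could look like the one below. Event, eventStream and ssc are placeholder names for your own case class, DStream and StreamingContext; it simply appends every micro-batch to the Hive table:

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext

// Placeholder record type for the streamed data
case class Event(id: Long, value: String)

// Build the HiveContext once on the driver, from the StreamingContext's SparkContext
val hiveContext = new HiveContext(ssc.sparkContext)
import hiveContext.implicits._

// eventStream: DStream[Event] produced by your streaming source
eventStream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // Convert the micro-batch to a DataFrame and append it to the Hive table
    rdd.toDF().write.mode(SaveMode.Append).saveAsTable("YourTableName")
  }
}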
saveAsTable stores your DataFrame in a persistent Hive table. The location of this table depends on the configuration in your hive-site.xml.
By default, if you are testing locally, the table is stored on your local disk under /user/hive/warehouse/YourTableName.
If you are using Spark with Hive on YARN/HDFS, then the table is saved on HDFS at the location defined by the property hive.metastore.warehouse.dir in your hive-site.xml configuration file.
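If you would rather not edit hive-site.xml, the same property can also be set from code on the HiveContext, before the first table is written; the path below is only an example:

hiveContext.setConf("hive.metastore.warehouse.dir", "hdfs:///user/hive/warehouse")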
Hope that will help :)
Upvotes: 1
Reputation: 9
You can choose to cache the data in memory using
your_hive_context.cacheTable("table_name")
The Thrift Server has access to a global context that contains all the tables, even the temporary ones.
If you cache the table, Tableau will get query results faster, but you have to keep the Spark application running.
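As a sketch of how this fits together, assuming you start the Thrift Server from inside the same Spark application so that it shares your HiveContext (sc and yourDataFrame are placeholders, and yourDataFrame is assumed to have been created with this same context):

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val your_hive_context = new HiveContext(sc)

// Expose the DataFrame as a table and pin it in memory
yourDataFrame.registerTempTable("table_name")
your_hive_context.cacheTable("table_name")

// Start the Thrift Server inside this application, sharing the same context,
// so Tableau (over JDBC/ODBC) queries the cached table directly
HiveThriftServer2.startWithContext(your_hive_context)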
I have not yet found a way to update some of the data without opening a new HiveContext.
Upvotes: 0