Rougher

Reputation: 852

Spark persist function in reusing dataset

Let's say I created a dataset through different transformations (join, map, etc.) and saved it to table A in HBase. Now I want to save the same dataset to other tables in HBase, selecting specific columns each time. In this case, should I use the persist function after saving to table A? Or, if I only use the select function, does it not matter?

For example:

Dataset<Row> ds = //computing dataset by different transformations
//save ds to table A in hbase

ds.persist();

Dataset<Row> ds2 = ds.select(col("X"));
//save ds2 to table B in hbase

Dataset<Row> ds3 = ds.select(col("Y"),col("Z"));
//save ds3 to table C in hbase

ds.unpersist();

Upvotes: 2

Views: 2852

Answers (2)

Shaido

Reputation: 28322

Spark evaluates transformations lazily; in this case that means all the transformations will be redone for every action if you do not persist the data. Hence, if computing the dataset ds

Dataset<Row> ds = //computing dataset by different transformations

takes a long time, then it would absolutely be advantageous to persist the data. For best effect, I would recommend doing it before the first save (the save to table A). If the persisting is done after that save, all the reading of data and the transformations will be done twice.

Note that you should not call unpersist() until all actions on the dataset, and on any datasets derived from it, are done.
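As a minimal sketch of that ordering (saveToHBase is a hypothetical placeholder for however you write to HBase; it is not a real Spark API):

ds.persist();                    // mark ds for caching before the first action

saveToHBase(ds, "A");            // first action: materializes ds and fills the cache

Dataset<Row> ds2 = ds.select(col("X"));
saveToHBase(ds2, "B");           // served from the cache, ds is not recomputed

Dataset<Row> ds3 = ds.select(col("Y"), col("Z"));
saveToHBase(ds3, "C");           // also served from the cache

ds.unpersist();                  // only after all dependent actions have run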

Upvotes: 2

Debasish

Reputation: 113

You can do

Dataset<Row> ds = //computing dataset by different transformations
ds.persist();    
//save ds to table A in hbase

Dataset<Row> ds2 = ds.select(col("X"));
//save ds2 to table B in hbase

Dataset<Row> ds3 = ds.select(col("Y"),col("Z"));
//save ds3 to table C in hbase

ds.unpersist();

This way the full dataset is computed and persisted once, and the different column selections are then saved to their tables without recomputing ds.
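If memory is tight, you can also pass an explicit storage level (Dataset.persist() with no arguments defaults to MEMORY_AND_DISK), for example:

import org.apache.spark.storage.StorageLevel;

ds.persist(StorageLevel.MEMORY_AND_DISK());  // spill partitions to disk if they don't fit in memory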

Upvotes: 0
