Reputation: 422
Tachyon is a distributed, in-memory storage system, developed separately from Spark, that can be used as off-heap persistent storage by a Spark application.
Tungsten is a new Spark SQL component that provides more efficient Spark operations by working directly at the byte level. Since Tungsten no longer depends on working with Java objects, it can use either on-heap (in the JVM) or off-heap storage.
In off-heap mode, both reduce garbage collection overhead, since the data is not stored as Java objects.
So can I simply consider that Tachyon benefits general RDDs, whereas Spark SQL benefits from Tungsten?
Suppose the following code:
import org.apache.spark.storage.StorageLevel
val df = spark.range(10)
val rdd = df.rdd
df.persist(StorageLevel.OFF_HEAP) // in Tungsten format (bytes)?
df.show()
rdd.persist(StorageLevel.OFF_HEAP) // in Tachyon storage?
rdd.count()
Upvotes: 1
Views: 1567
Reputation: 250
Spark interacts with Alluxio and Tungsten for data at different stages.
For Spark, Alluxio is an external distributed storage system, like HDFS. Spark interacts with Alluxio through the filesystem interface (see the following example). It is essentially the same interface through which Spark accesses HDFS or the local filesystem, except that the storage service is provided by Alluxio, which may use memory as the storage medium.
// save data as a text file to Alluxio
rdd.saveAsTextFile("alluxio://localhost:19998/rdd1")
// read data as a text file from Alluxio
val textRdd = sc.textFile("alluxio://localhost:19998/rdd1")
// save data as an object file to Alluxio
textRdd.saveAsObjectFile("alluxio://localhost:19998/rdd2")
// read data as an object file from Alluxio (the element type must be given explicitly)
val objectRdd = sc.objectFile[String]("alluxio://localhost:19998/rdd2")
Spark interacts with Alluxio only when reading input data files and writing output files.
Tungsten is Spark's internal data representation, aimed at memory and CPU efficiency. Essentially, the default memory layout of JVM objects is considered inefficient for Spark applications because of its memory footprint and GC overhead (see the blog post on Project Tungsten from Databricks). Tungsten lets Spark process data directly in a binary format, without requiring the JVM to construct objects.
As a result, a Spark application may read input files from Alluxio (Alluxio sends Spark the bytes without understanding them), then parse the data and represent it inside Spark according to the format Tungsten defines.
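One way to peek at the internal representation described above is to look at the row objects Spark SQL uses behind a DataFrame. This is only a sketch: it assumes a local SparkSession (the app name and `local[*]` master are illustrative), and `queryExecution` is a developer-facing API whose output may differ between Spark versions.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .appName("tungsten-rows-sketch")
  .master("local[*]")
  .getOrCreate()

val df = spark.range(10)

// queryExecution.toRdd exposes rows in Spark SQL's internal form (InternalRow),
// as opposed to df.rdd, which converts them into user-facing Row objects.
val rowClasses = df.queryExecution.toRdd
  .map(_.getClass.getName)
  .distinct()
  .collect()

rowClasses.foreach(println)
```

The printed class names show which internal row implementation backs the query, which is the binary format Tungsten operates on rather than ordinary Java objects.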
Upvotes: 2
Reputation:
In short, both of your statements are incorrect:
OFF_HEAP storage doesn't use Alluxio anymore and instead uses Spark's internal off-heap store. See for example SPARK-16025.
For cached DataFrames, see the spark.sql.inMemoryColumnarStorage.* properties.
Upvotes: 1
Reputation: 772
Alluxio offers memory-speed read/write operations. Spark can read data from Alluxio (an in-memory storage system), which avoids input/output (I/O) from hard disk (any file system, such as HDFS, that sits on hard disk).
Tungsten is a backend optimization engine of Spark. Code written with the DataFrame/Dataset APIs or in Spark SQL is first optimized by the Catalyst optimizer into logical and optimized logical plans. Once this stage is over, the Tungsten engine takes over and generates code on the fly (called 'codegen') that is highly optimized for execution in a distributed environment.
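The stages described above can be inspected with `explain`. The following is a minimal sketch, assuming a local SparkSession (the app name and `local[*]` master are illustrative) and an arbitrary example query:

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .appName("explain-sketch")
  .master("local[*]")
  .getOrCreate()

// An arbitrary query, chosen only to give the optimizer something to plan.
val df = spark.range(10)
  .filter("id % 2 = 0")
  .selectExpr("id * 2 AS doubled")

// extended = true prints the parsed, analyzed, and optimized logical plans
// followed by the physical plan.
df.explain(true)
```

In the physical plan, operators prefixed with `*` are the ones compiled together by whole-stage code generation.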
To me, the two serve different purposes, and I would prefer to treat them separately.
Hope it helps to some extent.
Upvotes: 2