Reputation: 141
In Apache Spark, what are the differences between these APIs? Why and when should we choose one over the others?
Upvotes: 3
Views: 5434
Reputation: 15539
First, let's define what Spark does
Simply put, what it does is execute operations on distributed data. Thus, the operations also need to be distributed. Some operations are simple, such as filtering out all items that don't respect some rule. Others are more complex, such as groupBy, which needs to move data around, and join, which needs to associate items from 2 or more datasets.
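To make this concrete, here is a minimal spark-shell style sketch (the datasets and column names are made up) contrasting a purely local operation with two that shuffle data:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ops-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical data: purchases (userId, amount) and users (userId, country).
val purchases = Seq((1, 10.0), (2, 5.0), (1, 42.0)).toDF("userId", "amount")
val users     = Seq((1, "FR"), (2, "DE")).toDF("userId", "country")

// Simple operation: filter runs on each partition independently, no data movement.
val big = purchases.filter($"amount" > 8.0)

// Complex operations: groupBy and join need to move (shuffle) data between workers.
val totals = big.groupBy($"userId").sum("amount")
val joined = totals.join(users, "userId")
```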
Another important fact is that input and output are stored in different formats, and Spark has connectors to read and write them. But that means serializing and deserializing the data. While transparent, serialization is often the most expensive operation.
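For example (hypothetical paths), both ends of a job go through such connectors, each with its own serialization work:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("io-sketch").master("local[*]").getOrCreate()

// Reading deserializes the input format, writing serializes into the output format;
// both happen behind Spark's connectors.
val df = spark.read.parquet("/tmp/events.parquet")    // hypothetical input
df.write.mode("overwrite").json("/tmp/events_json")   // hypothetical output
```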
Finally, Spark tries to keep data in memory for processing, but it will serialize/deserialize data locally on each worker when it doesn't fit. Once again, this is done transparently but can be costly. Interesting fact: estimating the data size can itself take time.
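You can also ask for that behaviour explicitly; a sketch with a made-up dataset, where partitions that don't fit in memory are spilled (serialized) to local disk on each worker:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("cache-sketch").master("local[*]").getOrCreate()

val df = spark.range(0, 1000000).toDF("id")   // hypothetical dataset

// Keep in memory when possible, spill to local disk otherwise (with the ser/deser cost).
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()   // first action materializes the cache
```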
The APIs
RDD

It's the first API provided by Spark. To put it simply, it is an unordered sequence of Scala/Java objects distributed over a cluster. All operations executed on it are JVM methods (passed to map, flatMap, groupBy, ...) that need to be serialized, sent to all workers, and applied to the JVM objects there. This is pretty much the same as using a Scala Seq, but distributed. It is strongly typed, meaning that "if it compiles then it works" (if you don't cheat). However, lots of distribution issues can arise, especially if Spark doesn't know how to (de)serialize the JVM classes and methods.
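A minimal spark-shell style sketch of that, with a made-up Purchase class; the lambdas below are plain JVM functions that get serialized, shipped to the workers, and applied to the deserialized objects:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rdd-sketch").master("local[*]").getOrCreate()
val sc = spark.sparkContext

case class Purchase(userId: Int, amount: Double)   // hypothetical JVM class

val rdd = sc.parallelize(Seq(Purchase(1, 10.0), Purchase(2, 5.0), Purchase(1, 42.0)))

val totals = rdd
  .filter(_.amount > 8.0)          // compile-time checked: amount is a Double
  .map(p => (p.userId, p.amount))  // plain Scala objects and tuples
  .reduceByKey(_ + _)              // shuffles, like groupBy

totals.collect().foreach(println)
```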
Dataframe

It came after RDD and is semantically very different. The data are considered as tables, and SQL-like operations can be applied to them. It is not typed at all, so errors can arise at any time during execution. However, there are, I think, 2 pros: (1) many people are used to the table/SQL semantics and operations, and (2) Spark doesn't need to deserialize the whole row to process one of its columns, if the data format provides suitable column access. And many do, such as Parquet, the most commonly used file format.
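A small sketch with made-up column names; the commented line is the kind of mistake that compiles fine and only blows up at runtime, while the select illustrates why column access matters with a format like Parquet (only the needed column has to be read):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("df-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, 10.0), (2, 5.0), (1, 42.0)).toDF("userId", "amount")  // hypothetical table

// Table/SQL-style operation: with a columnar source, only "amount" would be read.
df.select(sum($"amount")).show()

// Untyped: this compiles, but fails at runtime with an AnalysisException ("amont" doesn't exist).
// df.select($"amont").show()
```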
Dataset

It is an improvement of Dataframe that brings some type safety. A Dataset is a Dataframe to which we associate an "encoder" related to a JVM class, so Spark can check that the data schema is correct before executing the code. Note, however, that although you can sometimes read that Datasets are strongly typed, they are not: they bring some type safety, in that you cannot compile code that uses a Dataset with a type other than the one declared. But it is very easy to write code that compiles yet still fails at runtime, because many Dataset operations lose the type (pretty much everything apart from filter). Still, it is a huge improvement, because even when we make a mistake, it fails fast: the failure happens when Spark interprets the DAG (i.e. at start) instead of during data processing.
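A sketch of that behaviour with a hypothetical Purchase case class: the filter is checked at compile time, select drops back to an untyped Dataframe, and attaching the encoder to data with a mismatched schema fails when the plan is analyzed, before any processing starts:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ds-sketch").master("local[*]").getOrCreate()
import spark.implicits._

case class Purchase(userId: Int, amount: Double)   // hypothetical schema

// The encoder for Purchase is derived from the case class via spark.implicits._
val ds = Seq(Purchase(1, 10.0), Purchase(2, 5.0)).toDS()

// Typed operation: the compiler checks that `amount` exists and is a Double.
val big = ds.filter(_.amount > 8.0)

// But many operations lose the type: select returns an untyped DataFrame again.
val untypedAgain = big.select($"amount")

// Fail fast: if the file's schema doesn't match Purchase, this throws when the plan
// is analyzed, not in the middle of the job (hypothetical path).
// val checked = spark.read.parquet("/tmp/purchases.parquet").as[Purchase]
```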
Note: Dataframes are now simply untyped Datasets (Dataset<Row>).
Note 2: Dataset provides the main RDD API, such as map and flatMap. From what I know, it is a shortcut that converts to an RDD, applies map/flatMap, then converts back to a Dataset. It's practical, but it also hides the conversion, making it hard to realize that a possibly costly ser/deser-ialization happened.
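A sketch of the two styles on the same hypothetical Dataset: the column expression can stay in Spark's internal representation, while the typed map has to materialize each element as a JVM object first:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("map-sketch").master("local[*]").getOrCreate()
import spark.implicits._

case class Purchase(userId: Int, amount: Double)   // hypothetical schema
val ds = Seq(Purchase(1, 10.0), Purchase(2, 5.0)).toDS()

// Column-oriented version: works on the serialized/columnar representation.
val v1 = ds.select(($"amount" * 1.2).as("amountWithTax"))

// RDD-style version: each row is deserialized into a Purchase, the lambda is applied,
// and the result is re-serialized -- convenient, but the ser/deser is hidden.
val v2 = ds.map(p => p.amount * 1.2)
```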
Pros and cons
Dataset:
- pro: column-oriented operations, so Spark only needs to deserialize the columns an operation actually uses
- pro: schema and type mistakes fail fast, when the DAG is interpreted
- con: most operations lose the type, so compile-time safety is limited

Dataframe:
- pro: familiar table/SQL semantics and operations, with the same column-oriented optimizations
- con: completely untyped, so errors can show up at any time during execution

RDD:
- pro: strongly typed ("if it compiles then it works"), and any JVM code can be applied
- con: operations work on whole JVM objects, so everything gets serialized/deserialized, and distribution issues (e.g. (de)serialization of classes and methods) are easy to hit
Conclusion
Just use Dataset by default: you keep the column-oriented access and optimizations of Dataframe, and the encoder check makes schema mistakes fail fast.
There are cases where what you want to code would be too complex to express using Dataset operations. Most apps won't hit this, but it often happens in my work, where I implement complex mathematical models. In that case, drop down to the RDD API for that part of the processing and convert back afterwards, accepting the ser/deser cost it implies (see the sketch below).
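A sketch of that pattern (the model and the case classes are made up): keep the Dataset API around the edges and confine the RDD detour, with its conversions, to the complex part only:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("mixed-sketch").master("local[*]").getOrCreate()
import spark.implicits._

case class Point(x: Double, y: Double)                 // hypothetical model input
case class Scored(x: Double, y: Double, score: Double) // hypothetical model output

// Some "complex model" that is awkward to express with Dataset operations.
def complexModel(p: Point): Scored = Scored(p.x, p.y, math.sqrt(p.x * p.x + p.y * p.y))

val points = Seq(Point(0.0, 1.0), Point(2.0, 3.0)).toDS()

// Drop to the RDD API for the complex part, then come back to a Dataset.
// Both conversions imply ser/deser, so keep this island as small as possible.
val scored = points.rdd.map(complexModel).toDS()
```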
Upvotes: 11
Reputation: 2091
In short:
I use dataframes and highly recommend them: Spark's optimizer, Catalyst, understands datasets (and, as such, dataframes) better, and a Row is a better storage container than a pure JVM object. You will find a lot of blog posts (including Databricks') on the internals.
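One easy way to see Catalyst at work (hypothetical columns): explain(true) prints the analyzed, optimized and physical plans it produced for a query:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("catalyst-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, 10.0), (2, 5.0)).toDF("userId", "amount")   // hypothetical data

// Catalyst rewrites the logical plan (e.g. pushes filters down, prunes columns)
// before picking a physical plan; explain(true) shows each stage.
df.filter($"amount" > 8.0).groupBy($"userId").agg(sum($"amount")).explain(true)
```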
Upvotes: 0