Reputation: 1367
I am new to spark and trying to understand the difference between normal RDD and a pair RDD. What are the use-cases where a pair RDD is used as opposed to a normal RDD? If possible, I want to understand the internals of pair RDD with an example. Thanks
Upvotes: 15
Views: 16475
Reputation: 1
What is a Spark RDD? Apache Spark's core abstraction is the Resilient Distributed Dataset (RDD), the fundamental data structure of Spark. An RDD is an immutable, distributed collection of objects: each dataset is divided into logical partitions, and each partition may be computed on a different node of the cluster. RDDs can also contain objects of user-defined classes.
What is a Spark paired RDD? Paired RDDs are simply RDDs whose elements are key-value pairs. A key-value pair (KVP) consists of two linked data items: the key is the identifier, and the value is the data corresponding to that key. Spark operations work on RDDs containing any type of object, but key-value RDDs gain a few special operations, such as distributed “shuffle” operations and grouping or aggregating the elements by key. In Scala, these operations are automatically available on paired RDDs containing Tuple2 objects; they are defined in the PairRDDFunctions class, which wraps an RDD of tuples.
How do you create a Spark paired RDD? For example, in Python:
pairsRDD = lines.map(lambda x: (x.split(" ")[0], x))
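Once you have the pairs, keyed operations become available. A minimal Scala sketch of grouping and aggregating by key (the variable names are illustrative, not from the reference below):
val pairs = lines.map(x => (x.split(" ")(0), x))
// Group all the lines that share the same first word (a distributed shuffle).
val grouped = pairs.groupByKey()
// Count how many lines fall under each key.
val counts = pairs.mapValues(_ => 1).reduceByKey(_ + _)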
References: https://data-flair.training/blogs/spark-paired-rdd/
Upvotes: 0
Reputation: 5971
Spark paired RDDs are simply RDDs whose elements are key-value pairs.
Unpaired RDDs can consist of any type of object, while paired (key-value) RDDs gain a few special operations, such as distributed “shuffle” operations and grouping or aggregating the elements by a key.
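For instance, aggregating by key is only possible once the elements are pairs; a minimal Scala sketch (the data and names are made up):
val sales = sc.parallelize(Seq(("fruit", 3), ("veg", 2), ("fruit", 5)))  // (category, quantity)
// reduceByKey aggregates the values for each key across partitions (a shuffle).
val totalPerCategory = sales.reduceByKey(_ + _)  // ("fruit", 8), ("veg", 2)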
Upvotes: 1
Reputation: 130
Pair RDDs are RDDs of KEY/VALUE pairs.
Example: suppose you have a CSV file with details of the airports in each country. We create a normal RDD by reading that CSV from a path (columns: Airport ID, Name of airport, Main city served by airport, Country where airport is located):
JavaRDD<String> airports = sc.textFile("in/airports.text");
If we want an RDD with airport names and the country in which each is located, we have to create a pair RDD from the RDD above:
JavaPairRDD<String, String> airportsPairRDD = airports.mapToPair((PairFunction<String, String, String>) s -> {
    String[] fields = s.split(",");
    return new Tuple2<>(fields[1], fields[3]); // (airport name, country)
});
Upvotes: 0
Reputation: 320
The key differences are:
Pair RDD operations (such as reduceByKey or groupByKey) produce key/value pairs, whereas operations on a plain RDD (such as flatMap or reduce) give you a collection of values or a single value.
Pair RDD operations are applied to each key in parallel, while operations on a plain RDD (like flatMap) are applied to the whole collection; see the sketch below.
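A minimal Scala sketch of this contrast (names and data are illustrative):
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
// Pair RDD operation: aggregates per key, in parallel, and yields key/value pairs.
val perKey = pairs.reduceByKey(_ + _)      // ("a", 3), ("b", 3)
// Plain RDD operation: folds the whole collection into a single value.
val total = pairs.map(_._2).reduce(_ + _)  // 6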
Upvotes: 3
Reputation: 656
Spark provides special operations on RDDs containing key/value pairs. These RDDs are called pair RDDs. Pair RDDs are a useful building block in many programs, as they expose operations that allow you to act on each key in parallel or regroup data across the network. For example, pair RDDs have a reduceByKey() method that can aggregate data separately for each key, and a join() method that can merge two RDDs together by grouping elements with the same key. It is common to extract fields from an RDD (representing, for instance, an event time, customer ID, or other identifier) and use those fields as keys in pair RDD operations.
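A minimal Scala sketch of the two operations mentioned above (the data and names are made up):
val purchases = sc.parallelize(Seq((1, 20.0), (2, 5.0), (1, 7.5)))  // (customerId, amount)
val customers = sc.parallelize(Seq((1, "Alice"), (2, "Bob")))       // (customerId, name)
// reduceByKey aggregates the amounts separately for each customer.
val totals = purchases.reduceByKey(_ + _)  // (1, 27.5), (2, 5.0)
// join merges the two RDDs by grouping elements with the same key.
val joined = customers.join(totals)        // (1, ("Alice", 27.5)), (2, ("Bob", 5.0))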
Upvotes: 2
Reputation:
Pair RDD is just a way of referring to an RDD containing key/value pairs, i.e. tuples of data. It's not really a matter of using one as opposed to using the other. For instance, if you want to calculate something based on an ID, you'd group your input together by ID. This example just splits a line of text and returns a Pair RDD using the first word as the key [1]:
val pairs = lines.map(x => (x.split(" ")(0), x))
The Pair RDD that you end up with allows you to reduce values or to sort data based on the key, to name a few examples.
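For instance, building on the pairs above (a sketch, not taken from the linked chapter):
// Reduce the values for each key, e.g. keep the longest line per first word.
val longestPerKey = pairs.reduceByKey((a, b) => if (a.length >= b.length) a else b)
// Or sort the data based on the key, i.e. the first word of each line.
val sorted = pairs.sortByKey()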
It would probably do you good to read the link at the bottom, from which I shamelessly copied the example, since understanding Pair RDDs and how to work with tuples is fundamental to many of the things you will do in Spark. Read up on 'Transformations on Pair RDDs' to get a sense of what you would typically want to do once you have your pairs.
[1] https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch04.html
Upvotes: 3