lealvcon

Reputation: 158

What does the PySpark rdd command do?

I'm just getting started with PySpark and I have a question (maybe it's too easy, but I can't see it). I have a dataframe of animal species with the columns 'category', 'name' and 'status', and I'm using this command to obtain some info on the 'category' column:

df.groupBy('category').count().show()

yielding:

+-----------------+-----+
|         category|count|
+-----------------+-----+
|   Vascular Plant| 4470|
|             Bird|  521|
|           Mammal|  214|
|        Amphibian|   80|
|Nonvascular Plant|  333|
|             Fish|  127|
|          Reptile|   79|
+-----------------+-----+

then I used this line:

df.select('category').rdd.countByValue()

and got this:

defaultdict(int,
        {Row(category='Bird'): 521,
         Row(category='Reptile'): 79,
         Row(category='Fish'): 127,
         Row(category='Vascular Plant'): 4470,
         Row(category='Nonvascular Plant'): 333,
         Row(category='Amphibian'): 80,
         Row(category='Mammal'): 214})

So my question is: what does the 'rdd' part add to the code?

Upvotes: 0

Views: 52

Answers (2)

Huzefa Sadikot

Reputation: 581

An RDD (Resilient Distributed Dataset) is the logical representation of a dataset in Spark. It is stored across multiple machines, which can be separate servers in the case of a cluster. RDDs are immutable and can be recovered in case of a failure.

A dataset is data loaded externally by the user. It can come from any source, be it a database or a simple text file.
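As a minimal sketch of both points (assuming a local SparkSession; the text-file path is hypothetical), creating an RDD and seeing its immutability looks like this:

from pyspark.sql import SparkSession

# Build (or reuse) a local SparkSession.
spark = SparkSession.builder.master("local[*]").appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# An RDD can be created from data already in the driver...
numbers = sc.parallelize([1, 2, 3, 4])

# ...or loaded from an external source such as a text file (hypothetical path).
# lines = sc.textFile("species.txt")

# RDDs are immutable: map() returns a new RDD; the original is unchanged.
doubled = numbers.map(lambda x: x * 2)
print(numbers.collect())  # [1, 2, 3, 4]
print(doubled.collect())  # [2, 4, 6, 8]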

Please refer to the following link:

Spark Notes

Upvotes: 1

willwrighteng

Reputation: 3002

I believe you're converting the Spark DataFrame into an RDD object by accessing the .rdd attribute (it's an attribute, not a method call). This is why you get a defaultdict back (a subclass of dict) instead of a table.
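As a rough sketch of what that conversion looks like (df being your DataFrame from the question):

rdd = df.select('category').rdd  # converts the DataFrame to an RDD of Row objects
print(type(rdd))  # <class 'pyspark.rdd.RDD'>

# Each element is a Row, which is why countByValue() keys on Row objects.
# Extracting the plain string first gives cleaner keys:
df.select('category').rdd.map(lambda row: row['category']).countByValue()
# defaultdict(int, {'Bird': 521, 'Reptile': 79, ...})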

See this SO post for more details on how the conversion works.

Upvotes: 0
