Reputation: 158
I'm just getting started with PySpark and I have a question (maybe it's too easy, but I can't see it). I have a DataFrame of animal species with the columns 'category', 'name' and 'status', and I'm using this command to get some info on the 'category' column:
df.groupBy('category').count().show()
yielding:
+-----------------+-----+
| category|count|
+-----------------+-----+
| Vascular Plant| 4470|
| Bird| 521|
| Mammal| 214|
| Amphibian| 80|
|Nonvascular Plant| 333|
| Fish| 127|
| Reptile| 79|
+-----------------+-----+
Then I used this line:
df.select('category').rdd.countByValue()
and got this:
defaultdict(int,
{Row(category='Bird'): 521,
Row(category='Reptile'): 79,
Row(category='Fish'): 127,
Row(category='Vascular Plant'): 4470,
Row(category='Nonvascular Plant'): 333,
Row(category='Amphibian'): 80,
Row(category='Mammal'): 214})
So my question is: what does the 'rdd' part add to the code?
Upvotes: 0
Views: 52
Reputation: 581
An RDD is the logical representation of a dataset in Spark. It is partitioned across multiple machines (which can be separate servers in a cluster), it is immutable, and it can be recovered in case of a failure.
A dataset here just means data loaded by the user from an external source; it could come from anywhere, be it a database or a simple text file.
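As a rough sketch of the distinction (assuming an active SparkSession named spark; the toy rows below are made up just to mirror the question's columns), accessing .rdd simply exposes the DataFrame's underlying RDD of Row objects:
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()

# Hypothetical toy data mirroring the question's columns
df = spark.createDataFrame([
    Row(category='Bird', name='Osprey', status='Species of Concern'),
    Row(category='Mammal', name='Elk', status='Breeder'),
])

rdd = df.rdd                   # the underlying RDD of Row objects
print(rdd.first())             # Row(category='Bird', name='Osprey', status='Species of Concern')
print(rdd.getNumPartitions())  # the RDD is split into partitions across the cluster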
Please refer to the following link:
Upvotes: 1
Reputation: 3002
I believe you're converting the Spark DataFrame into an RDD by accessing its .rdd property. countByValue() is then an RDD action, which is why you get a defaultdict (a subclass of dict) back instead of a table.
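As a quick sketch of the difference (reusing the question's df and column name), groupBy(...).count() stays a DataFrame, while countByValue() is an RDD action that brings a plain Python dict back to the driver; mapping each Row to its value first gives ordinary string keys:
# DataFrame route: the result is still a (distributed) DataFrame
counts_df = df.groupBy('category').count()

# RDD route: countByValue() is an action that returns a local defaultdict,
# keyed by Row objects because each element of df.rdd is a Row
counts_rows = df.select('category').rdd.countByValue()

# Extracting the string from each Row first gives plain keys, e.g. {'Bird': 521, ...}
counts_plain = df.select('category').rdd.map(lambda r: r['category']).countByValue()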
See this SO post for more details on how the conversion works.
Upvotes: 0