Reputation: 73366
I am reading Spark: RDD operations and I am executing the following:
In [7]: lines = sc.textFile("data")
In [8]: lines.getNumPartitions()
Out[8]: 1000
In [9]: lineLengths = lines.map(lambda s: len(s))
In [10]: lineLengths.getNumPartitions()
Out[10]: 1000
In [11]: len(lineLengths.collect())
Out[11]: 508524
but I would expect my dataset to be split into parts. How many? As many as the number of partitions, i.e. 1000.
Then map() would run on every partition and return a local result (which should then be reduced). If that were the case, I would expect lineLengths, once collected into a list of numbers, to have length equal to the number of partitions, which is not the case.
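To illustrate what I had in mind, the per-partition behaviour I expected would look roughly like this (just a sketch; mapPartitions is my guess at how one would actually get one local result per partition):

perPartitionSums = lines.mapPartitions(lambda it: [sum(len(s) for s in it)])  # one local sum per partition
perPartitionSums.getNumPartitions()  # still 1000
len(perPartitionSums.collect())      # 1000, i.e. one value per partition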
What am I missing?
Upvotes: 0
Views: 60
Reputation: 11573
len(lineLengths.collect()) or lineLengths.count() tells you the number of rows in your RDD. lineLengths.getNumPartitions(), as you noted, is the number of partitions your RDD is distributed over. Each partition of the RDD contains many rows.
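If you want to see how those rows are spread over the partitions, you can do roughly the following (a sketch using glom, which gathers each partition into a list):

rowsPerPartition = lineLengths.glom().map(len).collect()
len(rowsPerPartition)   # 1000, one entry per partition
sum(rowsPerPartition)   # 508524, the total number of rows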
Upvotes: 2