Reputation: 79
I'm trying to find the shape of a subset dataframe taken from a larger dask dataframe. But instead of getting the correct shape (number of rows), I'm getting the wrong value.
In the example, I stored the first 3 rows in a new dataframe, but when I check shape[0], the output is 4 rather than 3. Is there any way to solve this issue?
import pandas as pd
import dask.dataframe as dd

data = {'Name': ['Tom', 'nick', 'nick', 'krish', 'jack', 'jack'],
        'Age': [20, 21, 21, 19, 18, 18]}
df = pd.DataFrame(data)
ddf = dd.from_pandas(df, npartitions=5)
print(ddf.shape[0].compute())  # --> Outputs 6
# Only selecting 3 rows
only_3 = ddf.loc[:3, :]
print(only_3.shape[0].compute())  # --> Outputs 4 (instead of 3)
EDIT:
How did I miss that? Apologies about the bad example.
I was working on real data with about 24,700,000 rows stored in a dask dataframe (23 partitions) read from a csv file. I created a sample dask dataframe by indexing the original dask dataframe with .loc[:100,:], but when I tried to find the shape, I got 2323 as the number of rows.
Can I know how this was calculated? How is the data distributed among all the partitions?
Upvotes: 2
Views: 451
Reputation: 16581
The reason you observe a different number of rows is that .loc selects up to and including the index label provided. So this line
only_3 = ddf.loc[:3,:] # this will select 4 rows
selects 4 rows: those with index 0, 1, 2, and 3.
This follows the pandas API:
A slice object with labels 'a':'f' (Note that contrary to usual Python slices, both the start and the stop are included, when present in the index! See Slicing with labels and Endpoints are inclusive.)
Hence, your code is correct in principle; just take note of this pandas-specific label-based slicing behavior.
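A minimal pandas-only illustration of the difference (if you want exactly the first 3 rows by position, .iloc gives the usual exclusive Python-slice semantics):

```python
import pandas as pd

df = pd.DataFrame({'Age': [20, 21, 21, 19, 18, 18]})

# .loc slices by label and INCLUDES the stop label
print(df.loc[:3].shape[0])   # --> 4 (labels 0, 1, 2, 3)

# .iloc slices by position and EXCLUDES the stop, like regular Python slices
print(df.iloc[:3].shape[0])  # --> 3 (positions 0, 1, 2)
```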
Update: if the dask dataframe is constructed by reading a csv file (or in another way that does not generate a unique index), then each partition will have its own index starting at 0.
That means calling .loc[:3]
will yield at most 4 rows from every partition. For example, if there are 5 partitions and each has 10 rows, then calling .loc[:4].compute()
will yield a dataframe with 25 rows (thanks to @darthbith for the correction).
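The arithmetic above can be sketched with plain pandas, simulating what happens per partition (the dask internals are more involved; this only mimics the per-partition index overlap):

```python
import pandas as pd

# Simulate 5 partitions, each with its own RangeIndex 0..9,
# as produced when dask reads a csv without setting an index
parts = [pd.DataFrame({'x': range(10)}) for _ in range(5)]

# Applying .loc[:4] per partition keeps labels 0..4 (5 rows) in EACH one
selected = pd.concat(p.loc[:4] for p in parts)
print(len(selected))  # --> 25, not 5
```

This is why the 23-partition csv in the question returned 2323 rows for .loc[:100] instead of 101.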
If this is not desirable, there is a way to generate a unique index for every row in the dask dataframe; see this answer.
Upvotes: 3