jhanv
jhanv

Reputation: 79

Dask Dataframe shape attribute is giving wrong shape

I'm trying to find the shape of a subset dataframe of a larger dask dataframe. But Instead of getting the right shape (# of rows), I'm getting a wrong value

In the example, I stored the first 3 rows into a new dataframe, when I'm trying to find the shape[0], the output is 4 rather than 3. Is there any way to solve this issue?

data = {'Name':['Tom', 'nick', 'nick', 'krish', 'jack', 'jack'], 'Age':[20, 21, 21, 19, 18, 18]}
df = pd.DataFrame(data)
ddf = dd.from_pandas(df, npartitions = 5)
print(ddf.shape[0].compute()) # --> Outputs 6
    
    
# Only selecting 3 rows
only_3 = ddf.loc[:3,:]
print(only_3.shape[0].compute()) # --> Outputs 4 (Instead of 3)

EDIT:

How did I miss that? Apologies about the bad example.

I was working on the real data of about 24700000 rows stored in dask dataframe (23 partitions) from a csv file. I create a sample dask dataframe by indexing .loc[:100,:] to the original dask dataframe, but when I tried to find the shape, I get 2323 as the number rows.

Can I know how this was calculated? How is the data distributed among all the partitions?

Upvotes: 2

Views: 451

Answers (1)

SultanOrazbayev
SultanOrazbayev

Reputation: 16581

The reason you observe a different number of rows is that .loc will select up to and including the index provided. So this line

only_3 = ddf.loc[:3,:] # this will select 4 rows

is selecting 4 rows, those with index 0,1,2, and 3.

This is based on the pandas API:

A slice object with labels 'a':'f' (Note that contrary to usual Python slices, both the start and the stop are included, when present in the index! See Slicing with labels and Endpoints are inclusive.)

Hence, your code appears to be correct in principle, just take note of this particular pandas-specific index slicing syntax.

Update: if the dask dataframe is constructed by reading a csv file (or in another way that does not generate unique index), then each partition will have its own index.

That means, that calling .loc[:3] will yield at most 4 rows from every partition. For example, if there are 5 partitions and each has 10 rows, then calling .loc[:4].compute() will yield a dataframe with 25 rows (thanks to @darthbith for the correction).

If this is not desirable, there is a way to generate a unique index for every row in the dask dataframe, see this answer.

Upvotes: 3

Related Questions