Nisba

Reputation: 3448

Cannot add a column (pandas `Series`) to a Dask `DataFrame` without introducing `NaN`

I am constructing a Dask DataFrame from a numpy array and after this I would like to add a column from a pandas Series.

Unfortunately the resulting dataframe contains NaN values, and I am not able to understand where the error lies.

from dask.dataframe.core import DataFrame as DaskDataFrame
import dask.dataframe as dd
import pandas as pd
import numpy as np

xy = np.random.rand(int(3e6), 2)
c = pd.Series(np.random.choice(['a', 'b', 'c'], int(3e6)), dtype='category')

# alternative 1 -> many of the x, y values are NaN
table: DaskDataFrame = dd.from_array(xy, columns=['x', 'y'])
table['c'] = dd.from_pandas(c, npartitions=1)
print(table.compute())

# alternative 2 -> many of the c values are NaN
table: DaskDataFrame = dd.from_array(xy, columns=['x', 'y'])
table['c'] = dd.from_pandas(c, npartitions=table.npartitions)
print(table.compute())

Any help is appreciated.

Upvotes: 0

Views: 77

Answers (1)

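The NaN values come from a mismatch between the elements of c and xy that end up in each partition. Column assignment lines the two objects up by index and partition, so rows without a matching counterpart are filled with NaN. You can try using dd.from_pandas instead of dd.from_array to create the Dask DataFrame, so that both objects are built from the same pandas index: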

import numpy as np
import pandas as pd
import dask.dataframe as dd

n = int(3e6)
xy = np.random.rand(n, 2)
c = pd.Series(np.random.choice(['a', 'b', 'c'], n), dtype='category')

npartitions = 60  # arbitrary choice; the Series below reuses it via table.npartitions
table = dd.from_pandas(pd.DataFrame(xy, columns=['x', 'y']), npartitions=npartitions)
table['c'] = dd.from_pandas(c, npartitions=table.npartitions)
print(table.compute())

which returns:

                x         y  c
0        0.488121  0.568258  b
1        0.090625  0.459087  b
2        0.563856  0.193026  a
3        0.333338  0.220935  c
4        0.769926  0.195786  a
...           ...       ... ..
2999995  0.241800  0.114924  b
2999996  0.462755  0.567131  c
2999997  0.473718  0.481577  b
2999998  0.424875  0.937403  c
2999999  0.189081  0.793600  c
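
To illustrate why misaligned rows turn into NaN, here is a minimal pandas-only sketch of index alignment on column assignment. As far as I understand, Dask follows the same alignment rule. The tiny frame and the shifted index are made up for illustration and are not taken from the question's data.

```python
import pandas as pd

# A small frame with the default index 0..3.
frame = pd.DataFrame({"x": [0.1, 0.2, 0.3, 0.4]})

# A Series whose index is shifted by two (hypothetical, to force a mismatch).
shifted = pd.Series(["a", "b", "c", "a"], index=[2, 3, 4, 5])

# Assigning a Series aligns on the index: rows 0 and 1 have no partner
# in `shifted`, so they become NaN.
misaligned = frame.assign(c=shifted)

# Assigning the raw values ignores the index and matches rows by position.
positional = frame.assign(c=shifted.to_numpy())

print(misaligned)
print(positional)
```

The same reasoning suggests another option worth trying: add the column to the pandas DataFrame first and call dd.from_pandas once, so the question of alignment between two separately partitioned objects never comes up.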

Upvotes: 1
