ValAyal

Reputation: 1219

Pandas - replace all NaN values in DataFrame with empty python dict objects

I have a pandas DataFrame where each cell contains a python dict.

>>> from pandas import DataFrame
>>> data = {'Q': {'X': {2: 2010}, 'Y': {2: 2011, 3: 2009}}, 'R': {'X': {1: 2013}}}
>>> frame = DataFrame(data)
>>> frame
                    Q          R
X           {2: 2010}  {1: 2013}
Y  {2: 2011, 3: 2009}        NaN

I'd like to replace the NaN with an empty dict, to get this result:

                    Q          R
X           {2: 2010}  {1: 2013}
Y  {2: 2011, 3: 2009}        {}

However, fillna interprets a dict not as a scalar fill value but as a mapping of column -> value, so passing an empty dict does nothing:

>>> frame.fillna(value={}, inplace=True)
>>> frame
                    Q          R
X           {2: 2010}  {1: 2013}
Y  {2: 2011, 3: 2009}        NaN
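
The column -> value interpretation can be seen on a small numeric frame. This is only a sketch: the frame df and its values are made up for illustration and are not the data from the question.

```python
import pandas as pd

# Small numeric frame with one NaN per column (illustrative data only).
df = pd.DataFrame({'Q': [1.0, float('nan')], 'R': [float('nan'), 2.0]})

# A dict passed as `value` is read as {column: fill value},
# so only column 'R' is filled here; the NaN in 'Q' stays.
only_r = df.fillna(value={'R': 0.0})

# An empty dict names no columns, so no cell is filled.
unchanged = df.fillna(value={})

print(only_r)
print(unchanged)
```

Since {} names no columns, fillna has nothing to fill, which matches the behaviour shown above.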

Is there any way to use fillna to accomplish what I want? Do I have to iterate through the entire DataFrame or construct a silly dict with all my columns mapped to empty dict?

Upvotes: 19

Views: 14709

Answers (6)

JDenman6

Reputation: 327

@Josh_Bode's answer helped me a lot. Here's a slightly different version. I used mask() instead of where() (a trivial change). I also changed how the empty dictionary is assigned: by creating a list of new dict instances as long as the frame and assigning that, I avoided the trap of many references to the same dict.

>>> data = {'Q': {'X': {2: 2010}, 'Y': {2: 2011, 3: 2009}}, 'R': {'X': {1: 2013}}}
>>> frame = DataFrame(data)
>>> frame
                    Q          R
X           {2: 2010}  {1: 2013}
Y  {2: 2011, 3: 2009}        NaN

>>> frame.mask(frame.isna(), lambda x: [{} for _ in range(len(frame))])
                    Q          R
X           {2: 2010}  {1: 2013}
Y  {2: 2011, 3: 2009}         {}
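
The trap mentioned above is ordinary Python aliasing and can be shown without pandas. A minimal sketch, where the names shared and independent are just for illustration:

```python
n = 3

# List multiplication repeats one reference: every slot is the same dict.
shared = [{}] * n
shared[0]['a'] = 1

# A comprehension evaluates {} once per element, giving independent dicts.
independent = [{} for _ in range(n)]
independent[0]['a'] = 1

print(shared)       # [{'a': 1}, {'a': 1}, {'a': 1}]
print(independent)  # [{'a': 1}, {}, {}]
```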

Upvotes: 0

Josh Bode

Reputation: 3742

DataFrame.where is a way of achieving this quite directly:

>>> data = {'Q': {'X': {2: 2010}, 'Y': {2: 2011, 3: 2009}}, 'R': {'X': {1: 2013}}}
>>> frame = DataFrame(data)
>>> frame
                    Q          R
X           {2: 2010}  {1: 2013}
Y  {2: 2011, 3: 2009}        NaN

>>> frame.where(frame.notna(), lambda x: [{}])
                    Q          R
X           {2: 2010}  {1: 2013}
Y  {2: 2011, 3: 2009}         {}

It also appears to be a bit faster than the applymap approach:

>>> %timeit frame.where(frame.notna(), lambda x: [{}])
791 µs ± 16.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit frame.applymap(lambda x: {} if isnull(x) else x)
1.07 ms ± 7.15 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

(on larger datasets I've observed speedups of ~10x)

Upvotes: 7

K3---rnc

Reputation: 7049

Use the .values accessor to assign into the underlying NumPy array directly:

frame.R = frame.R.astype(object)  # make sure the column has object dtype

frame.R.values[frame.R.isnull()] = {}

Upvotes: 2

ValAyal

Reputation: 1219

I was able to use DataFrame.applymap in this way:

>>> from pandas import isnull
>>> frame=frame.applymap(lambda x: {} if isnull(x) else x)
>>> frame
                    Q          R
X           {2: 2010}  {1: 2013}
Y  {2: 2011, 3: 2009}         {}

This solution avoids the pitfalls in both EdChum's solution (where all NaN cells wind up pointing at the same underlying dict object in memory, so they cannot be updated independently of one another) and Shashank's (where a potentially large nested-dict structure has to be built just to specify a single empty dict value).
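
The independence claim can be checked by mutating one filled cell and looking at another. This is a sketch under some assumptions: the frame below is illustrative (two NaN cells rather than one), fill_with_fresh_dicts is a hypothetical helper name, and it applies Series.map per column instead of calling applymap, since applymap is deprecated in newer pandas releases.

```python
import pandas as pd

# Illustrative frame with two NaN cells (not the data from the question).
frame = pd.DataFrame({'Q': {'X': {2: 2010}}, 'R': {'Y': {1: 2013}}})

def fill_with_fresh_dicts(df):
    # Hypothetical helper: the inner lambda builds a new {} on every call,
    # so each NaN cell receives its own dict object.
    return df.apply(lambda col: col.map(lambda x: {} if pd.isna(x) else x))

filled = fill_with_fresh_dicts(frame)

# Mutate the dict that replaced one NaN, then inspect the other one.
filled.at['Y', 'Q'][3] = 2009
print(filled.at['Y', 'Q'])
print(filled.at['X', 'R'])
```

If the two cells shared one dict, the second print would show the key added through the first cell.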

Upvotes: 18

Shashank Agarwal

Reputation: 2804

The problem is that when a dict is passed to fillna, it tries to fill the values based on the columns in the frame. So the first solution I tried was -

frame.fillna({column: {} for column in frame.columns})

But if a dictionary is provided at the second level like this, it tries to match the keys against the index, so the solution that worked was -

frame.fillna({column: {ind: {} for ind in frame.index} for column in frame.columns})

Which gives -

                    Q          R
X           {2: 2010}  {1: 2013}
Y  {2: 2011, 3: 2009}         {}

EdChum's answer is probably better for your needs, but this can be used when you don't want to make changes in place.

EDIT: The solution above works well for smaller frames, but can be a problem for larger frames. Using replace can solve that.

import numpy as np
frame.replace(np.nan, {column: {} for column in frame.columns})

Upvotes: 3

EdChum

Reputation: 394179

This works using loc:

In [6]:

frame.loc[frame['R'].isnull(), 'R'] = {}
frame
Out[6]:
                    Q          R
X           {2: 2010}  {1: 2013}
Y  {2: 2011, 3: 2009}         {}

Upvotes: 1
