Demitri

Reputation: 14039

Creating a Pandas DataFrame from a NumPy masked array?

I am trying to create a Pandas DataFrame from a NumPy masked array, which I understand is a supported operation. This is an example of the source array:

import numpy.ma as ma
import pandas as pd

a = ma.array([(1, 2.2), (42, 5.5)],
             dtype=[('a',int),('b',float)],
             mask=[(True,False),(False,True)])

which outputs as:

masked_array(data=[(--, 2.2), (42, --)],
             mask=[( True, False), (False,  True)],
       fill_value=(999999, 1.e+20),
            dtype=[('a', '<i8'), ('b', '<f8')])

Attempting to create a DataFrame with pd.DataFrame(a) returns:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-40-a4c5236a3cd4> in <module>
----> 1 pd.DataFrame(a)

/usr/local/anaconda/lib/python3.8/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    636             # a masked array
    637             else:
--> 638                 data = sanitize_masked_array(data)
    639                 mgr = ndarray_to_mgr(
    640                     data,

/usr/local/anaconda/lib/python3.8/site-packages/pandas/core/construction.py in sanitize_masked_array(data)
    452     """
    453     mask = ma.getmaskarray(data)
--> 454     if mask.any():
    455         data, fill_value = maybe_upcast(data, copy=True)
    456         data.soften_mask()  # set hardmask False if it was True

/usr/local/anaconda/lib/python3.8/site-packages/numpy/core/_methods.py in _any(a, axis, dtype, out, keepdims, where)
     54     # Parsing keyword arguments is currently fairly slow, so avoid it for now
     55     if where is True:
---> 56         return umr_any(a, axis, dtype, out, keepdims)
     57     return umr_any(a, axis, dtype, out, keepdims, where=where)
     58 

TypeError: cannot perform reduce with flexible type

Is this operation indeed supported? Currently using Pandas 1.3.3 and NumPy 1.20.3.

Update

Is this supported? According to the Pandas documentation here:

Alternatively, you may pass a numpy.MaskedArray as the data argument to the DataFrame constructor, and its masked entries will be considered missing.

The code above was my way of asking "What will I get?" if I pass a NumPy masked array to Pandas. The behavior described in the documentation (masked entries treated as missing) is the result I was hoping for. The example above was the simplest one I could come up with.

I do expect each Series/column in Pandas to be of a single type.

Update 2

Anyone interested in this should probably see this Pandas GitHub issue; it's noted there that Pandas has "deprecated support for MaskedRecords".

Upvotes: 1

Views: 1715

Answers (2)

hpaulj

Reputation: 231385

If the array has a simple dtype, the dataframe creation works (as documented):

In [320]: a = np.ma.array([(1, 2.2), (42, 5.5)],
     ...:    mask=[(True,False),(False,True)])
In [321]: a
Out[321]: 
masked_array(
  data=[[--, 2.2],
        [42.0, --]],
  mask=[[ True, False],
        [False,  True]],
  fill_value=1e+20)
In [322]: import pandas as pd
In [323]: pd.DataFrame(a)
Out[323]: 
      0    1
0   NaN  2.2
1  42.0  NaN

Here a has shape (2,2), and the result has 2 rows and 2 columns.

With the compound dtype, the shape is 1d:

In [326]: a = np.ma.array([(1, 2.2), (42, 5.5)],
     ...:              dtype=[('a',int),('b',float)],
     ...:              mask=[(True,False),(False,True)])
In [327]: a.shape
Out[327]: (2,)

The error comes from a test on the mask. "Flexible type" refers to your compound dtype:

In [330]: a.mask.any()
Traceback (most recent call last):
  File "<ipython-input-330-8dc32ee3f59d>", line 1, in <module>
    a.mask.any()
  File "/usr/local/lib/python3.8/dist-packages/numpy/core/_methods.py", line 57, in _any
    return umr_any(a, axis, dtype, out, keepdims)
TypeError: cannot perform reduce with flexible type

The documented pandas feature clearly does not apply to structured arrays. Without studying the pandas code I can't say exactly what it's trying to do at this point, but it's clear the code was not written with structured arrays in mind.
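To make the failure a bit more concrete, here is a minimal sketch (my own illustration, not taken from the pandas source) showing that the mask of a structured masked array is itself structured, and that reducing it one field at a time works where the whole-mask reduction does not. The names per_field and any_masked are just for illustration.

```python
import numpy as np

a = np.ma.array([(1, 2.2), (42, 5.5)],
                dtype=[('a', int), ('b', float)],
                mask=[(True, False), (False, True)])

# The mask of a structured masked array is itself structured:
# it has one boolean field per data field.
mask_field_names = a.mask.dtype.names

# Each individual field of the mask is a plain boolean array,
# so reducing field by field works.
per_field = {name: bool(a.mask[name].any()) for name in a.dtype.names}
any_masked = any(per_field.values())

print(mask_field_names, per_field, any_masked)
```

So the information pandas wants is there, but it has to be asked for per field.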

The non-masked part does work, with the desired column dtypes:

In [332]: pd.DataFrame(a.data)
Out[332]: 
    a    b
0   1  2.2
1  42  5.5

Using the default fill:

In [344]: a.filled()
Out[344]: 
array([(999999, 2.2e+00), (    42, 1.0e+20)],
      dtype=[('a', '<i8'), ('b', '<f8')])
In [345]: pd.DataFrame(a.filled())
Out[345]: 
        a             b
0  999999  2.200000e+00
1      42  1.000000e+20

I'd have to look more at the ma docs/code to see if it's possible to apply a different fill to the two fields. Filling with nan doesn't work for the int field, and numpy has no counterpart to pandas' nullable integer NA. I haven't worked enough with that pandas feature to know whether the resulting dtype is still int or is changed to object.

Anyways, you are pushing the bounds of both np.ma and pandas with this task.

edit

The default fill_value is a tuple, one for each field:

In [350]: a.fill_value
Out[350]: (999999, 1.e+20)

So we can fill the fields differently, and make a frame from that:

In [351]: a.filled((-1, np.nan))
Out[351]: array([(-1, 2.2), (42, nan)], dtype=[('a', '<i8'), ('b', '<f8')])
In [352]: pd.DataFrame(a.filled((-1, np.nan)))
Out[352]: 
    a    b
0  -1  2.2
1  42  NaN

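Looks like I can also make a structured array using a pandas dtype and its associated fill_value, though note in the output that numpy stores that field as object ('O') rather than as a true nullable integer: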

In [363]: a = np.ma.array([(1, 2.2), (42, 5.5)],
     ...:              dtype=[('a',pd.Int64Dtype),('b',float)],
     ...:              mask=[(True,False),(False,True)],
     ...:              fill_value=(pd.NA,np.nan))
In [364]: a
Out[364]: 
masked_array(data=[(--, 2.2), (42, --)],
             mask=[( True, False), (False,  True)],
       fill_value=(<NA>, nan),
            dtype=[('a', 'O'), ('b', '<f8')])

In [366]: pd.DataFrame(a.filled())
Out[366]: 
      a    b
0  <NA>  2.2
1    42  NaN
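Another option, sketched here under my own assumptions rather than as anything pandas documents for structured arrays, is to skip the structured-array path and build the frame column by column, handing each field's data and mask to pandas separately. The helper name masked_records_to_frame is hypothetical. I use np.int64 explicitly so the integer column comes out as Int64 on every platform, and I copy each field so pandas gets contiguous arrays.

```python
import numpy as np
import pandas as pd


def masked_records_to_frame(arr):
    """Build a DataFrame from a structured masked array, one field at a time."""
    columns = {}
    for name in arr.dtype.names:
        field = arr[name]                            # 1-D masked array for this field
        values = np.array(field.data)                # contiguous copy of the raw values
        mask = np.array(np.ma.getmaskarray(field))   # contiguous boolean mask
        if values.dtype.kind in "iu":
            # Nullable integer column: masked entries become <NA>.
            columns[name] = pd.arrays.IntegerArray(values, mask)
        elif values.dtype.kind == "f":
            # Float column: masked entries become NaN.
            columns[name] = np.where(mask, np.nan, values)
        else:
            # Fallback (assumption): object column with None for masked entries.
            obj = values.astype(object)
            obj[mask] = None
            columns[name] = obj
    return pd.DataFrame(columns)


a = np.ma.array([(1, 2.2), (42, 5.5)],
                dtype=[('a', np.int64), ('b', float)],
                mask=[(True, False), (False, True)])
df = masked_records_to_frame(a)
print(df)
print(df.dtypes)
```

This keeps column a as a nullable integer instead of object, at the cost of copying each field once.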

Upvotes: 3

mozway

Reputation: 260790

The question is: what would you expect to get? The conversion is ambiguous from pandas' point of view.

If you want to get the original data:

>>> pd.DataFrame(a.data)
    a    b
0   1  2.2
1  42  5.5

If you want to consider masked values invalid:

>>> pd.DataFrame(a.filled(np.nan))

But this only works if every field in the masked array is of float type, since NaN cannot be stored in an integer field.
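One way to get there, as a rough sketch (assuming you are fine with the integer column becoming float), is to cast each field to float before filling. The name df_float is only illustrative.

```python
import numpy as np
import pandas as pd

a = np.ma.array([(1, 2.2), (42, 5.5)],
                dtype=[('a', int), ('b', float)],
                mask=[(True, False), (False, True)])

# Cast each field to float first, so NaN is a valid fill value for it,
# then build the frame from the filled per-field arrays.
df_float = pd.DataFrame(
    {name: a[name].astype(float).filled(np.nan) for name in a.dtype.names}
)
print(df_float)
```

Both columns end up as float64, with NaN wherever the original value was masked.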

Upvotes: 0
