makansij

Reputation: 9865

Why does changing one `np.nan` value change all of the NaN values in a pandas DataFrame?

When I change one value in the entire DataFrame, it changes other values. Compare scenario 1 and scenario 2:

Scenario 1: notice that I use only float(np.nan) values for the NaNs

import random
import numpy as np
import pandas as pd

info_num = np.array([[random.randint(0,9) for x in range(4)]+['ui'],
[random.randint(0,8) for x in range(3)]+[float(np.nan)]+['g'],
[random.randint(0,7) for x in range(2)]+[float(np.nan)]+[90]+[float(np.nan)],
[random.randint(0,9) for x in range(4)]+['q'],
[random.randint(0,9) for x in range(4)]+['w']])

result_df = pd.DataFrame(data=info_num, columns=['G','Bd', 'O', 'P', 'keys'])

result_df = result_df.fillna(0.0)  # does NOT fill in NaNs

The result of Scenario 1 is just a dataframe without the NaNs filled in.

Scenario 2: notice that I put a None value in ONE spot

info_num = np.array([[random.randint(0,9) for x in range(4)]+['ui'],
[random.randint(0,8) for x in range(3)]+[None]+['g'],
[random.randint(0,7) for x in range(2)]+[float(np.nan)]+[90]+[float(np.nan)],
[random.randint(0,9) for x in range(4)]+['q'],
[random.randint(0,9) for x in range(4)]+['w']])

result_df = pd.DataFrame(data=info_num, columns=['G','Bd', 'O', 'P', 'keys'])

result_df = result_df.fillna(0.0)  # this works!?!

Even though I only replaced one of the NaN values with None, the other float(np.nan)s get filled in with 0.0 as well, as if they are NaNs too.

Why is there some relationship between the NaNs?

Upvotes: 0

Views: 1521

Answers (1)

hpaulj

Reputation: 231385

The 1st info_num is dtype='<U3' (strings). In the 2nd it is dtype=object, a mix of integers, nan (a float), strings, and a None.
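The coercion can be seen directly (a minimal sketch; the exact string width numpy picks may differ):

```python
import numpy as np

# Without None, numpy finds a common *string* dtype for the mix of
# ints, floats and strings, so np.nan becomes the literal string 'nan':
a = np.array([1, np.nan, 'g'])
print(a.dtype.kind)  # 'U' -- a Unicode string dtype
print(a[1])          # 'nan' -- a 3-character string, not a missing value

# A None (or an explicit dtype=object) keeps each element as-is:
b = np.array([1, np.nan, None, 'g'], dtype=object)
print(b.dtype)       # object
```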

In the dataframes I see something that prints as 'nan' in the first, and a mix of None and NaN in the other. It looks like fillna treats None and NaN the same, but ignores the string 'nan'.
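A quick sketch of that behaviour on an object Series (values chosen to mirror the question):

```python
import numpy as np
import pandas as pd

# Both np.nan and None count as missing; the string 'nan' does not:
s = pd.Series([np.nan, None, 'nan', 'g'], dtype=object)
filled = s.fillna(0.0)
print(filled.tolist())  # [0.0, 0.0, 'nan', 'g'] -- the string 'nan' survives
```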

The doc for fillna says:

Fill NA/NaN values using the specified method

Pandas NaN is the same as np.nan.

fillna uses pd.isnull to determine where to put the 0.0 value.

def isnull(obj):
    """Detect missing values (NaN in numeric arrays, None/NaN in object arrays)

For the 2nd case:

In [116]: pd.isnull(result_df)
Out[116]: 
       G     Bd      O      P   keys
0  False  False  False  False  False
1  False  False  False   True  False
2  False  False   True  False   True
3  False  False  False  False  False
4  False  False  False  False  False

(it's all False for the first, string, case).
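The same rule can be checked on scalars (a minimal sketch):

```python
import numpy as np
import pandas as pd

# pd.isnull flags real missing values, but not the string 'nan':
print(pd.isnull(np.nan))   # True
print(pd.isnull(None))     # True
print(pd.isnull('nan'))    # False
```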


In [121]: info_num0
Out[121]: 
array([['4', '8', '5', '6', 'ui'],
       ['1', '5', '6', 'nan', 'g'],
       ['6', '1', 'nan', '90', 'nan'],
       ['5', '2', '8', '4', 'q'],
       ['1', '6', '4', '3', 'w']], 
      dtype='<U3')
In [122]: info_num
Out[122]: 
array([[1, 8, 3, 0, 'ui'],
       [1, 5, 1, None, 'g'],
       [0, 2, nan, 90, nan],
       [7, 7, 1, 4, 'q'],
       [3, 7, 0, 3, 'w']], dtype=object)

np.nan is float already:

In [125]: type(np.nan)
Out[125]: float

If you'd added dtype=object to the initial array definition, you'd get the same effect as using that None:

In [140]: np.array([[random.randint(0,9) for x in range(4)]+['ui'],
[random.randint(0,8) for x in range(3)]+[np.nan]+['g'],
[random.randint(0,7) for x in range(2)]+[np.nan]+[90]+[np.nan],
[random.randint(0,9) for x in range(4)]+['q'],
[random.randint(0,9) for x in range(4)]+['w']],dtype=object)
Out[140]: 
array([[6, 7, 8, 1, 'ui'],
       [5, 2, 5, nan, 'g'],
       [3, 0, nan, 90, nan],
       [5, 2, 1, 3, 'q'],
       [1, 7, 7, 2, 'w']], dtype=object)

Better yet, create the initial data as a list of lists rather than an array. numpy arrays have to have uniform elements; with a mix of ints, nan, and strings you only get that with dtype=object, but that is little more than an array wrapper around a list. Python lists already allow this kind of diversity.

In [141]: alist = [[random.randint(0,9) for x in range(4)]+['ui'],
[random.randint(0,8) for x in range(3)]+[np.nan]+['g'],
[random.randint(0,7) for x in range(2)]+[np.nan]+[90]+[np.nan],
[random.randint(0,9) for x in range(4)]+['q'],
[random.randint(0,9) for x in range(4)]+['w']]
In [142]: alist
Out[142]: 
[[4, 0, 2, 6, 'ui'],
 [3, 3, 3, nan, 'g'],
 [3, 5, nan, 90, nan],
 [4, 0, 6, 7, 'q'],
 [0, 8, 3, 8, 'w']]
In [143]: result_df1 = pd.DataFrame(data=alist, columns=['G','Bd', 'O', 'P', 'keys'])
In [144]: result_df1
Out[144]: 
   G  Bd   O   P keys
0  4   0   2   6   ui
1  3   3   3 NaN    g
2  3   5 NaN  90  NaN
3  4   0   6   7    q
4  0   8   3   8    w

I'm not sure how pandas stores this internally, but result_df1.values does return an object array.

In [146]: result_df1.values
Out[146]: 
array([[4, 0, 2.0, 6.0, 'ui'],
       [3, 3, 3.0, nan, 'g'],
       [3, 5, nan, 90.0, nan],
       [4, 0, 6.0, 7.0, 'q'],
       [0, 8, 3.0, 8.0, 'w']], dtype=object)

So if a column has a nan, all of its numbers are floats (nan is a kind of float). The first 2 columns remain integer. The last is a mix of strings and that nan.

But the dtypes suggest that pandas is using something like a structured array, with each column being a field with the relevant dtype.

In [147]: result_df1.dtypes
Out[147]: 
G         int64
Bd        int64
O       float64
P       float64
keys     object
dtype: object

The equivalent numpy dtype would be:

dt = np.dtype([('G',np.int64),('Bd',np.int64),('O',np.float64),('P',np.float64), ('keys',object)])

We can make a structured array with this dtype. I have to turn the list of lists into a list of tuples (the structured records):

X = np.array([tuple(x) for x in alist], dt)

producing:

array([(4, 0, 2.0, 6.0, 'ui'), 
       (3, 3, 3.0, nan, 'g'),
       (3, 5, nan, 90.0, nan), 
       (4, 0, 6.0, 7.0, 'q'), 
       (0, 8, 3.0, 8.0, 'w')], 
      dtype=[('G', '<i8'), ('Bd', '<i8'), ('O', '<f8'), ('P', '<f8'), ('keys', 'O')])

That can go directly into Pandas as:

In [162]: pd.DataFrame(data=X)
Out[162]: 
   G  Bd   O   P keys
0  4   0   2   6   ui
1  3   3   3 NaN    g
2  3   5 NaN  90  NaN
3  4   0   6   7    q
4  0   8   3   8    w

Upvotes: 1
