Reputation: 9865
When I change one value in the entire DataFrame, it changes other values. Compare scenario 1 and scenario 2:
Scenario 1: Here notice that I only have float(np.nan) values for the NaNs:
import random
import numpy as np
import pandas as pd

info_num = np.array([[random.randint(0,9) for x in range(4)]+['ui'],
                     [random.randint(0,8) for x in range(3)]+[float(np.nan)]+['g'],
                     [random.randint(0,7) for x in range(2)]+[float(np.nan)]+[90]+[float(np.nan)],
                     [random.randint(0,9) for x in range(4)]+['q'],
                     [random.randint(0,9) for x in range(4)]+['w']])
result_df = pd.DataFrame(data=info_num, columns=['G','Bd', 'O', 'P', 'keys'])
result_df = result_df.fillna(0.0) # does NOT fill in the NaNs
The result of Scenario 1 is just a dataframe without the NaNs filled in.
Scenario 2: Here notice that I only have a None value in ONE spot:
info_num = np.array([[random.randint(0,9) for x in range(4)]+['ui'],
                     [random.randint(0,8) for x in range(3)]+[None]+['g'],
                     [random.randint(0,7) for x in range(2)]+[float(np.nan)]+[90]+[float(np.nan)],
                     [random.randint(0,9) for x in range(4)]+['q'],
                     [random.randint(0,9) for x in range(4)]+['w']])
result_df = pd.DataFrame(data=info_num, columns=['G','Bd', 'O', 'P', 'keys'])
result_df = result_df.fillna(0.0) # this works!?!
Even though I only swapped one of the float(np.nan) values for a None, the other float(np.nan)s get filled in with 0.0, as if they were NaNs too. Why is there some relationship between the NaNs?
Upvotes: 0
Views: 1521
Reputation: 231385
The 1st info_num is dtype='<U3' (strings). In the 2nd it is dtype=object, a mix of integers, nan (a float), strings, and a None.
In the dataframes I see something that prints as 'nan' in the one, and a mix of None and NaN in the other. It looks like fillna treats None and NaN the same, but ignores the string 'nan'.
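That can be checked with a small sketch (illustrative names, not from the question): building an array from mixed ints, a real NaN, and a string forces NumPy to a string dtype, turning np.nan into the literal string 'nan', which fillna leaves alone.

```python
import numpy as np
import pandas as pd

# Mixing ints, a real NaN, and a string forces NumPy to a string dtype
arr = np.array([1, np.nan, 'x'])
print(arr.dtype)       # a Unicode string dtype such as <U32
print(arr[1])          # prints nan, but it is now just a 3-character string

df = pd.DataFrame({'col': arr})
print(df.fillna(0.0))  # unchanged: the string 'nan' is not detected as missing
```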
The doc for fillna says: "Fill NA/NaN values using the specified method." Pandas NaN is the same as np.nan.
fillna uses pd.isnull to determine where to put the 0.0 value:
def isnull(obj):
    """Detect missing values (NaN in numeric arrays, None/NaN in object arrays)"""
For the 2nd case:
In [116]: pd.isnull(result_df)
Out[116]:
G Bd O P keys
0 False False False False False
1 False False False True False
2 False False True False True
3 False False False False False
4 False False False False False
(it's all False for the first, string, case).
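A quick scalar check (a minimal sketch) confirms the distinction fillna relies on:

```python
import numpy as np
import pandas as pd

print(pd.isnull(np.nan))        # True  -- a real float NaN
print(pd.isnull(float('nan')))  # True  -- the same thing
print(pd.isnull(None))          # True  -- None also counts as missing
print(pd.isnull('nan'))         # False -- just a 3-letter string
```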
In [121]: info_num0
Out[121]:
array([['4', '8', '5', '6', 'ui'],
['1', '5', '6', 'nan', 'g'],
['6', '1', 'nan', '90', 'nan'],
['5', '2', '8', '4', 'q'],
['1', '6', '4', '3', 'w']],
dtype='<U3')
In [122]: info_num
Out[122]:
array([[1, 8, 3, 0, 'ui'],
[1, 5, 1, None, 'g'],
[0, 2, nan, 90, nan],
[7, 7, 1, 4, 'q'],
[3, 7, 0, 3, 'w']], dtype=object)
np.nan is a float already:
In [125]: type(np.nan)
Out[125]: float
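To spell out the type difference (an illustrative sketch): None is its own type, not a float, yet in an object array both it and np.nan register as missing.

```python
import numpy as np
import pandas as pd

print(type(np.nan))   # <class 'float'>
print(type(None))     # <class 'NoneType'>

# In an object array both survive unchanged, and both count as missing
arr = np.array([1, None, np.nan, 'g'], dtype=object)
print(pd.isnull(arr))
```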
If you'd added dtype=object to the initial array definition, you'd get the same effect as using that None:
In [140]: np.array([[random.randint(0,9) for x in range(4)]+['ui'],
[random.randint(0,8) for x in range(3)]+[np.nan]+['g'],
[random.randint(0,7) for x in range(2)]+[np.nan]+[90]+[np.nan],
[random.randint(0,9) for x in range(4)]+['q'],
[random.randint(0,9) for x in range(4)]+['w']],dtype=object)
Out[140]:
array([[6, 7, 8, 1, 'ui'],
[5, 2, 5, nan, 'g'],
[3, 0, nan, 90, nan],
[5, 2, 1, 3, 'q'],
[1, 7, 7, 2, 'w']], dtype=object)
Better yet, create the initial data as a list of lists, rather than an array. numpy arrays have to have uniform elements; with a mix of ints, nan, and strings you only get that with dtype=object. But that is little more than an array wrapper around a list. Python lists already allow this kind of diversity.
In [141]: alist = [[random.randint(0,9) for x in range(4)]+['ui'],
[random.randint(0,8) for x in range(3)]+[np.nan]+['g'],
[random.randint(0,7) for x in range(2)]+[np.nan]+[90]+[np.nan],
[random.randint(0,9) for x in range(4)]+['q'],
[random.randint(0,9) for x in range(4)]+['w']]
In [142]: alist
Out[142]:
[[4, 0, 2, 6, 'ui'],
[3, 3, 3, nan, 'g'],
[3, 5, nan, 90, nan],
[4, 0, 6, 7, 'q'],
[0, 8, 3, 8, 'w']]
In [143]: result_df1 = pd.DataFrame(data=alist, columns=['G','Bd', 'O', 'P', 'keys'])
In [144]: result_df1
Out[144]:
G Bd O P keys
0 4 0 2 6 ui
1 3 3 3 NaN g
2 3 5 NaN 90 NaN
3 4 0 6 7 q
4 0 8 3 8 w
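With real float NaN values in the O, P and keys columns, fillna now behaves as expected. A small sketch, using the same column names but fixed sample values instead of random ones:

```python
import numpy as np
import pandas as pd

alist = [[4, 0, 2, 6, 'ui'],
         [3, 3, 3, np.nan, 'g'],
         [3, 5, np.nan, 90, np.nan],
         [4, 0, 6, 7, 'q'],
         [0, 8, 3, 8, 'w']]
df = pd.DataFrame(data=alist, columns=['G', 'Bd', 'O', 'P', 'keys'])
print(df.fillna(0.0))  # every NaN is replaced by 0.0
```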
I'm not sure how pandas stores this internally, but result_df1.values does return an object array.
In [146]: result_df1.values
Out[146]:
array([[4, 0, 2.0, 6.0, 'ui'],
[3, 3, 3.0, nan, 'g'],
[3, 5, nan, 90.0, nan],
[4, 0, 6.0, 7.0, 'q'],
[0, 8, 3.0, 8.0, 'w']], dtype=object)
So if a column has a nan, all its numbers are floats (nan is a kind of float). The first 2 columns remain integer. The last is a mix of strings and that nan.
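The promotion is easy to see on a single Series (a minimal sketch): adding one NaN to an otherwise-integer column upcasts the whole column to float64.

```python
import numpy as np
import pandas as pd

print(pd.Series([1, 2, 3]).dtype)       # int64
print(pd.Series([1, 2, np.nan]).dtype)  # float64 -- one NaN upcasts the column
```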
But dtypes suggests that pandas is using a structured array, with each column being a field with the relevant dtype:
In [147]: result_df1.dtypes
Out[147]:
G int64
Bd int64
O float64
P float64
keys object
dtype: object
The equivalent numpy dtype would be:
dt = np.dtype([('G',np.int64),('Bd',np.int64),('O',np.float64),('P',np.float64), ('keys',object)])
We can make a structured array with this dtype. I have to turn the list of lists into a list of tuples (the structured records):
X = np.array([tuple(x) for x in alist],dt)
producing:
array([(4, 0, 2.0, 6.0, 'ui'),
(3, 3, 3.0, nan, 'g'),
(3, 5, nan, 90.0, nan),
(4, 0, 6.0, 7.0, 'q'),
(0, 8, 3.0, 8.0, 'w')],
dtype=[('G', '<i8'), ('Bd', '<i8'), ('O', '<f8'), ('P', '<f8'), ('keys', 'O')])
That can go directly into Pandas as:
In [162]: pd.DataFrame(data=X)
Out[162]:
G Bd O P keys
0 4 0 2 6 ui
1 3 3 3 NaN g
2 3 5 NaN 90 NaN
3 4 0 6 7 q
4 0 8 3 8 w
Upvotes: 1