Reputation: 455
I have a dataset with a column containing numbers and NaNs. I want to create a new column of dummy variables (0/1) for further calculation. Apparently something is wrong, because whatever I do, the dummy ends up as 1 for every row.
import pandas as pd
import numpy as np

all_air = pd.read_csv('small.csv')
all_air['D(0/1)'] = np.nan
#all_air['C'].fillna(-1) #pandas will take NaN as 0 in calculation, right?
print all_air['C']
for n in all_air['C']:
    if n is None:
        all_air['D(0/1)'] = 0
    else:
        all_air['D(0/1)'] = 1
all_air.to_csv('sample_small.csv')
I am new to Python, so this is as far as I can get. Thanks in advance.
Upvotes: 2
Views: 333
Reputation: 77424
The assignment operation `all_air['D(0/1)'] = 0` sets the value to 0 for the entire column named `'D(0/1)'`. So in effect, each time you encounter a value of `n` where `n is None`, you set the whole column to 0. Likewise, when `n` is not None, you set the whole column to 1. (Note also that pandas reads missing CSV entries as the float `NaN`, not as `None`, so the condition `n is None` is never true and the final assignment always leaves the column at 1, which is the symptom you observed.)
It seems from your description that you would rather have a mask, selecting for example those locations where the value is null, and only modify the values at those locations. This can be achieved with the `loc` indexer:
all_air['D(0/1)'] = 1
all_air.loc[all_air['C'].isnull(), 'D(0/1)'] = 0
In this example, I made use of the `isnull` method, which checks all the elements of a `pandas.Series` to see whether they are null (`NaN` or `None`). It returns a `pandas.Series` of Boolean values. Those locations which evaluate to `True` will be considered part of the index for evaluation. So by passing this as the first dimension of the index for `loc`, we can modify the values in only those rows. The second dimension identifies the column to modify. Putting the value of 0 on the right-hand side will automatically broadcast that scalar into a compatible array shape for assigning it into the column (some `K`-by-1 column vector, where `K` is the number of null entries).
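As a self-contained sketch of the same idea, using a small made-up DataFrame in place of your `small.csv` (the column name `'C'` is kept from your example):

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for small.csv
all_air = pd.DataFrame({'C': [1.5, np.nan, 3.0, np.nan]})

# Default every row to 1, then zero out the null rows via a boolean mask
all_air['D(0/1)'] = 1
all_air.loc[all_air['C'].isnull(), 'D(0/1)'] = 0

print(all_air['D(0/1)'].tolist())  # [1, 0, 1, 0]
```

Equivalently, since the mask is just "is the value non-null", the whole thing collapses to a one-liner: `all_air['D(0/1)'] = all_air['C'].notnull().astype(int)`.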
Upvotes: 1