WHZW

Reputation: 455

Manually create dummy based on some condition, what went wrong?

I have a dataset that has a column of numbers and NaNs. I want to create a new column of dummy variables for further calculation. Apparently something is wrong, because whatever I do, the dummy ends up as 1 in every row.

import pandas as pd
import numpy as np
all_air = pd.read_csv('small.csv')

all_air['D(0/1)']=np.nan
#all_air['C'].fillna(-1) #pandas will take NaN as 0 in calculation, right?
print(all_air['C'])


for n in all_air['C']:
    if n is None:
        all_air['D(0/1)'] = 0
    else:
        all_air['D(0/1)'] = 1
all_air.to_csv('sample_small.csv')

I am new to Python, so this is as far as I could get. Thanks in advance.

Upvotes: 2

Views: 333

Answers (1)

ely

Reputation: 77424

The assignment operation

all_air['D(0/1)'] = 0

sets the value to 0 for the entire column named 'D(0/1)'. So in effect, each time you encounter a value of n where n is None, you set the whole column to 0. Likewise, when n is not None you set the whole column to 1.
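A minimal sketch of that behavior, using a small hypothetical DataFrame in place of your CSV (note also that values read in as NaN are floats, not `None`, so the `n is None` test would never match them in the first place):

```python
import pandas as pd
import numpy as np

# Hypothetical small frame standing in for small.csv
all_air = pd.DataFrame({'C': [1.0, np.nan, 3.0]})
all_air['D(0/1)'] = np.nan

# Each assignment below overwrites the ENTIRE 'D(0/1)' column,
# so only the branch taken on the LAST iteration survives.
for n in all_air['C']:
    if pd.isnull(n):          # NaN is not None; pd.isnull catches it
        all_air['D(0/1)'] = 0
    else:
        all_air['D(0/1)'] = 1

print(all_air['D(0/1)'].tolist())  # -> [1, 1, 1]: last value (3.0) is non-null
```

Because the last element of `'C'` here is non-null, the whole column ends up as 1, which matches the symptom described in the question.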

It seems from your description that what you actually want is a mask (for example, the locations where the value is null) and to modify values only at those locations.

This can be achieved with the loc indexer:

all_air['D(0/1)'] = 1
all_air.loc[all_air['C'].isnull(), 'D(0/1)'] = 0

In this example, I made use of the isnull method (also available as the top-level function pd.isnull), which checks all the elements of a pandas.Series to see if they are null (NaN or None). It returns a pandas.Series of Boolean values; the locations that evaluate to True are the ones selected.
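A quick sketch of what that Boolean Series looks like, on a hypothetical column containing both NaN and None:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, None])  # None is stored as NaN in a float Series
print(s.isnull().tolist())          # -> [False, True, True]
```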

So by passing this as the first dimension of the index for loc, we can modify the values in only those rows. The second dimension identifies the column to modify. Putting the value of 0 on the right-hand-side will automatically broadcast that scalar into a compatible array shape for assigning it into the column (some K-by-1 column vector, where K will be the number of null entries).
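Putting it all together on a small hypothetical frame (standing in for the data from small.csv):

```python
import pandas as pd
import numpy as np

# Hypothetical data in place of small.csv
all_air = pd.DataFrame({'C': [2.5, np.nan, 7.1, np.nan]})

all_air['D(0/1)'] = 1                                 # default: not null
all_air.loc[all_air['C'].isnull(), 'D(0/1)'] = 0      # overwrite only null rows

print(all_air['D(0/1)'].tolist())  # -> [1, 0, 1, 0]
```

The scalar 0 on the right-hand side is broadcast into the two selected rows, while the other rows keep their default of 1.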

Upvotes: 1
