Why am I getting strange behavior for creating several boolean series?

Question

I have a DataFrame to which I am adding several boolean columns. For each column, I initialize it to False and then set some values to True. If I do this for one and then for another, the first gets reinitialized to all False. For example,

In [170]: df['racedif']=False

In [171]: df['racedif'][~ df.newpers]=df.ptdtrace[~ df.newpers]!=df.ptdtrace.groupby(df.personid).apply(pd.Series.shift)[~ df.newpers]

In [172]: df.racedif.sum()
Out[172]: 28

In [173]: df.sexdif.sum()
Out[173]: 0

In [174]: df['sexdif']=False

In [175]: df['sexdif'][~ df.newpers]=df.pesex[~ df.newpers]!=df.pesex.groupby(df.personid).apply(pd.Series.shift)[~ df.newpers]

In [176]: df.sexdif.sum()
Out[176]: 31

In [177]: df.racedif.sum()
Out[177]: 0

But if I first initialize them both to False before setting values, this does not happen.

In [203]: df['sexdif']=False
     ...: df['racedif']=False
     ...: df['sexdif'][~ df.newpers]=df.pesex[~ df.newpers]!=df.pesex.groupby(df.personid).apply(pd.Series.shift)[~ df.newpers]
     ...: df['racedif'][~ df.newpers]=df.ptdtrace[~ df.newpers]!=df.ptdtrace.groupby(df.personid).apply(pd.Series.shift)[~ df.newpers]
     ...: 

In [204]: df.sexdif.sum()
Out[204]: 31

In [205]: df.racedif.sum()
Out[205]: 28

Why is this happening and is this a bug?

Edit: here is a simpler example that does not show the same problem. Why not?

In [255]: df.x=False

In [256]: df.x[df.is456]=df['truth'][df.is456]

In [257]: df.x
Out[257]: 
0    False
1    False
2    False
3    False
4     True
5     True
6     True
7    False
8    False
9    False
Name: x, dtype: bool

In [258]: df.y=False

In [259]: df.y[df.is456]=df['truth'][df.is456]

In [260]: df.y
Out[260]: 
0    False
1    False
2    False
3    False
4     True
5     True
6     True
7    False
8    False
9    False
Name: y, dtype: bool

In [261]: df.x
Out[261]: 
0    False
1    False
2    False
3    False
4     True
5     True
6     True
7    False
8    False
9    False
Name: x, dtype: bool

Edit: selecting the column with df.loc instead, the problem persists:

In [281]: df.loc[:,'sexdif']=False

In [282]: df.sexdif.sum()
Out[282]: 0

In [283]: df.loc[:,'sexdif'][~ df.newpers]=df.pesex[~ df.newpers]!=df.pesex.groupby(df.personid).apply(pd.Series.shift)[~ df.newpers]

In [284]: df.sexdif.sum()
Out[284]: 31

In [285]: df.loc[:,'racedif']=False

In [286]: df.sexdif.sum()
Out[286]: 0

Jeff · Accepted Answer

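You are using chained indexing. An expression like df[col][rows] = value first evaluates df[col], which may return either a view or a copy, so the assignment is not guaranteed to reach the original DataFrame. See the docs here: http://pandas-docs.github.io/pandas-docs-travis/indexing.html#indexing-view-versus-copy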

The bottom line is to assign with

df.loc[row_indexer,col_indexer] = value

and not with

df[col_indexer][row_indexer] = value
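
Applied to the question, a minimal sketch could look like the following. The sample data and the helper name flag_changes are made up for illustration; only the column names (personid, pesex, ptdtrace, newpers) come from the question. It also assumes groupby(...).shift() is an acceptable stand-in for the original groupby(...).apply(pd.Series.shift).

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "personid": [1, 1, 1, 2, 2, 3, 3],
    "pesex":    [1, 1, 2, 2, 2, 1, 1],
    "ptdtrace": [1, 2, 2, 3, 3, 1, 4],
})
# Assumption: newpers marks the first row of each person.
df["newpers"] = ~df["personid"].duplicated()


def flag_changes(frame, value_col, flag_col):
    """Set flag_col to True where value_col differs from the previous
    row of the same person, skipping each person's first row."""
    previous = frame.groupby("personid")[value_col].shift()
    changed = frame[value_col] != previous
    mask = ~frame["newpers"]
    frame[flag_col] = False
    # A single .loc call with both row and column indexers,
    # instead of the chained frame[flag_col][mask] = ...
    frame.loc[mask, flag_col] = changed[mask]


flag_changes(df, "ptdtrace", "racedif")
flag_changes(df, "pesex", "sexdif")

# Creating the second column leaves the first one intact.
print(df["racedif"].sum(), df["sexdif"].sum())
```

Because each assignment is a single .loc call on df itself, there is no intermediate object that could turn out to be a copy, so adding sexdif does not disturb racedif.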
