Assign value to subset of rows in Pandas dataframe

Question

I want to assign values based on a condition on index in Pandas DataFrame.

class test():
    def __init__(self):
        self.l = 1396633637830123000
        self.dfa = pd.DataFrame(np.arange(20).reshape(10,2), columns = ['A', 'B'], index = arange(self.l,self.l+10))
        self.dfb = pd.DataFrame([[self.l+1,self.l+3], [self.l+6,self.l+9]], columns = ['beg', 'end'])

    def update(self):
        self.dfa['true'] = False
        self.dfa['idx'] = np.nan
        for i, beg, end in zip(self.dfb.index, self.dfb['beg'], self.dfb['end']):
            self.dfa.ix[beg:end]['true'] = True
            self.dfa.ix[beg:end]['idx'] = i

    def do(self):
        self.update()
        print self.dfa

t = test()
t.do()

Result:

                      A   B   true  idx
1396633637830123000   0   1  False  NaN
1396633637830123001   2   3   True  NaN
1396633637830123002   4   5   True  NaN
1396633637830123003   6   7   True  NaN
1396633637830123004   8   9  False  NaN
1396633637830123005  10  11  False  NaN
1396633637830123006  12  13   True  NaN
1396633637830123007  14  15   True  NaN
1396633637830123008  16  17   True  NaN
1396633637830123009  18  19   True  NaN

The true column is correctly assigned, while the idx column is not. Futhermore, this seems to depend on how the columns are initialized because if I do:

    def update(self):
        self.dfa['true'] = False
        self.dfa['idx'] = False

also the true column does not get properly assigned.

What am I doing wrong?

p.s. the expected result is:

                      A   B   true  idx
1396633637830123000   0   1  False  NaN
1396633637830123001   2   3   True  0
1396633637830123002   4   5   True  0
1396633637830123003   6   7   True  0
1396633637830123004   8   9  False  NaN
1396633637830123005  10  11  False  NaN
1396633637830123006  12  13   True  1
1396633637830123007  14  15   True  1
1396633637830123008  16  17   True  1
1396633637830123009  18  19   True  1

Edit: I tried assigning using both loc and iloc but it doesn't seem to work: loc:

self.dfa.loc[beg:end]['true'] = True
self.dfa.loc[beg:end]['idx'] = i

iloc:

self.dfa.loc[self.dfa.index.get_loc(beg):self.dfa.index.get_loc(end)]['true'] = True
self.dfa.loc[self.dfa.index.get_loc(beg):self.dfa.index.get_loc(end)]['idx'] = i

Jeff · Accepted Answer

You are chain indexing, see here. The warning is not guaranteed to happen.

You should prob just do this. No real need to actually track the index in b, btw.

In [44]: dfa = pd.DataFrame(np.arange(20).reshape(10,2), columns = ['A', 'B'], index = np.arange(l,l+10))

In [45]: dfb = pd.DataFrame([[l+1,l+3], [l+6,l+9]], columns = ['beg', 'end'])

In [46]: dfa['in_b'] = False

In [47]: for i, s in dfb.iterrows():
   ....:     dfa.loc[s['beg']:s['end'],'in_b'] = True
   ....:

or this if you have non-integer dtypes

In [36]: for i, s in dfb.iterrows():
             dfa.loc[(dfa.index>=s['beg']) & (dfa.index<=s['end']),'in_b'] = True


In [48]: dfa
Out[48]: 
                      A   B  in_b
1396633637830123000   0   1  False
1396633637830123001   2   3  True
1396633637830123002   4   5  True
1396633637830123003   6   7  True
1396633637830123004   8   9  False
1396633637830123005  10  11  False
1396633637830123006  12  13  True
1396633637830123007  14  15  True
1396633637830123008  16  17  True
1396633637830123009  18  19  True

[10 rows x 3 columns

If b is HUGE this might not be THAT performant.

As an aside, these look like nanosecond times. Can be more friendly by converting them.

In [49]: pd.to_datetime(dfa.index)
Out[49]: 

[2014-04-04 17:47:17.830123, ..., 2014-04-04 17:47:17.830123009]
Length: 10, Freq: None, Timezone: None

Assign value to subset of rows in Pandas dataframe

Answers (1)

Related Questions