MJS
MJS

Reputation: 1623

python pandas trying to reduce reliance on loops

This is a general question on vectorization, but I will use an example to help ask the question. I have a dataframe df with df[col_1] bool (True/False). In df[col_2], I would like to return another True/False based on if the prior five rows of column 1, df[col_1][i-6:i-1], contain a match for df[col_1][i].

This is the loop I am using now, but it is one of many so I think they must be slowing things down as the data gets larger:

for i in df.index:
  if i < 6:
    df[col_2][i] = 0.
  else:
    df[col_2][i] = df[col_1][i] not in tuple(df[col_1].ix[i-6:i-1,col_1)

...output looks like this:

.   col_1   col_2
0   TRUE    
1   TRUE    
2   TRUE    
3   TRUE    
4   FALSE   
5   FALSE   FALSE
6   FALSE   FALSE
7   FALSE   FALSE
8   FALSE   FALSE
9   TRUE    TRUE
10  FALSE   FALSE
11  FALSE   FALSE
12  FALSE   FALSE
13  FALSE   FALSE
14  TRUE    FALSE
15  TRUE    FALSE
16  TRUE    FALSE
17  TRUE    FALSE
18  TRUE    FALSE
19  TRUE    FALSE
20  FALSE   TRUE

How can I do this in pandas with vectorization - maybe using shift() or an offset function?

Upvotes: 3

Views: 160

Answers (1)

JohnE
JohnE

Reputation: 30444

Here's a simple vectorized solution that should be pretty fast, although there is probably a more elegant way to write it. You can just ignore the first 5 rows or overwrite them to NaN if you prefer.

df = pd.DataFrame({ 'col_1':[True,True,True,True,False,False,False,False,
                             False,True,False,False,False,False,True,True,
                             True,True,True,True,False] })

df['col_2'] = ((df!=df.shift(1)) & (df!=df.shift(2)) & (df!=df.shift(3)) & 
               (df!=df.shift(4)) & (df!=df.shift(5)))

If speed really matters, you could do something like the following. It's more than 3x faster than the above and probably about as efficient as you can do here. This is just using the fact that rolling_sum() will interpret booleans as 0/1 and you just need to know if the sum is 0 or 5.

df['rollsum'] = pd.rolling_sum(df.col_1,6) - df.col_1
df['col_3'] = ( ((df.col_1==True ) & (df.rollsum==0)) 
              | ((df.col_1==False) & (df.rollsum==5)) )

    col_1  col_2  rollsum  col_3
0    True   True      NaN  False
1    True  False      NaN  False
2    True  False      NaN  False
3    True  False      NaN  False
4   False   True      NaN  False
5   False  False        4  False
6   False  False        3  False
7   False  False        2  False
8   False  False        1  False
9    True   True        0   True
10  False  False        1  False
11  False  False        1  False
12  False  False        1  False
13  False  False        1  False
14   True  False        1  False
15   True  False        1  False
16   True  False        2  False
17   True  False        3  False
18   True  False        4  False
19   True  False        5  False
20  False   True        5   True

Upvotes: 2

Related Questions