Pylander
Pylander

Reputation: 1591

python pandas: vectorized function value error "lengths do not match"

I have a pandas dataframe:

'customer_id','transaction_dt','product','price','units'
1,2004-01-02 00:00:00,thing1,25,47
1,2004-01-17 00:00:00,thing2,150,8
2,2004-01-29 00:00:00,thing2,150,25
3,2017-07-15 00:00:00,thing3,55,17
3,2016-05-12 00:00:00,thing3,55,47
4,2012-02-23 00:00:00,thing2,150,22
4,2009-10-10 00:00:00,thing1,25,12
4,2014-04-04 00:00:00,thing2,150,2
5,2008-07-09 00:00:00,thing2,150,43
5,2004-01-30 00:00:00,thing1,25,40
5,2004-01-31 00:00:00,thing1,25,22
5,2004-02-01 00:00:00,thing1,25,2

And I have written the following to apply date range columns:

import numpy as np
import pandas as pd

start_date_period = pd.period_range('2004-01-01', '12-31-2017', freq='30D')
end_date_period = pd.period_range('2004-01-30', '12-31-2017', freq='30D')

def find_window_start_date(x):
    window_start_date_idx = np.argmax(x < start_date_period.end_time)
    return start_date_period[window_start_date_idx]

df['window_start_dt'] = df['transaction_dt'].apply(find_window_start_date)

def find_window_end_date(x):
    window_end_date_idx = np.argmin(x > end_date_period.start_time)
    return end_date_period[window_end_date_idx]

df['window_end_dt'] = df['transaction_dt'].apply(find_window_end_date)

However this is very slow, so I have been trying to vectorize it:

import numpy as np
import pandas as pd

start_date_period = pd.period_range('2004-01-01', '12-31-2017', freq='30D')
end_date_period = pd.period_range('2004-01-30', '12-31-2017', freq='30D')

def find_window_start_date(x):
    window_start_date_idx = np.argmax(x < start_date_period.end_time)
    return start_date_period[window_start_date_idx]

df['window_start_dt'] = find_window_start_date(df['transaction_dt'].values)

def find_window_end_date(x):
    window_end_date_idx = np.argmin(x > end_date_period.start_time)
    return end_date_period[window_end_date_idx]

df['window_end_dt'] = find_window_end_date(df['transaction_dt'].values)

However this produces a value error: "Lengths must match to compare". I am new to trying to write vectorized functions from scratch, so I would appreciate any insight into where I am going awry.

EDIT:

Here is the full error message:

ValueError                                Traceback (most recent call last)
<ipython-input-11-a781075467c5> in <module>()
      5     return start_date_period[window_start_date_idx]
      6 
----> 7 df['window_start_dt'] = find_window_start_date(df['transaction_dt'].values)
      8 
      9 def find_window_end_date(x):

<ipython-input-11-a781075467c5> in find_window_start_date(x)
      2 
      3 def find_window_start_date(x):
----> 4     window_start_date_idx = np.argmax(x < start_date_period.end_time)
      5     return start_date_period[window_start_date_idx]
      6 

C:\Users\AppData\Local\Continuum\anaconda2\lib\site-packages\pandas\core\ops.pyc in wrapper(self, other, axis)
    826             if (not is_scalar(lib.item_from_zerodim(other)) and
    827                     len(self) != len(other)):
--> 828                 raise ValueError('Lengths must match to compare')
    829 
    830             if isinstance(other, ABCPeriodIndex):

ValueError: Lengths must match to compare

EDIT:

I ended up finding an edge condition with the original solution when there are collisions on the first/last days of a 30 day window. I have made some changes to get closer to a robust solution now:

start_date_range = pd.date_range('2004-01-01 00:00:00', '12-31-2017 00:00:00', freq='30D')
end_date_range = pd.date_range('2004-01-30 23:59:59', '12-31-2017 23:59:59', freq='30D')

tra = df['transaction_dt'].values[:, None]
idx1 = np.argmax(start_date_range.values < tra, axis=1)
idx2 = np.argmax(end_date_range.values > tra, axis=1)

df['window_start_dt'] = start_date_range[idx1]
df['window_end_dt'] = end_date_range[idx2]

However, this is still not working correctly because it only sets 'window_start_dt' to the lowest/first value in the date range: '2004-01-01'. Good news is it should be faster yet again.

EDIT:

I added an answer with the solution to the date collision issue below based on jezrael's answer

EDIT

Turns out there was still one more edge case. Please see jezrael's answer here for the final solution:Numpy: conditional np.where replace

Upvotes: 1

Views: 432

Answers (3)

Pylander
Pylander

Reputation: 1591

I actually found a particularly clever fix for the date collision issue which is just a small change from @jezrael's fantastic answer:

It occurred to me that start_date_range and end_date_range have the exact same number of elements and that, in this specific case, you can just reuse the index created from end_date_range and apply it via np.take to start_date_range to get the correct intervals every time:

start_date_range = pd.date_range('2004-01-01 00:00:00', '12-31-2017 00:00:00', freq='30D')
end_date_range = pd.date_range('2004-01-30 23:59:59', '12-31-2017 23:59:59', freq='30D')

tra = df['transaction_dt'].values[:, None]
idx = np.argmax(end_date_range.values > tra, axis=1)

df['window_start_dt'] = np.take(start_date_range, idx)
df['window_end_dt'] = end_date_range[idx]

It is also even faster now!

Upvotes: 2

jezrael
jezrael

Reputation: 862681

I believe you can use broadcasting:

start_date_period = pd.period_range('2004-01-01', '12-31-2017', freq='30D')
end_date_period = pd.period_range('2004-01-30', '12-31-2017', freq='30D')

tra = df['transaction_dt'].values[:, None]
idx1 = np.argmax(tra < start_date_period.end_time.values, axis=1)
idx2 = np.argmin(tra > end_date_period.start_time.values, axis=1)

df['window_start_dt'] = start_date_period[idx1]
df['window_end_dt'] = end_date_period[idx2]
print (df)

   customer_id transaction_dt product  price  units window_start_dt  \
0            1     2004-01-02  thing1     25     47      2004-01-01   
1            1     2004-01-17  thing2    150      8      2004-01-01   
2            2     2004-01-29  thing2    150     25      2004-01-01   
3            3     2017-07-15  thing3     55     17      2017-06-21   
4            3     2016-05-12  thing3     55     47      2016-04-27   
5            4     2012-02-23  thing2    150     22      2012-02-18   
6            4     2009-10-10  thing1     25     12      2009-10-01   
7            4     2014-04-04  thing2    150      2      2014-03-09   
8            5     2008-07-09  thing2    150     43      2008-07-08   

  window_end_dt  
0    2004-01-30  
1    2004-01-30  
2    2004-01-30  
3    2017-07-20  
4    2016-05-26  
5    2012-03-18  
6    2009-10-30  
7    2014-04-07  
8    2008-08-06  

Upvotes: 1

wwii
wwii

Reputation: 23753

I'll sidestep the error you are getting for mismatching lengths.

This still uses .apply so it may not be what you want, but maybe simplifying it will help:

Make a column of pandas.Period's with a month frequency.

df['window'] = df.transaction_dt.apply(pd.Period, freq='M')

> df
=>            transaction_dt product  price  units  window
customer_id                                             
1               2004-01-02  thing1     25     47 2004-01
1               2004-01-17  thing2    150      8 2004-01
2               2004-01-29  thing2    150     25 2004-01
3               2017-07-15  thing3     55     17 2017-07
3               2016-05-12  thing3     55     47 2016-05
4               2012-02-23  thing2    150     22 2012-02
4               2009-10-10  thing1     25     12 2009-10
4               2014-04-04  thing2    150      2 2014-04
5               2008-07-09  thing2    150     43 2008-07

Each Period has a start_time and end_time attribute which is easy to use

>   df.window.dt.start_time
=> customer_id
1   2004-01-01
1   2004-01-01
2   2004-01-01
3   2017-07-01
3   2016-05-01
4   2012-02-01
4   2009-10-01
4   2014-04-01
5   2008-07-01
Name: window, dtype: datetime64[ns]

If you really need a start and end column;

df['start'] = df.window.dt.start_time
df['end'] = df.window.dt.end_time

>   df.head()
=>             transaction_dt product  price  units  window      start        end
customer_id                                                                   
1               2004-01-02  thing1     25     47 2004-01 2004-01-01 2004-01-31
1               2004-01-17  thing2    150      8 2004-01 2004-01-01 2004-01-31
2               2004-01-29  thing2    150     25 2004-01 2004-01-01 2004-01-31
3               2017-07-15  thing3     55     17 2017-07 2017-07-01 2017-07-31
3               2016-05-12  thing3     55     47 2016-05 2016-05-01 2016-05-31

Upvotes: 0

Related Questions