Vectorizing a series of CDF samples in Python with NumPy

Question

I am writing a basic financial program in Python in which daily expenses are read in as a table and turned into a PDF (Probability Density Function) and eventually a CDF (Cumulative Distribution Function) that ranges from 0 to 1, using the built-in histogram capability of NumPy. I am trying to randomly sample a daily expense by comparing a random number between 0 and 1 with the CDF array and an array of the bin center points, and using the interp1d functionality of SciPy to determine the interpolated value. I have successfully implemented this algorithm with a for loop, but it is way too slow, so I am trying to convert it to a vectorized form. Below I include the code that works with a for loop and my attempt so far at vectorizing the algorithm. I would greatly appreciate any advice on how to make the vectorized version work and increase the execution speed of the code.

Sample input file:

12.00    March 01, 2014
0.00     March 02, 2014
0.00     March 03, 2014
0.00     March 04, 2014
0.00     March 05, 2014
0.00     March 06, 2014
44.50    March 07, 2014
0.00     March 08, 2014
346.55   March 09, 2014
168.18   March 10, 2014
140.82   March 11, 2014
10.83    March 12, 2014
0.00     March 13, 2014
0.00     March 14, 2014
174.00   March 15, 2014
0.00     March 16, 2014
0.00     March 17, 2014
266.53   March 18, 2014
0.00     March 19, 2014
110.00   March 20, 2014
0.00     March 21, 2014
0.00     March 22, 2014
44.50    March 23, 2014

For-loop version of the code (works, but is too slow):

#!/usr/bin/python
import pandas as pd
import numpy as np
import random
import itertools
import scipy.interpolate

def Linear_Interpolation(rand,Array,Array_Center):
    if(rand < Array[0]):
        y_interp = scipy.interpolate.interp1d((0,Array[0]),(0,Array_Center[0]))
    else:
        y_interp = scipy.interpolate.interp1d(Array,Array_Center)

    final_value = y_interp(rand)
    return (final_value)

#--------- Main Program --------------------
# - Reads the file in and transforms the first column of float variables into
#   an array titled MISC_DATA
File1 = '../../Input_Files/Histograms/Static/Misc.txt'
MISC_DATA = pd.read_table(File1,header=None,names = ['expense','month','day','year'],sep = '\s+')

# Creates the PDF bin heights and edges
Misc_hist, Misc_bin_edges = np.histogram(MISC_DATA['expense'],bins=60,normed=True)
# Creates the CDF bin heights
Misc = np.cumsum(Misc_hist*np.diff(Misc_bin_edges))
# Creates an array of the bin center points along the x axis
Misc_Center = (Misc_bin_edges[:-1] + Misc_bin_edges[1:])/2

iterator = range(0,100)
for cycle in iterator:
    MISC_EXPENSE = Linear_Interpolation(random.random(),Misc,Misc_Center)
    print MISC_EXPENSE

I am trying to vectorize the for loop as shown below, turning the variable MISC_EXPENSE from a scalar into an array, but it does not work. It raises a ValueError saying that the truth value of an array with more than one element is ambiguous. I think this refers to the fact that the array of random values 'rand_var' has a different dimension than the arrays 'Misc' and 'Misc_Center'. Any suggestions are appreciated.

rand_var = np.random.rand(100)
MISC_EXPENSE = Linear_Interpolation(rand_var,Misc,Misc_Center)

crlb · Accepted Answer

If I understood your example correctly, the code creates one interpolation object per random number, which is slow. However, interp1d can take a whole vector of values to interpolate. I also assume the starting zero should be part of the CDF in any case:

y_interp = scipy.interpolate.interp1d(
    np.concatenate((np.array([0]), Misc)),
    np.concatenate((np.array([0]), Misc_Center))
)


new_vals = y_interp(np.random.rand(100))
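
For comparison, here is a minimal self-contained sketch of the same idea that can be run without the input file. It makes a few assumptions that are not part of the original post: the expense values are typed in from the sample file shown in the question, np.interp is swapped in for scipy.interpolate.interp1d, density=True is used in place of the older normed=True argument, and the seed is arbitrary. One difference worth knowing about: np.interp clamps inputs that fall outside the tabulated range, whereas interp1d raises an error for them by default.

```python
import numpy as np

# Expense column from the sample input file in the question.
expenses = np.array([12.00, 0.00, 0.00, 0.00, 0.00, 0.00, 44.50, 0.00,
                     346.55, 168.18, 140.82, 10.83, 0.00, 0.00, 174.00,
                     0.00, 0.00, 266.53, 0.00, 110.00, 0.00, 0.00, 44.50])

# PDF bin heights and edges (density=True replaces the older normed=True).
hist, bin_edges = np.histogram(expenses, bins=60, density=True)
# CDF value at each bin, and the bin center points along the x axis.
cdf = np.cumsum(hist * np.diff(bin_edges))
centers = (bin_edges[:-1] + bin_edges[1:]) / 2

# Prepend the (0, 0) point so random numbers below cdf[0] are covered too,
# which removes the need for the if/else branch on a scalar.
cdf_x = np.concatenate(([0.0], cdf))
centers_y = np.concatenate(([0.0], centers))

# Draw all random numbers at once and interpolate them in a single call.
rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability
rand_var = rng.random(100)
samples = np.interp(rand_var, cdf_x, centers_y)

print(samples[:5])
```

The original error comes from the line `if(rand < Array[0])`: when rand is an array, the comparison yields an array of booleans, which Python cannot reduce to a single True or False. Prepending the zero point removes that branch entirely, so the whole array can be passed through in one call.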
