Reputation: 369

How to make this for loop faster?

I know that python loops themselves are relatively slow when compared to other languages but when the correct functions are used they become much faster. I have a pandas dataframe called "acoustics" which contains over 10 million rows:

print(acoustics)
                        timestamp            c0  rowIndex
0        2016-01-01T00:00:12.000Z  13931.500000   8158791
1        2016-01-01T00:00:30.000Z  14084.099609   8158792
2        2016-01-01T00:00:48.000Z  13603.400391   8158793
3        2016-01-01T00:01:06.000Z  13977.299805   8158794
4        2016-01-01T00:01:24.000Z  13611.000000   8158795
5        2016-01-01T00:02:18.000Z  13695.000000   8158796
6        2016-01-01T00:02:36.000Z  13809.400391   8158797
7        2016-01-01T00:02:54.000Z  13756.000000   8158798

and there is the code I wrote:

acoustics = pd.read_csv("AccousticSandDetector.csv", skiprows=[1])
weights = [1/9, 1/18, 1/27, 1/36, 1/54]
sumWeights = np.sum(weights)
deltaAc = []
for i in range(5, len(acoustics)):
    time = acoustics.iloc[i]['timestamp']
    sum = 0
    for c in range(5):
        sum += (weights[c]/sumWeights)*(acoustics.iloc[i]['c0']-acoustics.iloc[i-c]['c0'])
    print("Row " + str(i) + " of " + str(len(acoustics)) + " is iterated")
    deltaAc.append([time, sum])

deltaAc = pd.DataFrame(deltaAc)

It takes a huge amount of time, how can I make it faster?

Upvotes: 2

Answers (3)

Ben.T

Reputation: 29635

You can use diff from pandas and create all the differences for each row in an array, then multiply with your weigths and finally sum over the axis 1, such as:

deltaAc = pd.DataFrame({'timestamp': acoustics.loc[5:, 'timestamp'], 
                       'summation': (np.array([acoustics.c0.diff(i) for i in range(5) ]).T[5:]
                                               *np.array(weights)).sum(1)/sumWeights})

and you get the same values than what I get with your code:

print (deltaAc)
                  timestamp  summation
5  2016-01-01T00:02:18.000Z -41.799986
6  2016-01-01T00:02:36.000Z  51.418728
7  2016-01-01T00:02:54.000Z  -3.111184

Upvotes: 1

bubble

Reputation: 1672

Dataframes have a great method rolling for constructing and applying windowing transformations; So, you don't need loops at all:

# df is your data frame
window_size = 5
weights = pd.np.array([1/9, 1/18, 1/27, 1/36, 1/54])
weights /= weights.sum()
df.loc[:,'deltaAc'] = df.loc[:, 'c0'].rolling(window_size).apply(lambda x: ((x[-1] - x)*weights).sum())

Upvotes: 0

PiRK

Reputation: 1055

First optimization, weights[c]/sumWeights could be done outside the loop.

weights_array = np.array([1/9, 1/18, 1/27, 1/36, 1/54])
sumWeights = np.sum(weights_array)
tmp = weights_array / sumWeights
...
        sum += tmp[c]*...

I'm not familiar with pandas, but if you could extract your columns as 1D numpy array, it would be great for you. It might look something like:

# next lines to be tested, or find the correct way of extracting the column
c0_column = acoustics[['c0']].values
time_column = acoustics[['times']].values
...
sum = numpy.zeros(shape=(len(acoustics)-5,))
delta_ac = []
for c in range(5):
    sum += tmp[c]*(c0_column[5:]-c0_column[5-c:len(acoustics)-c])

for i in range(len(acoustics)-5):
    deltaAc.append([time[5+i], sum[i])

Upvotes: 0

How to make this for loop faster?

Answers (3)

Related Questions