Reputation: 23811
I'm trying to efficiently sum the elements of separate data arrays by their characteristics. I have three identifying characteristics (age, year, and cause) in a given array, and for each (age, year, cause) combination I have 1000 values. I need to add those values to another data array when the characteristics are the same. For now, I'm doing something like this, where each dataset is ~(80000, 1000):
import numpy as np

# stack the two datasets row-wise; vstack takes a single tuple of arrays
datasets = np.vstack((dataset1, dataset2))

for a in ages:
    for y in years:
        for c in causes:
            # sum the rows whose characteristics match (a, y, c)
            output = np.sum(datasets[(age == a) & (year == y) & (cause == c)], axis=0)
However, with 60,000 iterations, this is incredibly slow. The challenge is that the arrays don't necessarily all have the same shape. Any thoughts?
Upvotes: 1
Views: 689
Reputation: 23811
SEE LINK BELOW
I'm not sure how to properly link another answer to this answer. When I tried one sentence followed by the link, it converted the answer to a comment. I'm now being long-winded to try to make Stack Overflow think that this text is long enough to constitute an answer. Here is the link to a great answer to this question:
Summing Arrays by Characteristics in Python
Upvotes: 0
Reputation: 7046
I'd recommend something like accumarray. Your output should be a 3-dimensional data cube where each dimension corresponds to a variable (age, year, cause). Each index in each dimension corresponds to a unique value in your input lists. You can then use something like this cookbook example to accumulate the datasets variable into the appropriate bins using age, year, and cause.
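Here is a minimal sketch of that accumulation idea, assuming `age`, `year`, and `cause` are 1-D label arrays aligned with the rows of `datasets` (names borrowed from the question); it uses `np.unique` and `np.add.at` rather than the cookbook code itself:

```python
import numpy as np

# map each label to a dense integer index along its own dimension
age_vals, age_idx = np.unique(age, return_inverse=True)
year_vals, year_idx = np.unique(year, return_inverse=True)
cause_vals, cause_idx = np.unique(cause, return_inverse=True)

# output cube: one 1000-value slot per (age, year, cause) combination
cube = np.zeros((len(age_vals), len(year_vals), len(cause_vals),
                 datasets.shape[1]))

# accumulate every row of `datasets` into its bin in a single pass
np.add.at(cube, (age_idx, year_idx, cause_idx), datasets)

# cube[i, j, k] now holds the summed values for
# (age_vals[i], year_vals[j], cause_vals[k])
```

This replaces the 60,000-iteration triple loop with a single vectorized pass over the rows.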
You might also consider using a proper relational database. They're quite fast at these sorts of things. Python ships with sqlite3 as a part of the core. Unfortunately, it's a rather steep learning curve if you've never worked with a relational database before. You'll want to use the GROUP BY and aggregate functionality.
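For illustration, here is a rough sqlite3 sketch under the same assumptions about `age`, `year`, `cause`, and `datasets` from the question; the table name and schema are made up, with each of the 1000 values stored as its own row keyed by its slot position:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# hypothetical schema: one row per (age, year, cause, slot) with a single value
cur.execute("""CREATE TABLE data
               (age INTEGER, year INTEGER, cause INTEGER,
                slot INTEGER, value REAL)""")

# flatten the (rows, 1000) array into individual value rows
rows = (
    (int(a), int(y), int(c), s, float(v))
    for a, y, c, vals in zip(age, year, cause, datasets)
    for s, v in enumerate(vals)
)
cur.executemany("INSERT INTO data VALUES (?, ?, ?, ?, ?)", rows)

# GROUP BY does the per-characteristic aggregation in one query
cur.execute("""SELECT age, year, cause, slot, SUM(value)
               FROM data
               GROUP BY age, year, cause, slot""")
summed = cur.fetchall()
```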
Upvotes: 2