Reputation: 43
All,
I am computing some feature values using the following Python code, but because the inputs are so big it is very time-consuming. Please help me optimize the code.
import numpy as np

# IDs seen only at pickup / only at dropoff / at both
leaving_volume = len([x for x in pickup_ids if x not in dropoff_ids])
arriving_volume = len([x for x in dropoff_ids if x not in pickup_ids])
transition_volume = len([x for x in dropoff_ids if x in pickup_ids])

union_ids = list(set(pickup_ids + dropoff_ids))

# mean distance-per-fare over the stops with a positive fare
busstop_ids = [x for x in union_ids if self.geoitems[x].fare > 0]
busstop_density = np.sum([Util.Geodist(self.geoitems[x].orilat, self.geoitems[x].orilng, self.geoitems[x].destlat, self.geoitems[x].destlng) / (1000 * self.geoitems[x].fare) for x in busstop_ids]) / len(busstop_ids) if len(busstop_ids) > 0 else 0

# mean balance over the stops with a positive balance
busstop_ids = [x for x in union_ids if self.geoitems[x].balance > 0]
smartcard_balance = np.sum([self.geoitems[x].balance for x in busstop_ids]) / len(busstop_ids) if len(busstop_ids) > 0 else 0
Hi, All,
Here is my revised version. I ran it on my GPS-trace data, and it is faster.
pickup_set = set(pickup_ids)
dropoff_set = set(dropoff_ids)
intersect_ids = pickup_set & dropoff_set
union_ids = pickup_set | dropoff_set

leaving_volume = len(pickup_set - intersect_ids)
arriving_volume = len(dropoff_set - intersect_ids)
transition_volume = len(intersect_ids)

# Build the filtered lists first: np.mean([]) returns nan (with a warning),
# so fall back to 0 explicitly when a selection is empty.
densities = [Util.Geodist(self.geoitems[x].orilat, self.geoitems[x].orilng, self.geoitems[x].destlat, self.geoitems[x].destlng) / (1000 * self.geoitems[x].fare) for x in union_ids if self.geoitems[x].fare > 0]
busstop_density = np.mean(densities) if densities else 0

balances = [self.geoitems[x].balance for x in union_ids if self.geoitems[x].balance > 0]
smartcard_balance = np.mean(balances) if balances else 0
Many thanks for the help.
Upvotes: 0
Views: 178
Reputation: 2573
I can only second what machine yerning wrote in his post. If you are thinking of switching to numpy: assuming your variables pickup_ids and dropoff_ids are numpy arrays (maybe they already are; if not, do
dropoff_ids = np.array(dropoff_ids, dtype='i')
pickup_ids = np.array(pickup_ids, dtype='i')
then you can make use of the function np.in1d(), which gives you a True/False array that you can simply sum over to get the total number of True entries:
leaving_volume = (~np.in1d(pickup_ids, dropoff_ids)).sum()   # ~ negates the boolean mask
transition_volume = np.in1d(dropoff_ids, pickup_ids).sum()
arriving_volume = (~np.in1d(dropoff_ids, pickup_ids)).sum()
In fact, transition_volume = len(dropoff_ids) - arriving_volume, since those two masks are complements of each other over dropoff_ids.
Another function that could be useful to you is np.unique(), if you want to get rid of duplicate entries; in a way it turns your array into a set.
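A minimal sketch of how np.unique() could feed into the counts above, assuming duplicate IDs should only be counted once:

unique_pickups = np.unique(pickup_ids)      # sorted array with duplicates removed
unique_dropoffs = np.unique(dropoff_ids)

# same counts as before, but each ID now contributes at most once
transition_volume = np.in1d(unique_dropoffs, unique_pickups).sum()
leaving_volume = (~np.in1d(unique_pickups, unique_dropoffs)).sum()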
Upvotes: 0
Reputation: 10129
Just a few things I noticed, as some Python efficiency trivia:
if x not in dropoff_ids

Checking for membership using the in operator is more efficient on a set than on a list. But iterating with for through a list is probably more efficient than over a set. So if you want your first two lines to be as efficient as possible, you should have both types of data structure around beforehand.
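A minimal sketch of that idea, applied to the first two lines from the question (assuming pickup_ids and dropoff_ids start out as plain lists):

dropoff_set = set(dropoff_ids)    # set copies give O(1) membership tests
pickup_set = set(pickup_ids)

# iterate over the lists, but test membership against the sets
leaving_volume = len([x for x in pickup_ids if x not in dropoff_set])
arriving_volume = len([x for x in dropoff_ids if x not in pickup_set])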
list(set(pickup_ids + dropoff_ids))

It's more efficient to create your sets before you combine data, rather than creating a long list and constructing a set from it. Luckily you probably already have the set versions around now (see the first comment)!
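For example, with the set versions from the sketch above, the union is a single set operation and the long intermediate list never gets built:

# union of two existing sets instead of list(set(pickup_ids + dropoff_ids))
union_ids = pickup_set | dropoff_set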
Above all you need to ask yourself the question:
Is the time I save by constructing extra data structures worth the time it takes to construct them?
Next one:
np.sum([...])
I've been trained by Python to think of constructing a list and then applying a function that theoretically only requires a generator as a code smell. I'm not sure whether that applies in numpy, since from what I remember it's not completely straightforward to pull data from a generator into a numpy structure.
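For what it's worth, np.fromiter() is one way to fill an array straight from a generator; here is a minimal, self-contained sketch with made-up data:

import numpy as np

# generator expression: no intermediate list is materialised
squares = (x * x for x in range(10) if x % 2 == 0)
arr = np.fromiter(squares, dtype=float)
total = arr.sum()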
It looks like this is just a small fragment of your code. If you're really concerned about efficiency, I'd recommend using numpy arrays rather than lists and sticking to numpy's built-in data structures and functions as much as possible. They are likely more highly optimized for raw data crunching in C than the built-in Python functions.
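As a hedged sketch of what staying inside numpy might look like for the density feature (the arrays here are hypothetical, assuming the per-stop distances and fares have already been pulled out of geoitems):

import numpy as np

# hypothetical per-stop arrays, one entry per ID in union_ids
fares = np.array([1.5, 0.0, 2.0])
dists_m = np.array([3000.0, 1200.0, 5000.0])   # Geodist results in metres

mask = fares > 0                               # keep fare-paying stops only
busstop_density = (dists_m[mask] / (1000 * fares[mask])).mean() if mask.any() else 0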
If you're really, really concerned about efficiency, then you should probably be doing this data analysis straight up in C. Especially if you don't have much more code than what you've presented here, it might be pretty easy to translate over.
Upvotes: 3