Jack Fu

Reputation: 43

How to Optimize the Python Code

All,

I am computing some feature values using the following Python code, but because the inputs are very large it is extremely time-consuming. Please help me optimize this code.

  leaving_volume=len([x for x in pickup_ids if x not in dropoff_ids])
  arriving_volume=len([x for x in dropoff_ids if x not in pickup_ids])
  transition_volume=len([x for x in dropoff_ids if x in pickup_ids])

  union_ids=list(set(pickup_ids + dropoff_ids))
  busstop_ids=[x for x in union_ids if self.geoitems[x].fare>0]
  busstop_density=np.sum([Util.Geodist(self.geoitems[x].orilat, self.geoitems[x].orilng, self.geoitems[x].destlat, self.geoitems[x].destlng)/(1000*self.geoitems[x].fare) for x in busstop_ids])/len(busstop_ids) if len(busstop_ids) > 0 else 0
  busstop_ids=[x for x in union_ids if self.geoitems[x].balance>0]
  smartcard_balance=np.sum([self.geoitems[x].balance for x in busstop_ids])/len(busstop_ids) if len(busstop_ids) > 0 else 0

Hi, All,

Here is my revised version. I ran it on my GPS trace data and it is faster.

intersect_ids = set(pickup_ids).intersection(set(dropoff_ids))
union_ids = list(set(pickup_ids + dropoff_ids))
leaving_ids = set(pickup_ids) - intersect_ids
leaving_volume = len(leaving_ids)
arriving_ids = set(dropoff_ids) - intersect_ids
arriving_volume = len(arriving_ids)
transition_volume = len(intersect_ids)

busstop_density = np.mean([Util.Geodist(self.geoitems[x].orilat, self.geoitems[x].orilng,
                                        self.geoitems[x].destlat, self.geoitems[x].destlng)
                           / (1000 * self.geoitems[x].fare)
                           for x in union_ids if self.geoitems[x].fare > 0])
if not busstop_density > 0:  # np.mean of an empty list is nan, and nan > 0 is False
    busstop_density = 0
smartcard_balance = np.mean([self.geoitems[x].balance for x in union_ids if self.geoitems[x].balance > 0])
if not smartcard_balance > 0:
    smartcard_balance = 0
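As a side note on why the `if not ... > 0` guard works: np.mean of an empty list returns nan, and every comparison against nan evaluates to False. A minimal sketch (an explicit isnan check may read more clearly):

```python
import math
import warnings

import numpy as np

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # np.mean([]) also emits a RuntimeWarning
    m = np.mean([])                  # nan for empty input

# nan is not > 0, so the original guard fires; math.isnan makes the intent explicit
if math.isnan(m):
    m = 0.0
print(m)  # 0.0
```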

Many thanks for the help.

Upvotes: 0

Views: 178

Answers (2)

Magellan88

Reputation: 2573

I can only support what machine yearning wrote in his post. If you are thinking of switching to numpy: if your variables pickup_ids and dropoff_ids were numpy arrays (maybe they already are; if not, do):

dropoff_ids = np.array(dropoff_ids, dtype='i')
pickup_ids = np.array(pickup_ids, dtype='i')

then you can make use of the function np.in1d(), which gives you a True/False array that you can simply sum over to get the total number of True entries.

leaving_volume    = (~np.in1d(pickup_ids, dropoff_ids)).sum()
transition_volume = np.in1d(dropoff_ids, pickup_ids).sum()
arriving_volume   = (~np.in1d(dropoff_ids, pickup_ids)).sum()

Somehow I have the feeling that transition_volume = len(dropoff_ids) - arriving_volume, but I'm not 100% sure right now.

Another function that could be useful to you is np.unique(), if you want to get rid of duplicate entries; in a way it turns your array into a set.
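For example (with made-up sample data), np.unique returns the sorted distinct values:

```python
import numpy as np

ids = np.array([3, 1, 2, 3, 1])  # sample data with duplicates
unique_ids = np.unique(ids)      # sorted, duplicates removed
print(unique_ids)                # [1 2 3]
```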

Upvotes: 0

machine yearning

Reputation: 10129

Just a few things I noticed, as some Python efficiency trivia:

if x not in dropoff_ids

Checking for membership with the in operator is more efficient on a set than on a list, but iterating through a list with for is probably more efficient than iterating through a set. So if you want your first two lines to be as efficient as possible, you should have both types of data structure around beforehand.
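A sketch of that idea, reusing the variable names from the question (the sample lists here are made up):

```python
pickup_ids = [1, 2, 3, 4]  # sample data; the real lists come from the question
dropoff_ids = [3, 4, 5]

# Build each set once: every `in` check is then O(1) on average
# instead of a linear scan of the list.
dropoff_set = set(dropoff_ids)
pickup_set = set(pickup_ids)

# Iterate over the lists, test membership against the sets
leaving_volume = len([x for x in pickup_ids if x not in dropoff_set])
arriving_volume = len([x for x in dropoff_ids if x not in pickup_set])
print(leaving_volume, arriving_volume)  # 2 1
```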

list(set(pickup_ids + dropoff_ids))

It's more efficient to create your sets before you combine the data, rather than building one long list and constructing a set from it. Luckily you probably already have the set versions around now (see the first point)!
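For instance, the union can be taken directly on the two sets, skipping the concatenated list entirely (sample data again):

```python
pickup_ids = [1, 2, 3, 4]  # sample data
dropoff_ids = [3, 4, 5]

# Set union instead of list(set(pickup_ids + dropoff_ids)):
# no intermediate concatenated list is built.
union_ids = set(pickup_ids) | set(dropoff_ids)
print(union_ids == {1, 2, 3, 4, 5})  # True
```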

Above all you need to ask yourself the question:

Is the time I save by constructing extra data structures worth the time it takes to construct them?

Next one:

np.sum([...])

I've been trained by Python to think of constructing a list and then applying a function that theoretically only requires a generator as a code smell. I'm not sure whether this applies in numpy, since, from what I remember, it's not completely straightforward to pull data from a generator into a numpy structure.
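For what it's worth, numpy can consume a generator via np.fromiter, which avoids materializing the intermediate list (a sketch; the dtype has to be given up front):

```python
import numpy as np

# Sum of squares without building a Python list first
total = np.fromiter((x * x for x in range(5)), dtype=np.float64).sum()
print(total)  # 30.0
```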

It looks like this is just a small fragment of your code. If you're really concerned about efficiency, I'd recommend using numpy arrays rather than lists, and sticking with numpy's built-in data structures and functions as much as possible. They are likely more highly optimized for raw data crunching in C than the built-in Python functions.

If you're really, really concerned about efficiency then you should probably be doing this data analysis straight-up in C. Especially if you don't have much more code than what you've presented here it might be pretty easy to translate over.

Upvotes: 3
