Olga

Reputation: 1455

How can I make this Python code faster, for example using scipy.weave?

In Python I have two arrays, lon and lat, with more than 500 000 points each.

    import numpy

    lon = numpy.array()  # placeholder: the real longitudes are loaded elsewhere
    lat = numpy.array()  # placeholder: the real latitudes are loaded elsewhere

    flag = True
    while flag:
        lon1 = lon[:-1]
        lon2 = lon[1:]
        lat1 = lat[:-1]
        lat2 = lat[1:]

        # distance between consecutive points
        x = (lon2 - lon1)
        y = (lat2 - lat1)
        d = numpy.sqrt(x * x + y * y)

        dmin = numpy.min(d)
        if dmin < 0.015:
            j = numpy.where(d == dmin)[0][0]

            # replace points j and j + 1 by their midpoint
            lon[j] = (lon[j] + lon[j + 1]) / 2
            lat[j] = (lat[j] + lat[j + 1]) / 2

            lon = numpy.delete(lon, j + 1)
            lat = numpy.delete(lat, j + 1)

        else:
            flag = False

This code runs very slowly!

  1. Please advise: how can I make this code faster?

  2. I know there is scipy.weave. Please advise: how can I rewrite this code using scipy.weave?

    import scipy
    from scipy import weave
    
    codeC = """
    ???
    """
    scipy.weave.inline(codeC, ['lon', 'lat'], compiler='gcc')
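
For example, I imagine the distance computation could be moved into C with something like this (only a rough sketch of what I am after: it assumes the blitz type converters and a working gcc, and it does not yet restructure the merging loop itself):

    import numpy
    from scipy import weave
    from scipy.weave import converters

    # sketch: fill d with the squared consecutive-point distances in C
    codeC = """
    for (int i = 0; i < Nlon[0] - 1; ++i) {
        double x = lon(i + 1) - lon(i);
        double y = lat(i + 1) - lat(i);
        d(i) = x * x + y * y;
    }
    """
    d = numpy.empty(lon.size - 1)
    weave.inline(codeC, ['lon', 'lat', 'd'],
                 type_converters=converters.blitz, compiler='gcc')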
    

Thanks.

Update:

    import numpy as np

    min_radius = 0.015 ** 2  # squared threshold, since d holds squared distances

    lon1 = lon[:-1]
    lon2 = lon[1:]
    lat1 = lat[:-1]
    lat2 = lat[1:]

    x = (lon2 - lon1)
    y = (lat2 - lat1)
    d = x * x + y * y

    while d.size > 1:

        j = np.argmin(d)
        if d[j] > min_radius:
            break

        lon[j] = (lon[j] + lon[j + 1]) / 2
        lat[j] = (lat[j] + lat[j + 1]) / 2

        # was: lon = np.delete(lon, j + 1)
        #      lat = np.delete(lat, j + 1)
        #      d = np.delete(d, j)
        if j == (d.size - 1):
            lon = lon[:j + 1]
            lat = lat[:j + 1]
            d = d[:j]
        else:
            lon = np.hstack((lon[:j + 1], lon[j + 2:]))
            lat = np.hstack((lat[:j + 1], lat[j + 2:]))
            d = np.hstack((d[:j], d[j + 1:]))

        # only the two distances next to the merged point change
        if j > 0:
            x = lon[j] - lon[j - 1]
            y = lat[j] - lat[j - 1]
            d[j - 1] = x * x + y * y
        if j < d.size:  # j + 1 is still a valid index into lon here
            x = lon[j + 1] - lon[j]
            y = lat[j + 1] - lat[j]
            d[j] = x * x + y * y

@ilmiacs, @Jaime, @Luke, thanks! It works much faster now!

But on the whole the code still takes very long on an array of 500 000 points... :(((

Upvotes: 2

Views: 194

Answers (1)

ilmiacs

Reputation: 2576

You are redoing the whole expensive calculation on every iteration of your loop, even though each iteration removes only a single node: to remove one element you copy the whole arrays and recompute every distance, and only then do you search for the next node.

Thus, your code is suboptimal, running in at least O(N^2) time. For what you want to achieve, however, O(N log N) or possibly even O(N) would be sufficient.

A simple Python module without numpy should be fast enough if you rearrange the algorithm. Recipe:

  1. Avoid recalculating the whole distance array. Only two values need to be recalculated when you remove one node.
  2. Avoid the copying. It is expensive as well.

Hope this helps.

EDIT:

An example implementation that avoids recalculating the whole distance array would be this:

    import numpy

    lon = numpy.array()  # placeholder for the real data
    lat = numpy.array()

    lon1 = lon[:-1]
    lon2 = lon[1:]
    lat1 = lat[:-1]
    lat2 = lat[1:]

    # squared distance between consecutive points
    x = (lon2 - lon1)
    y = (lat2 - lat1)
    d = x * x + y * y

    while d.size > 0:
        dmin = numpy.min(d)
        if dmin < 0.015 * 0.015:
            j = numpy.where(d == dmin)[0][0]
        else:
            break

        lon[j] = (lon[j] + lon[j + 1]) / 2
        lat[j] = (lat[j] + lat[j + 1]) / 2

        lon = numpy.delete(lon, j + 1)  # <-- you spend most of the time here
        lat = numpy.delete(lat, j + 1)  # <-- and here

        d = numpy.delete(d, j)          # <-- and here

        # only the two distances adjacent to the merged point change;
        # the guards protect the ends of the array
        if j > 0:
            x = lon[j] - lon[j - 1]
            y = lat[j] - lat[j - 1]
            d[j - 1] = x * x + y * y
        if j < d.size:
            x = lon[j + 1] - lon[j]
            y = lat[j + 1] - lat[j]
            d[j] = x * x + y * y

Although this will give you a boost, you are still at O(N^2) because of the extensive copying. To avoid that, you'd need to store your points in a doubly linked list; removing an element is then O(1) instead of O(N). With the linked list, your worst remaining operation is finding the minimum. You could improve that by maintaining a sorted version of d in which you also keep track of each distance's position, updating it whenever entries change. That would finally get you to an algorithm that is O(N log N).
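
As an illustration of the idea (a sketch only: it substitutes a heap with lazy invalidation for the maintained sorted list, which achieves the same O(N log N) bound, and the function name is mine):

    import heapq

    def merge_close_points(lon, lat, min_dist=0.015):
        """Doubly linked list over indices plus a lazy-deletion heap:
        removing a node is O(1), each heap operation is O(log N)."""
        lon, lat = list(lon), list(lat)
        n = len(lon)
        if n < 2:
            return lon, lat
        prev = list(range(-1, n - 1))   # prev[i]: node before i (-1 = none)
        nxt = list(range(1, n + 1))     # nxt[i]: node after i
        nxt[-1] = -1
        alive = [True] * n
        min_d2 = min_dist * min_dist    # compare squared distances

        def d2(i, j):
            x, y = lon[j] - lon[i], lat[j] - lat[i]
            return x * x + y * y

        # heap entries: (squared distance, left node, right node)
        heap = [(d2(i, i + 1), i, i + 1) for i in range(n - 1)]
        heapq.heapify(heap)

        while heap:
            d, i, j = heapq.heappop(heap)
            # lazy invalidation: skip entries that no longer describe a link
            if not (alive[i] and alive[j]) or nxt[i] != j or d != d2(i, j):
                continue
            if d >= min_d2:
                break                   # smallest remaining link is long enough
            # merge j into i and unlink j in O(1)
            lon[i] = (lon[i] + lon[j]) / 2
            lat[i] = (lat[i] + lat[j]) / 2
            alive[j] = False
            nxt[i] = nxt[j]
            if nxt[j] != -1:
                prev[nxt[j]] = i
            # only the (at most) two distances around i changed: push fresh entries
            if prev[i] != -1:
                heapq.heappush(heap, (d2(prev[i], i), prev[i], i))
            if nxt[i] != -1:
                heapq.heappush(heap, (d2(i, nxt[i]), i, nxt[i]))

        keep = [i for i in range(n) if alive[i]]
        return [lon[i] for i in keep], [lat[i] for i in keep]

The heap sidesteps the position-tracking problem: stale entries are simply skipped when popped instead of being updated in place.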

EDIT 2:

Taking the square root is not necessary either, as pointed out by @Jaime; comparing the squared distances works just as well. I have corrected the code above.

EDIT 3:

Over the weekend I was thinking more about the algorithm, which turns out to be a very interesting problem and fun to think about. I wanted to share my thoughts.

Getting this to work in O(N log N) is less trivial than I thought. I suggested keeping a list sorted_d and managing it. Each time a link is removed, we have to update two entries in sorted_d, which means finding the new positions of the two corresponding distances in sorted_d. In a normal list, finding the position can be done by bisection in O(log N). However, as far as I understand it, bisection is not viable in a linked list, so this step turns O(N). This is the problem. I would be pleased if anyone could prove me wrong and show a way to insert an element into a sorted linked list in O(log N); I personally don't see it.

There is an alternative to the linked list for sorted_d, however. All one needs is a B-tree or a Patricia tree that works with floats as keys. Insertion into such a tree can be done in a (quite fast) O(log N). However, I do not know of any standard implementations of such a tree; standard implementations work on strings or integers. We could therefore achieve it if we had an order-preserving, collision-free mapping from floats to ints. Such a mapping does not exist in general, but one can certainly be constructed for any particular dataset, and the dataset in question is certainly particular: there is an upper bound and a finite precision.
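
For instance, non-negative IEEE-754 doubles compare in the same order as their raw 64-bit patterns, which gives exactly such a mapping for the squared distances (a minimal sketch):

    import struct

    def float_key(x):
        """Order-preserving, collision-free mapping from a non-negative
        float to an integer: for non-negative IEEE-754 doubles, the raw
        bit pattern read as an unsigned integer is monotonic in the
        float's value, so it can serve as a key in an integer tree."""
        assert x >= 0.0
        return struct.unpack('<Q', struct.pack('<d', x))[0]

    # the squared distances are >= 0, so the ordering is preserved:
    assert float_key(0.0) < float_key(1e-9) < float_key(0.015 ** 2) < float_key(1.0)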

Hope this helps. If there is further interest, I could also present my thoughts on an optimal data structure for this problem.

Upvotes: 5
