Dima

Reputation: 47

How to speed up a Python function call

I have a dataset; for simplicity I will show only the main feature, postalCode. I need to derive another feature (the main post office of the area) through a function call and add it to the dataframe (sample below).

Both are integers.

postalCode mainPostCode
12345 12301
23456 23407
34567 34504

Some words about the function: it takes the first 3 digits of postalCode and then takes, from the list of all zipcodes, the minimum value that starts with these 3 digits.

You will not always find a value in this list that looks like XXX01; it can be XXX05 or XXX07 or XXX(any other). Let's assume it can be any number.

The list of zipcodes looks like this (about 40K elements):

zipcode = [1001,1002,...,99999]

My function looks like:

def findMainPostOffice(num):
    ''' takes zip and returns nearest available main zip in list 'zipcode' '''
    start = int(str(num // 100) + '00')
    m = min([i for i in zipcode if i > start and i < num], default=num)
    return m

I call this function like this:

df['mainPostCode'] = df.postalCode.apply(findMainPostOffice)

The problem is that this function takes a very long time: on my dataset it would take about 72 hours. Could you please help me speed it up?

Upvotes: 2

Views: 95

Answers (4)

westandskif

Reputation: 982

Pre-sort your list of main post office zip codes and use binary search.

from bisect import bisect_right

# MAKE SURE THEY ARE SORTED
zip_codes = list(range(1000, 100000))

# BEFORE
def findMainPostOffice(num):
    """takes zip and returns nearest available main zip in list 'zipcode'"""
    start = int(str(num // 100) + "00")
    m = min([i for i in zip_codes if i > start and i < num], default=num)
    return m

# AFTER
def find_main_post_office(zip_code):
    start = int(str(zip_code // 100) + "00")
    index = bisect_right(zip_codes, start)
    try:
        return min(zip_codes[index], zip_code)
    except IndexError:
        return zip_code

"""
In [11]: %timeit findMainPostOffice(20050)
1.96 ms ± 1.33 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [16]: %timeit find_main_post_office(20050)
463 ns ± 0.462 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
"""

Upvotes: 1

Corralien

Reputation: 120519

I'm adding another answer because my understanding is totally different, so the method is too. You can use merge_asof:

# Sort your dataframe and create the postal code prefix
df = df.sort_values('postalCode').assign(prefix=lambda x: x['postalCode'].astype(str).str.zfill(5).str[:2])

# Convert your zipcode list as dataframe and create the postal code prefix
zp = pd.DataFrame({'mainPostCode': zipcode}).sort_values('mainPostCode').assign(prefix=lambda x: x['mainPostCode'].astype(str).str.zfill(5).str[:2])

# Use merge_asof to merge on zip codes but first by prefix
# Look at the nearest lower value
out = (pd.merge_asof(df, zp, direction='backward', by='prefix',
                     left_on='postalCode', right_on='mainPostCode')
         .set_index(df.index).sort_index()
         .drop(columns='prefix').fillna(-1)
         .astype({'mainPostCode': int}))
print(out)

# Output
       postalCode  mainPostCode
0           23041         23039
1           48558         48556
2           52895         52893
3           39817         39808
4           40427         40427
...           ...           ...
39995       81184         81181
39996        7125          7122
39997       22773         22773
39998       88802         88799
39999       58510         58506
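A minimal, self-contained illustration of the merge_asof idea on made-up data (this sketch uses a 3-digit prefix to match the question's wording; the code above groups on the first two digits). Note that `direction='backward'` picks the *nearest* code at or below postalCode within the prefix, not necessarily the prefix group's minimum:

```python
import pandas as pd

# Made-up toy data to show merge_asof semantics
df = pd.DataFrame({'postalCode': [12345, 23456]})
zp = pd.DataFrame({'mainPostCode': [12301, 12399, 23407]})

# Prefix used as the exact-match key; here the first 3 digits
df['prefix'] = df['postalCode'].astype(str).str.zfill(5).str[:3]
zp['prefix'] = zp['mainPostCode'].astype(str).str.zfill(5).str[:3]

# Both frames must be sorted on the merge keys
out = pd.merge_asof(df.sort_values('postalCode'),
                    zp.sort_values('mainPostCode'),
                    by='prefix',
                    left_on='postalCode', right_on='mainPostCode',
                    direction='backward')  # nearest code <= postalCode

print(out['mainPostCode'].tolist())  # [12301, 23407]
```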

Upvotes: 1

Carson

Reputation: 3129

You should try to move as much computation out of the function as possible.

For any prefix, we want the lowest main post office code, so we can create a map from prefixes to main post codes. Since there are only about 1,000 possible prefixes, this doesn't take up much space.

One approach is to create a dict mapping each prefix to its lowest zipcode. For the list of zipcodes [10001, 10002, 20010, 20004], we create the map:

{
    100: 10001,
    200: 20004,
}

We don't care about zip codes 10002 or 20010, because we would never return them.

By only creating the map once, and using it multiple times, we don't have to check the entire list each time we search for a zipcode.

Here's the code to generate the map:

zipcodes = [10001, 10002, 20010, 20004]  # stand-in for the real ~40K codes

prefix_map = {}

for z in zipcodes:
    prefix = z // 100
    if prefix in prefix_map:
        prefix_map[prefix] = min(prefix_map[prefix], z)
    else:
        prefix_map[prefix] = z

And here's the code that uses prefix_map:

def findMainPostOffice(num):
    prefix = num // 100
    if prefix in prefix_map and prefix_map[prefix] < num:
        return prefix_map[prefix]
    return num
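Putting the two pieces together as one runnable sketch (the zipcodes list here is a toy placeholder, and `dict.get` folds the if/else of the map-building loop into one line):

```python
# Toy list standing in for the real ~40K zip codes
zipcodes = [10001, 10002, 20010, 20004]

# Build the prefix -> lowest zipcode map once
prefix_map = {}
for z in zipcodes:
    prefix = z // 100
    prefix_map[prefix] = min(prefix_map.get(prefix, z), z)

def find_main_post_office(num):
    best = prefix_map.get(num // 100, num)
    return best if best < num else num

print(find_main_post_office(10050))  # 10001
print(find_main_post_office(30000))  # 30000 (no zip code with this prefix)
```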

Upvotes: 1

Corralien
Corralien

Reputation: 120519

IIUC, you can use groupby to find the minimum (the main postal code):

df['mainPostCode'] = (df.groupby(df['postalCode'].astype(str).str.zfill(5).str[:2])
                        ['postalCode'].transform('min'))
print(df)

# Output
       postalCode  mainPostCode
0           23041         23003
1           48558         48000
2           52895         52000
3           39817         39000
4           40427         40000
...           ...           ...
39995       81184         81000
39996        7125          7001
39997       22773         22003
39998       88802         88002
39999       58510         58000

[40000 rows x 2 columns]

Input:

import pandas as pd
import numpy as np

np.random.seed(2023)
df = pd.DataFrame({'postalCode': np.random.randint(1000, 100000, 40000)})
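A minimal, self-contained version of the groupby idea on made-up data (using a 3-digit prefix to match the question's wording; the answer above groups on two digits). One caveat: this computes the minimum over the dataframe's own postalCode column, not over the separate zipcode list, so the main post code must itself appear in the dataframe:

```python
import pandas as pd

df = pd.DataFrame({'postalCode': [12345, 12301, 23456, 23407]})

# Group by the zero-padded 3-digit prefix and broadcast the group minimum back
key = df['postalCode'].astype(str).str.zfill(5).str[:3]
df['mainPostCode'] = df.groupby(key)['postalCode'].transform('min')

print(df['mainPostCode'].tolist())  # [12301, 12301, 23407, 23407]
```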

Upvotes: 1
