Reputation: 47
I have a dataset; for simplicity I will show only one main feature, postalCode. I need to derive one more feature (the main post office of the area) through a function call and add it to the dataframe (sample below).
Both are integers.
| postalCode | mainPostCode |
| --- | --- |
| 12345 | 12301 |
| 23456 | 23407 |
| 34567 | 34504 |
Some words about the function: it takes the first 3 digits of postalCode and then takes, from the list of all zipcodes, the minimum value that starts with those 3 digits.
You will not always find a value in this list that looks like XXX01; it can be XXX05, XXX07, or XXX(any other). Let's assume it can be any number.
List of zipcodes looks like that (about 40K elements):
zipcode = [1001,1002,...,99999]
My function looks like:
def findMainPostOffice(num):
    '''takes zip and returns nearest available main zip in list 'zipcode' '''
    start = int(str(num // 100) + '00')
    m = min([i for i in zipcode if i > start and i < num], default=num)
    return m
I call this function like:
df['mainPostCode'] = df.postalCode.apply(findMainPostOffice)
The problem is that this function takes a very long time; on my dataset it would take about 72 hours. Could you please help me speed this up?
Upvotes: 2
Views: 95
Reputation: 982
Pre-sort your main post office zip codes and use binary search.
from bisect import bisect_right

# MAKE SURE THEY ARE SORTED
zip_codes = list(range(1000, 100000))

# BEFORE
def findMainPostOffice(num):
    """takes zip and returns nearest available main zip in list 'zipcode'"""
    start = int(str(num // 100) + "00")
    m = min([i for i in zip_codes if i > start and i < num], default=num)
    return m

# AFTER
def find_main_post_office(zip_code):
    start = int(str(zip_code // 100) + "00")
    index = bisect_right(zip_codes, start)
    try:
        return min(zip_codes[index], zip_code)
    except IndexError:
        return zip_code
"""
In [11]: %timeit findMainPostOffice(20050)
1.96 ms ± 1.33 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [16]: %timeit find_main_post_office(20050)
463 ns ± 0.462 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
"""
Upvotes: 1
Reputation: 120519
I'm adding another answer because my understanding is totally different, so the method is too. You can use merge_asof:
# Sort your dataframe and create the postal code prefix
df = df.sort_values('postalCode').assign(prefix=lambda x: x['postalCode'].astype(str).str.zfill(5).str[:2])

# Convert your zipcode list to a dataframe and create the postal code prefix
zp = pd.DataFrame({'mainPostCode': zipcode}).sort_values('mainPostCode').assign(prefix=lambda x: x['mainPostCode'].astype(str).str.zfill(5).str[:2])

# Use merge_asof to merge on zip codes, matching within each prefix
# and looking at the nearest lower value
out = (pd.merge_asof(df, zp, direction='backward', by='prefix',
                     left_on='postalCode', right_on='mainPostCode')
       .set_index(df.index).sort_index()
       .drop(columns='prefix').fillna(-1)
       .astype({'mainPostCode': int}))
print(out)
# Output
postalCode mainPostCode
0 23041 23039
1 48558 48556
2 52895 52893
3 39817 39808
4 40427 40427
... ... ...
39995 81184 81181
39996 7125 7122
39997 22773 22773
39998 88802 88799
39999 58510 58506
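To see the matching semantics on something small, here is a minimal, self-contained sketch of the same merge_asof pipeline with a toy zipcode list and dataframe (the values are made up for illustration):

```python
import pandas as pd

# Toy data: available zip codes and postal codes to look up.
zipcode = [10001, 10002, 20004, 20010]
df = pd.DataFrame({'postalCode': [10005, 20010, 30000]})

# Both sides must be sorted on the merge key for merge_asof.
df = df.sort_values('postalCode').assign(
    prefix=lambda x: x['postalCode'].astype(str).str.zfill(5).str[:2])
zp = pd.DataFrame({'mainPostCode': zipcode}).sort_values('mainPostCode').assign(
    prefix=lambda x: x['mainPostCode'].astype(str).str.zfill(5).str[:2])

# backward = nearest mainPostCode <= postalCode within the same prefix;
# unmatched rows (no candidate in the prefix) become -1.
out = (pd.merge_asof(df, zp, direction='backward', by='prefix',
                     left_on='postalCode', right_on='mainPostCode')
       .set_index(df.index).sort_index()
       .drop(columns='prefix').fillna(-1)
       .astype({'mainPostCode': int}))
```

Note that `direction='backward'` matches equal values too, which is why exact hits like 20010 map to themselves in the output above.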
Upvotes: 1
Reputation: 3129
You should try to move as much computation out of the function as possible.
For any prefix, we want the lowest main postal code, so we can create a map of prefixes to main postal codes. Since there are only about 1000 possible prefixes, this doesn't take up much space.
One approach is to create a dict mapping each prefix to the minimum zipcode with that prefix. For the list of zipcodes [10001, 10002, 20010, 20004], we create the map:
{
    100: 10001,
    200: 20004,
}
We don't care about zip codes 10002 or 20010, because we would never return them.
By only creating the map once, and using it multiple times, we don't have to check the entire list each time we search for a zipcode.
Here's the code to generate the map:
zipcodes = [10000, 99999]
prefix_map = {}
for z in zipcodes:
    prefix = z // 100
    if prefix in prefix_map:
        prefix_map[prefix] = min(prefix_map[prefix], z)
    else:
        prefix_map[prefix] = z
and here's the code that uses the prefix_map:
def findMainPostOffice(num):
    # prefix_map is only read here, so no global statement is needed
    prefix = num // 100
    if prefix in prefix_map and prefix_map[prefix] < num:
        return prefix_map[prefix]
    return num
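Putting the two pieces together, a minimal end-to-end sketch with a toy zipcode list (values assumed for illustration):

```python
import pandas as pd

# Toy list of available zip codes, matching the worked example above.
zipcodes = [10001, 10002, 20010, 20004]

# Build the prefix -> minimum zipcode map once, up front.
prefix_map = {}
for z in zipcodes:
    prefix = z // 100
    prefix_map[prefix] = min(prefix_map.get(prefix, z), z)

def findMainPostOffice(num):
    prefix = num // 100
    if prefix in prefix_map and prefix_map[prefix] < num:
        return prefix_map[prefix]
    return num

# Each apply call is now a single dict lookup instead of a 40K-element scan.
df = pd.DataFrame({'postalCode': [10005, 20010, 30000]})
df['mainPostCode'] = df['postalCode'].apply(findMainPostOffice)
```

Here 10005 maps to 10001, 20010 maps to 20004 (the lower main zip in its prefix), and 30000 falls back to itself because its prefix has no entry.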
Upvotes: 1
Reputation: 120519
IIUC, you can use groupby to find the minimum (the main postal code):
df['mainPostCode'] = (df.groupby(df['postalCode'].astype(str).str.zfill(5).str[:2])
                        .transform('min'))
print(df)
# Output
postalCode mainPostCode
0 23041 23003
1 48558 48000
2 52895 52000
3 39817 39000
4 40427 40000
... ... ...
39995 81184 81000
39996 7125 7001
39997 22773 22003
39998 88802 88002
39999 58510 58000
[40000 rows x 2 columns]
Input:
import pandas as pd
import numpy as np
np.random.seed(2023)
df = pd.DataFrame({'postalCode': np.random.randint(1000, 100000, 40000)})
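As a side note, for codes below 100000 the zfill(5).str[:2] prefix produces the same grouping as integer floor division by 1000, so the string round-trip can be skipped. A sketch under that assumption:

```python
import pandas as pd
import numpy as np

np.random.seed(2023)
df = pd.DataFrame({'postalCode': np.random.randint(1000, 100000, 40000)})

# Integer prefix: equivalent to astype(str).str.zfill(5).str[:2] for codes < 100000,
# but without building 40K intermediate strings.
df['mainPostCode'] = df.groupby(df['postalCode'] // 1000)['postalCode'].transform('min')
```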
Upvotes: 1