Reputation: 961
I have a dataframe df_input1 with 10M rows. One of its columns is "geolocations". For every record, I have to derive the state name from the geolocation and use it to fill a "State" column in another dataframe, df_final. For that I have created a function convert_to_state and am using it like this:
df_final['State'] = df_input1['geolocations'].apply(convert_to_state)
Is there any faster way to achieve this, since this is taking a lot of time?
sample data: df_input1

vehicle-no  start                end                  geolocations
123         10/12/2019 09:00:12  10/12/2019 11:00:78  fghdrf3245@bafd
456         12/10/2019 06:09:12  10/10/2019 09:23:12  {098ddc76yhfbdb7877]
The custom function:
import reverse_geocoder as rg
import polyline
def convert_to_state(geoloc):
    # polyline.decode returns (lat, lon) tuples; take the first point
    lat_lon = polyline.decode(geoloc)[0]
    # reverse-geocode the point; "admin1" holds the state-level region name
    state_name = rg.search(lat_lon)[0]["admin1"]
    return state_name
Upvotes: 0
Views: 283
Reputation: 1518
I would suggest using NumPy's np.vectorize to wrap the function:
import numpy as np
import pandas as pd
import reverse_geocoder as rg
import polyline
def convert_to_state(geoloc):
    lat_lon = polyline.decode(geoloc)[0]
    state_name = rg.search(lat_lon)[0]["admin1"]
    return state_name

convert_to_state = np.vectorize(convert_to_state)  # vectorize the function

col = df_input1['geolocations'].values  # the column as a NumPy array
df_final['State'] = pd.Series(convert_to_state(col))
Running the vectorized function on the NumPy array should give a noticeable boost; the result is then converted back into a pandas Series.
I would highly recommend timing this method against the normal .apply method with the %timeit magic in IPython, comparing the runtimes on smaller subsets first:
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: x = pd.DataFrame(
...: [
...: [1,2,"Some.Text"],
...: [3,4,"More.Text"]
...: ],
...: columns = ["A","B", "C"]
...: )
In [4]: x
Out[4]:
A B C
0 1 2 Some.Text
1 3 4 More.Text
In [5]: def foo_split(t):
...: return t.split(".")[0]
...:
In [6]: %timeit y = x.C.apply(foo_split)
248 µs ± 4.09 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [7]: c = x.C.values # numpy array of the column
In [8]: foo_split_vect = np.vectorize(foo_split)
In [9]: %timeit z = pd.Series(foo_split_vect(c))
159 µs ± 624 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
You may see that the vectorized version is roughly 1.5x faster in this toy case (248 µs vs. 159 µs per loop).
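Applied to the question's data, a minimal version of that comparison could look like the following (run in IPython; this assumes df_input1 and the plain, un-vectorized convert_to_state from the question are defined, and the 10000-row slice size is an arbitrary choice):

# Time both approaches on a small slice before committing to all 10M rows.
sample = df_input1['geolocations'].head(10000)
convert_to_state_vect = np.vectorize(convert_to_state)

%timeit sample.apply(convert_to_state)                    # plain .apply
%timeit pd.Series(convert_to_state_vect(sample.values))   # np.vectorize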
Upvotes: 1
Reputation: 1149
Since the subroutine is purely functional (each row's processing is independent of the other rows), we can parallelize it across cores to make it run faster.
You can use the swifter library. Install it first from the command prompt:

pip install swifter

Then:
import swifter
df_final['State'] = df_input1['geolocations'].swifter.apply(convert_to_state)
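Note that swifter times a small sample first and only switches to parallel (Dask-based) execution when it expects a win. If you would rather avoid the extra dependency, here is a minimal sketch of the same row-independent idea using the standard library's multiprocessing instead of swifter (the chunksize of 1000 is an arbitrary choice, and convert_to_state must be defined at module level so the worker processes can import it):

from multiprocessing import Pool

if __name__ == '__main__':
    with Pool() as pool:  # one worker process per CPU core by default
        df_final['State'] = pool.map(convert_to_state,
                                     df_input1['geolocations'],
                                     chunksize=1000)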
Upvotes: 0