Reputation: 21971
>>> arr
array([[ 0., 10.,  0., ...,  0.,  0.,  0.],
       [ 0.,  4.,  0., ...,  6.,  0.,  9.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ...,
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  2.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  3.,  0.]])
In the numpy array above, I would like to replace every value that matches the column country_codes in the dataframe (df_A) with the value from the column continent_codes in df_A. df_A looks like:
   country_codes  continent_codes
0              4                4
1              8                3
2             12                5
3             16                6
4             24                5
Right now, I loop through the dataframe and replace values using numpy boolean indexing. Given that iterrows() tends to be slow, is there a more direct/vectorized way to do this?
for index, row in self.df_A.iterrows():
    arr[arr == row['country_codes']] = row['continent_codes']
Upvotes: 5
Views: 1821
Reputation: 18628
With this data as an example, with at most N countries:
import numpy as np
import pandas as pd
N = 10**5
values = np.random.randint(0, N, (1000, 1000))
example = {'country': np.arange(N//2), 'continent': np.random.randint(1, 5, N//2)}
df = pd.DataFrame.from_dict(example)
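You can just build a lookup table and apply it in place:
v = np.arange(N)                      # identity mapping for codes that are not in df
v[df['country']] = df['continent']    # overwrite the entries for known country codes
v.take(values, out=values)            # replace every value by its lookup-table entry
This is probably not optimal, but it is efficient (about 20 ms on this data).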
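As a minimal sketch of the same lookup-table idea on the small sample data from this thread: it assumes all values are non-negative integers (the float array in the question would need a cast first), and the names `lookup`, `size` and `result` are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample mapping taken from the question's df_A
df_A = pd.DataFrame({
    "country_codes": [4, 8, 12, 16, 24],
    "continent_codes": [4, 3, 5, 6, 5],
})

# Small integer array used for illustration
arr = np.array([[23, 4, 23, 5, 8],
                [3, 6, 8, 5, 11],
                [16, 24, 15, 4, 10],
                [4, 16, 10, 8, 1]])

# The table must cover every value that can occur in arr or in the codes
size = max(arr.max(), df_A["country_codes"].max()) + 1

# Identity mapping by default, so unmatched values stay unchanged
lookup = np.arange(size)
lookup[df_A["country_codes"].to_numpy()] = df_A["continent_codes"].to_numpy()

# Fancy indexing applies the mapping to every element at once
result = lookup[arr]
print(result)
```

The table costs memory proportional to the largest code, so this fits best when the codes are small, dense integers.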
Upvotes: 1
Reputation: 221564
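Approach #1 : One vectorized approach uses np.searchsorted and np.in1d, as listed below. Note that np.searchsorted assumes the country_codes column is sorted in ascending order, as it is in the sample data -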
# Store country_codes and continent_codes column data for further usage
oldval = np.array(df['country_codes'])
newval = np.array(df['continent_codes'])
# Mask of elements to be changed
mask = np.in1d(arr,oldval)
# Indices for each match from oldval in arr
idx = np.searchsorted(oldval,arr.ravel()[mask])
# Using the mask put selective elements from continent_codes column into arr
arr.ravel()[mask] = newval[idx]
Sample run -
>>> arr # Original 2D array
array([[23,  4, 23,  5,  8],
       [ 3,  6,  8,  5, 11],
       [16, 24, 15,  4, 10],
       [ 4, 16, 10,  8,  1]])
>>> df
   country_codes  continent_codes
0              4                4
1              8                3
2             12                5
3             16                6
4             24                5
>>> oldval = np.array(df['country_codes'])
>>> newval = np.array(df['continent_codes'])
>>> mask = np.in1d(arr,oldval)
>>> idx = np.searchsorted(oldval,arr.ravel()[mask])
>>> arr.ravel()[mask] = newval[idx]
>>> mask.reshape(arr.shape) # Mask array depicting which elements were updated
array([[False,  True, False, False,  True],
       [False, False,  True, False, False],
       [ True,  True, False,  True, False],
       [ True,  True, False,  True, False]], dtype=bool)
>>> arr # Updated 2D array
array([[23,  4, 23,  5,  3],
       [ 3,  6,  3,  5, 11],
       [ 6,  5, 15,  4, 10],
       [ 4,  6, 10,  3,  1]])
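Approach #1 relies on np.searchsorted, which only gives correct indices when oldval is sorted. If the country codes might arrive unsorted, one possible variant is to sort both columns together first. The sketch below is an illustration under that assumption, not part of the original answer: the helper name `replace_codes` is hypothetical, it uses np.isin (the newer spelling of np.in1d that keeps the input shape), and it returns a copy instead of modifying arr in place.

```python
import numpy as np

def replace_codes(arr, oldval, newval):
    """Return a copy of arr with each value found in oldval replaced by
    the matching entry of newval. oldval does not need to be sorted."""
    oldval = np.asarray(oldval)
    newval = np.asarray(newval)

    # Sort the codes once and reorder the replacements the same way
    order = np.argsort(oldval)
    sorted_old = oldval[order]
    sorted_new = newval[order]

    out = arr.copy()
    # Boolean mask of elements that have a replacement
    mask = np.isin(out, sorted_old)
    # Position of each matching element within the sorted codes
    idx = np.searchsorted(sorted_old, out[mask])
    out[mask] = sorted_new[idx]
    return out

arr = np.array([[23, 4, 23, 5, 8],
                [3, 6, 8, 5, 11],
                [16, 24, 15, 4, 10],
                [4, 16, 10, 8, 1]])

# Same mapping as the sample df, deliberately shuffled
shuffled_old = [16, 4, 24, 8, 12]
shuffled_new = [6, 4, 5, 3, 5]

result = replace_codes(arr, shuffled_old, shuffled_new)
print(result)
```

Working on a copy also avoids depending on arr.ravel() returning a view, which holds only for contiguous arrays.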
Approach #2 : As a variant, you can also create the mask with a comparison between np.searchsorted(oldval,arr,'left') and np.searchsorted(oldval,arr,'right'), as discussed in the solutions for this question, and re-use np.searchsorted(oldval,arr,'left') again later on while putting values into arr for a more efficient solution, like so -
# Store country_codes and continent_codes column data for further usage
oldval = np.array(df['country_codes'])
newval = np.array(df['continent_codes'])
# Left and right indices for each match from oldval in arr
left_idx = np.searchsorted(oldval,arr,'left')
right_idx = np.searchsorted(oldval,arr,'right')
# Mask of elements to be changed
mask = left_idx!=right_idx
# Using the mask put selective elements from continent_codes column into arr
arr[mask] = newval[left_idx[mask]]
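To see why comparing the left and right insertion points yields a membership mask, here is a small sketch on the sample data. For a value present in the sorted oldval, the 'right' insertion point lies just past the match while the 'left' one lies on it, so the two differ; for an absent value they coincide. The variable `updated` is introduced only for this illustration.

```python
import numpy as np

oldval = np.array([4, 8, 12, 16, 24])   # sorted country codes
newval = np.array([4, 3, 5, 6, 5])      # matching continent codes

arr = np.array([[23, 4, 23, 5, 8],
                [3, 6, 8, 5, 11],
                [16, 24, 15, 4, 10],
                [4, 16, 10, 8, 1]])

left_idx = np.searchsorted(oldval, arr, 'left')
right_idx = np.searchsorted(oldval, arr, 'right')

# The insertion points differ exactly where arr holds a value from oldval
mask = left_idx != right_idx
print(np.array_equal(mask, np.isin(arr, oldval)))

# left_idx already points at the matching code, so it can be re-used
updated = arr.copy()
updated[mask] = newval[left_idx[mask]]
print(updated)
```

Indexing left_idx with the mask also matters for safety: values larger than every code get a left index equal to len(oldval), which would be out of bounds for newval if it were used unmasked.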
Runtime tests and verify outputs
Function definitions -
def original_app(arr,df):
    for index, row in df.iterrows():
        arr[arr == row['country_codes']] = row['continent_codes']

def vectorized_app1(arr,df):
    oldval = np.array(df['country_codes'])
    newval = np.array(df['continent_codes'])
    mask = np.in1d(arr,oldval)
    idx = np.searchsorted(oldval,arr.ravel()[mask])
    arr.ravel()[mask] = newval[idx]

def vectorized_app2(arr,df):
    oldval = np.array(df['country_codes'])
    newval = np.array(df['continent_codes'])
    left_idx = np.searchsorted(oldval,arr,'left')
    right_idx = np.searchsorted(oldval,arr,'right')
    mask = left_idx!=right_idx
    arr[mask] = newval[left_idx[mask]]
Verify outputs -
In [195]: # Input array
...: arr = np.random.randint(0,100000,(1000,1000))
...:
...: # Setup input dataframe
...: N = 1000
...: oldvals = np.unique(np.random.randint(0,100000,N))
...: newvals = np.random.randint(0,9,(oldvals.size))
...: df=pd.DataFrame({'country_codes':oldvals,'continent_codes':newvals})
...: df = df[sorted(df.columns)[::-1]]  # reorder columns; reindex_axis was removed in pandas 1.0
...:
...: # Make copies for input array for funcs to update them
...: arrc1 = arr.copy()
...: arrc2 = arr.copy()
...: arrc3 = arr.copy()
...:
In [196]: # Verify outputs
...: original_app(arrc1,df)
...: vectorized_app1(arrc2,df)
...: vectorized_app2(arrc3,df)
...:
In [197]: np.allclose(arrc1,arrc2)
Out[197]: True
In [198]: np.allclose(arrc1,arrc3)
Out[198]: True
Timings -
In [199]: # Make copies for input array for funcs to update them
...: arrc1 = arr.copy()
...: arrc2 = arr.copy()
...: arrc3 = arr.copy()
...:
In [200]: %timeit original_app(arrc1,df)
1 loops, best of 3: 2.79 s per loop
In [201]: %timeit vectorized_app1(arrc2,df)
1 loops, best of 3: 360 ms per loop
In [202]: %timeit vectorized_app2(arrc3,df)
1 loops, best of 3: 213 ms per loop
Upvotes: 2