Reputation: 23
I am writing a Python program in which I need to check whether a given value is present in a column of a dataset. To do so I iterate over each row and test the column for equality. This takes a lot of time, so I want to run it on the GPU. I have experience in CUDA C/C++ but not in PyCUDA, which I would need to parallelize it. Could anyone help me solve this problem?
for index, row in df.iterrows():
    s1 = set(df.iloc[index]['prop'])
    if temp in s1:
        df.iat[index, df.columns.get_loc('prop')] = 's'
Note: This is only a part of my program; I want to parallelize just this part on the GPU.
Thanks in advance.
Upvotes: 1
Views: 851
Reputation: 5331
The motivation for this approach is to get out of the df.iterrows paradigm, which is relatively slow. While it might be possible to split the data into a dask dataframe and run some kind of parallel apply function, I think a vectorised approach is acceptably quick because of the performance advantage of NumPy/Pandas vectorised operations.
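For reference, here is a minimal sketch of what that dask alternative might look like, assuming temp is a scalar variable as in the question and with an arbitrary npartitions=4; it keeps the row-wise logic and simply spreads it across partitions rather than vectorising it. I haven't benchmarked this.

import dask.dataframe as dd

ddf = dd.from_pandas(df, npartitions=4)        # split the frame into partitions
ddf['prop'] = ddf['prop'].apply(
    lambda lst: 's' if temp in lst else lst,   # same membership test as the loop
    meta=('prop', 'object'),                   # tell dask the output column dtype
)
df = ddf.compute()                             # bring the result back to pandas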
The way I interpret this code is basically: "in the prop column, if the variable temp is in the list held in that column, set prop to 's'".
for index, row in df.iterrows():
    s1 = set(df.iloc[index]['prop'])
    if temp in s1:
        df.iat[index, df.columns.get_loc('prop')] = 's'
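In plain pandas, that loop boils down to a single element-wise apply. A minimal equivalent (assuming temp is a scalar, as in the question) would be:

df['prop'] = df['prop'].apply(lambda lst: 's' if temp in lst else lst)

The vectorised version below does the same job, but also copes with a temp value that varies per row.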
I construct a test dataframe:
import pandas as pd
import numpy as np

df = pd.DataFrame({'temp': ['re'] * 7,
                   'prop': [['re', 'a'], ['ad', 'ed'], ['see', 'contra'], ['loc', 'idx'],
                            ['reader', 'pandas'], ['alpha', 'omega'], ['a', 'z']]})
Then explode to get all the possible combinations of temp against the prop sublist elements. Within each resulting group, I aggregate with any and use this as the masking key for replacing the corresponding prop entry with 's'.
>>> df['result'] = df['prop'].explode().eq(df['temp']).groupby(level=0).any()
>>> df['prop'] = df['prop'].mask(df['result'], 's')
>>> # df['prop'] = np.where(df['result'], 's', df['prop'])  # identical operation
>>> df
temp prop result
0 re s True
1 re [ad, ed] False
2 re [see, contra] False
3 re [loc, idx] False
4 re [reader, pandas] False
5 re [alpha, omega] False
6 re [a, z] False
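If you don't want the helper result column left on the frame, the same logic works with a throwaway mask variable instead:

>>> mask = df['prop'].explode().eq(df['temp']).groupby(level=0).any()
>>> df['prop'] = df['prop'].mask(mask, 's')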
This answer is robust to row-by-row changes in the temp column as well as to a (relatively arbitrary) number of elements in the prop sublists. That said, if your data is large, you should subset first to minimise memory usage: select only the applicable columns, then execute.
Note also that df['prop'].explode().eq(df['temp']) works because the temp column is broadcast on the index to the exploded prop column.
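As a quick illustration of that broadcasting (on a fresh, smaller copy of the test frame, since prop has already been overwritten above): the exploded Series repeats the original index labels, and df['temp'] is aligned to each repeated label.

>>> fresh = pd.DataFrame({'temp': ['re'] * 2, 'prop': [['re', 'a'], ['ad', 'ed']]})
>>> fresh['prop'].explode()                                           # index 0, 0, 1, 1 -> 're', 'a', 'ad', 'ed'
>>> fresh['prop'].explode().eq(fresh['temp'])                         # index 0, 0, 1, 1 -> True, False, False, False
>>> fresh['prop'].explode().eq(fresh['temp']).groupby(level=0).any()  # index 0, 1 -> True, False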
Upvotes: 1