Cavidan

Reputation: 23

Parallelize Pandas df.iterrows() with a GPU kernel

I am writing a Python program in which I need to check whether a given value is in a column of a dataset. To do so, I iterate over each row and check equality against that column's value. This takes a lot of time, so I want to run it on the GPU. I have experience with CUDA C/C++, but not with PyCUDA for parallelizing it. Could anyone help me solve this problem?

for index, row in df.iterrows():
    s1 = set(df.iloc[index]['prop'])
    if temp in s1:
        df.iat[index, df.columns.get_loc('prop')] = 's'

Note: This is a part of my program. I want to parallelize only this part using GPU.

Thanks in advance.

Upvotes: 1

Views: 851

Answers (1)

ifly6

Reputation: 5331

The motivation for this approach is to get out of the df.iterrows paradigm, which is relatively slow. While it might be possible to split the data into a Dask dataframe and execute some kind of parallel apply, I think a vectorised approach is acceptably quick given the performance advantages of NumPy/Pandas vectorised operations (depicted below).

[Plot: runtime comparison illustrating the performance advantage of Pandas/NumPy vectorised operations over row-wise iteration]


The way I interpret this code is basically: "if the variable temp appears in the list stored in that row's prop column, set prop to 's'".

for index, row in df.iterrows():
    s1 = set(df.iloc[index]['prop'])
    if temp in s1:
        df.iat[index, df.columns.get_loc('prop')] = 's'

I construct a test dataframe:

df = pd.DataFrame({'temp': ['re'] * 7, 
                   'prop': [['re', 'a'], ['ad', 'ed'], ['see', 'contra'], ['loc', 'idx'], 
                            ['reader', 'pandas'], ['alpha', 'omega'], ['a', 'z']]})

Then explode to get all the possible combinations of temp against prop sublist elements. Within each resulting group, I aggregate with any and use this as the masking key for replacing the corresponding prop index with 's'.

>>> df['result'] = df['prop'].explode().eq(df['temp']).groupby(level=0).any()
>>> df['prop'] = df['prop'].mask(df['result'], 's')
>>> # df['prop'] = np.where(df['result'], 's', df['prop'])  # identical operation

  temp              prop  result
0   re                 s    True
1   re          [ad, ed]   False
2   re     [see, contra]   False
3   re        [loc, idx]   False
4   re  [reader, pandas]   False
5   re    [alpha, omega]   False
6   re            [a, z]   False

This answer is robust to row-by-row changes in the temp column as well as an (essentially arbitrary) number of elements in the prop sublists. That said, if your data is large, you should subset first to minimise memory usage: select only the applicable columns, then execute.
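A minimal sketch of that subsetting suggestion (the column names follow the example above; `sub` is a name introduced here for illustration):

```python
import pandas as pd

df = pd.DataFrame({'temp': ['re'] * 3,
                   'prop': [['re', 'a'], ['ad', 'ed'], ['see', 'contra']]})

# Work on a two-column copy so the explode only materialises what is needed
sub = df[['temp', 'prop']].copy()
sub['result'] = sub['prop'].explode().eq(sub['temp']).groupby(level=0).any()

# The mask's index aligns with df, so it can be applied back to the original
df['prop'] = df['prop'].mask(sub['result'], 's')
print(df['prop'].tolist())  # ['s', ['ad', 'ed'], ['see', 'contra']]
```

With many extra columns in the real dataframe, this keeps the exploded intermediate limited to the two columns that actually participate in the comparison.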

Note also that df['prop'].explode().eq(df['temp']) works because the temp column is broadcast by index against the exploded prop column.
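That index alignment can be seen directly on a tiny example (constructed here for illustration): explode repeats the original row index once per sublist element, and eq then compares each element against the temp value of its own row.

```python
import pandas as pd

df = pd.DataFrame({'temp': ['re', 're'],
                   'prop': [['re', 'a'], ['ad', 'ed']]})

exploded = df['prop'].explode()   # index 0, 0, 1, 1 — one row per sublist element
matches = exploded.eq(df['temp']) # aligned on index, so each element meets its row's temp
print(matches.groupby(level=0).any().tolist())  # [True, False]
```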

Upvotes: 1
