Pandas string subscripting does not work in modin (and related questions about converting pandas code to modin)

Question

I recently learned about modin, and am trying to convert some of my code from pandas to modin. My understanding is that modin has some operations that run faster and others that it has not optimized, so it defaults to pandas for those. Thus anything that runs in pandas should run in modin, but this does not seem to be the case.

The following code works as intended in pandas, but raises an error in modin:

#import modin.pandas as pd
import pandas as pd

dates = pd.date_range('20180101',periods=6)
pid=pd.Series(list(range(6)))
strings=pd.Series(['asdfjkl;','qwerty','zxcvbnm']*2)
frame={'id':pid,'date':dates,'strings':strings}

df=pd.DataFrame(frame)

x=2
df['first_x_string']=df['strings'].str[0:x]

print(df)

which returns:

   id       date   strings first_x_string
0   0 2018-01-01  asdfjkl;             as
1   1 2018-01-02    qwerty             qw
2   2 2018-01-03   zxcvbnm             zx
3   3 2018-01-04  asdfjkl;             as
4   4 2018-01-05    qwerty             qw
5   5 2018-01-06   zxcvbnm             zx

but when I use modin.pandas (swapping which line is commented at the start), I get the error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
 in 
      1 x=2
----> 2 df['first_x_string']=df['strings'].str[0:x]
      3 
      4 print(df)

TypeError: 'StringMethods' object is not subscriptable

I also get additional user warnings that I did not get for pandas:

UserWarning: Distributing  object. This may take some time.
UserWarning: Distributing  object. This may take some time.

My questions are:

~~How do I fix this?~~
As I look to convert code to modin, are there specific types of commands that will work in pandas but not in modin?
Do the user warnings indicate that some operations are slower in modin than pandas, so that I should be selective about what I choose to use it for?
Additionally, is it feasible (or desirable) to use modin for certain operations like read_csv() to create a dataframe, then use pandas to run operations on that dataframe, and possibly use modin again to save the dataframe? For my current processes, loading (and to a lesser degree saving) are the most intensive tasks.
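
For the struck-through question, one option that may be worth trying is `Series.str.slice`, the method form of the `.str[start:stop]` subscript in pandas. The sketch below uses plain pandas and a trimmed version of the example frame. Whether modin's `str` accessor supports it in a given version is an assumption to verify by swapping the import.

```python
# Assumption: swap this for `import modin.pandas as pd` to check the same call under modin.
import pandas as pd

strings = pd.Series(['asdfjkl;', 'qwerty', 'zxcvbnm'] * 2)
df = pd.DataFrame({'id': list(range(6)), 'strings': strings})

x = 2
# str.slice(start, stop) takes the same bounds as the .str[start:stop] subscript
df['first_x_string'] = df['strings'].str.slice(0, x)

print(df['first_x_string'].tolist())
# prints ['as', 'qw', 'zx', 'as', 'qw', 'zx']
```

Because it is an ordinary method call rather than a subscript, it does not depend on the accessor object being subscriptable, which is what the TypeError above complains about.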

#========================================

Update:

#========================================

I have figured out fixes for the specific question I asked, but would like the other (more general) questions answered. Code for alternative methods of capturing the first x characters in a string, with timing functions:

import time
x=2

tic = time.perf_counter()
# Raises TypeError under modin; uncomment only when running with pandas
#df['first_x_string']=df['strings'].str[0:x]
toc = time.perf_counter()
print(f'original completed in {toc-tic:0.4f} seconds')

tic = time.perf_counter()
df['first_x_string']=df['strings'].str.get(0)+df['strings'].str.get(1)
toc = time.perf_counter()
print(f'2x get() completed in {toc-tic:0.4f} seconds')

tic = time.perf_counter()
df['first_x_string']=[y[0:x] for y in df['strings']]
toc = time.perf_counter()
print(f'list comprehension completed in {toc-tic:0.4f} seconds')

print(df)

Running this on a dataframe 100 times the size of the example returns:

Pandas:

original completed in 0.0016 seconds
2x get() completed in 0.0020 seconds
list comprehension completed in 0.0009 seconds
      id       date   strings first_x_string
0      0 2018-01-01  asdfjkl;             as
1      1 2018-01-02    qwerty             qw
2      2 2018-01-03   zxcvbnm             zx
3      3 2018-01-04  asdfjkl;             as
4      4 2018-01-05    qwerty             qw
..   ...        ...       ...            ...
595  595 2019-08-19    qwerty             qw
596  596 2019-08-20   zxcvbnm             zx
597  597 2019-08-21  asdfjkl;             as
598  598 2019-08-22    qwerty             qw
599  599 2019-08-23   zxcvbnm             zx

[600 rows x 4 columns]

modin:

original completed in 0.0000 seconds
2x get() completed in 0.2152 seconds
list comprehension completed in 0.1667 seconds
      id       date   strings first_x_string
0      0 2018-01-01  asdfjkl;             as
1      1 2018-01-02    qwerty             qw
2      2 2018-01-03   zxcvbnm             zx
3      3 2018-01-04  asdfjkl;             as
4      4 2018-01-05    qwerty             qw
..   ...        ...       ...            ...
595  595 2019-08-19    qwerty             qw
596  596 2019-08-20   zxcvbnm             zx
597  597 2019-08-21  asdfjkl;             as
598  598 2019-08-22    qwerty             qw
599  599 2019-08-23   zxcvbnm             zx

[600 rows x 4 columns]

These comparisons seem to illustrate that modin is not always faster, which reiterates my questions about when to use modin and whether pandas and modin can be mixed and matched (or, if that is not best practice, why not).
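
Single `time.perf_counter()` measurements on 600 rows are likely to be noisy, so a comparison like the one above may be more reliable if each variant is repeated and the best run is kept. A minimal sketch in plain pandas using the standard-library `timeit` module follows. The helper names (`best_time`, `with_subscript`, `with_comprehension`) are hypothetical. Under modin the import would be swapped and the `.str[0:x]` variant dropped, since it raises the TypeError shown earlier.

```python
import timeit

import pandas as pd

x = 2
# 600 rows, matching the size of the larger test frame
strings = pd.Series(['asdfjkl;', 'qwerty', 'zxcvbnm'] * 200)


def with_subscript():
    # Original approach: subscript on the .str accessor
    return strings.str[0:x]


def with_comprehension():
    # Alternative approach: plain Python list comprehension
    return pd.Series([s[0:x] for s in strings])


def best_time(func, repeat=5, number=20):
    """Return the best per-call time in seconds over several repeats."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number


timings = {f.__name__: best_time(f) for f in (with_subscript, with_comprehension)}
for name, seconds in timings.items():
    print(f'{name}: {seconds:0.6f} seconds per call')
```

Taking the minimum over repeats is a common way to reduce the influence of background load. It does not say anything about one-time setup costs, which would need to be measured separately.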
