Python pandas functions work in shell not in script

Question

I have a pandas dataframe in which I'm trying to run some operations on a column of string values which includes some missing data being interpreted as float('nan'), equivalent to:

import pandas as pd

df = pd.DataFrame({'otherData':[1,2,3,4],'stringColumn':[float('nan'),'Random string one... ','another string..  ','a third string    ']})

DataFrame contents:

otherData    stringColumn
1            nan
2            'Random string one... '
3            'another string..  '
4            'a third string    '

I want to clean the stringColumn data of the various trailing ellipses and whitespace, and impute empty strings, i.e. '', for nan values.

To do this, I'm using code equivalent to:

df['stringColumn'] = df['stringColumn'].fillna('')
df['stringColumn'] = df['stringColumn'].str.strip()
df['stringColumn'] = df['stringColumn'].str.strip('...')
df['stringColumn'] = df['stringColumn'].str.strip('..')

The problem is that this code doesn't work when I run it in the script I've written: there are still nan values in the 'stringColumn' column, and some, but not all, of the ellipses remain. There are no warning messages. However, when I run the exact same code in the Python shell, it works, imputing '' for nan and cleaning up the strings as desired. I've tried running it in IDLE 3.5.0 and Spyder 3.2.4, with the same result.

cs95 · Accepted Answer

This works nicely for me on pandas v0.20.2, so you might want to try upgrading with

pip install --upgrade pandas

Call str.strip first, and you can do this in one str.replace call.

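# regex=True is needed on pandas 2.0+, where str.replace defaults to literal
# matching; on versions before 0.23 (such as 0.20.2), omit the argument.
df.stringColumn = df.stringColumn.fillna('')\
        .str.strip().str.replace(r'((?<=^)\.+)|(\.+(?=$))', '', regex=True)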

0                     
1    Random string one
2       another string
3       a third string
Name: stringColumn, dtype: object
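
For anyone who wants to paste this into a script and check it end to end, here is a minimal self-contained sketch of the same chain, using the sample frame from the question. It assumes pandas 0.23 or later so that the regex keyword is accepted; the intermediate name cleaned is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'otherData': [1, 2, 3, 4],
    'stringColumn': [float('nan'), 'Random string one... ',
                     'another string..  ', 'a third string    '],
})

# Fill NaN first, trim surrounding whitespace, then drop leading/trailing dots.
cleaned = (
    df['stringColumn']
    .fillna('')
    .str.strip()
    .str.replace(r'((?<=^)\.+)|(\.+(?=$))', '', regex=True)
)

# Assign the result back; none of these string methods modify df in place.
df['stringColumn'] = cleaned

print(df['stringColumn'].tolist())
# → ['', 'Random string one', 'another string', 'a third string']
```

If a script still shows nan or leftover dots after a chain like this, it may be worth checking that the result is actually assigned back to the same DataFrame object that is inspected later.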

If nan is not an actual NaN value but the literal string 'nan', just extend your regex:

((?<=^)\.+)|(\.+(?=$))|nan
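
One caveat with a bare nan alternative is that it also removes those three letters from inside longer strings (for example, 'banana' becomes 'baa'). A hedged sketch of an anchored variant that only blanks cells consisting solely of nan; the sample Series is made up for illustration, and pandas 0.23+ is assumed for the regex keyword:

```python
import re

import pandas as pd

# The unanchored pattern eats 'nan' wherever it appears.
unanchored = r'((?<=^)\.+)|(\.+(?=$))|nan'
print(re.sub(unanchored, '', 'banana'))  # → baa

# Anchoring with ^...$ restricts the match to cells that are exactly 'nan'.
anchored = r'^nan$|^\.+|\.+$'

s = pd.Series(['nan', 'banana...', '..another string'])  # hypothetical data
cleaned = s.str.strip().str.replace(anchored, '', regex=True)

print(cleaned.tolist())
# → ['', 'banana', 'another string']
```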

Regex Details

(
(?<=^)    # lookbehind for start of string
\.+       # one or more '.'
)
|         # regex OR
(
\.+       # one or more '.'
(?=$)     # lookahead for end of string
)

The regex looks for leading or trailing dots (one or more) and removes them.
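
To see the pattern's behaviour outside pandas, here is a small sketch using only the standard re module. It also checks, on a handful of made-up sample strings, that the lookaround form gives the same result as the plainer ^\.+|\.+$, since a lookaround wrapped around a zero-width anchor asserts the same thing as the anchor itself.

```python
import re

original = re.compile(r'((?<=^)\.+)|(\.+(?=$))')
simpler = re.compile(r'^\.+|\.+$')

# Illustrative inputs: trailing dots, leading dots, interior dots, only dots, empty.
samples = ['Random string one...', '..another string', 'a.b.c', '...', '']

results = {}
for text in samples:
    stripped = original.sub('', text)
    # Both spellings should agree on every sample.
    assert stripped == simpler.sub('', text)
    results[text] = stripped
    print(repr(text), '->', repr(stripped))
```

Dots in the middle of a string ('a.b.c') are left alone, because neither alternative can match away from the start or end.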
