Cole Robertson

Reputation: 649

Python pandas functions work in shell not in script

I have a pandas dataframe in which I'm trying to run some operations on a column of string values which includes some missing data being interpreted as float('nan'), equivalent to:

import pandas as pd
df = pd.DataFrame({'otherData':[1,2,3,4],'stringColumn':[float('nan'),'Random string one... ','another string..  ','a third string    ']})


DataFrame contents:

otherData    stringColumn
1            nan
2            'Random string one... '
3            'another string..  '
4            'a third string    '

I want to clean the stringColumn data of the various trailing ellipses and whitespace, and impute empty strings, i.e. '', for nan values.

To do this, I'm using code equivalent to:

df['stringColumn'] = df['stringColumn'].fillna('')
df['stringColumn'] = df['stringColumn'].str.strip()
df['stringColumn'] = df['stringColumn'].str.strip('...')
df['stringColumn'] = df['stringColumn'].str.strip('..')

The problem is that when I run this code as part of my script, it doesn't work: there are still nan values in 'stringColumn', and some, but not all, of the ellipses remain. There are no warning messages. However, when I run the exact same code in the Python shell, it works, imputing '' for nan and cleaning up the strings as desired. I've tried running it in IDLE 3.5.0 and Spyder 3.2.4, with the same result.

Upvotes: 0

Views: 223

Answers (2)

tdube

Reputation: 2553

Your code works for me as well with pandas==0.20.1.

You can also do this as a one-liner without regexes. The strip() method accepts a chars argument: a set of characters to remove from both ends of the string.

df['stringColumn'] = df['stringColumn'].fillna('').str.strip('. ')

Docstring for strip():

S.strip([chars]) -> str

Return a copy of the string S with leading and trailing
whitespace removed.
If chars is given and not None, remove characters in chars instead.
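
To make the chars behaviour concrete, here is a small sketch on plain Python strings (the sample values are only illustrative). It suggests that chars is treated as a set of characters rather than a literal substring, so strip('...') behaves like strip('.'), and that a trailing space stops the dots from being removed unless the space is in the set as well.

```python
# chars is a set of characters to strip, not a substring to match.
sample = 'Random string one... '

# Only '.' is in the set, and the string ends with a space: nothing is removed.
dots_only = sample.strip('...')

# Both '.' and ' ' are in the set: any mix of them is removed from both ends.
dots_and_spaces = sample.strip('. ')

# Characters in the middle of the string are never touched.
inner_kept = '..a.b..'.strip('.')

print(repr(dots_only))
print(repr(dots_and_spaces))
print(repr(inner_kept))
```

This is why the one-liner above passes '. ' rather than chaining separate strip calls.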

Upvotes: 0

cs95

Reputation: 402902

This works nicely for me on pandas v0.20.2, so you might want to try upgrading with

pip install --upgrade pandas
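
Before upgrading, it may be worth checking whether the shell and the script are actually using the same interpreter and pandas installation. A minimal sketch, assuming you run it once from the script and once from the shell and compare the output:

```python
import sys

import pandas as pd

# Path of the Python interpreter running this code.
interpreter_path = sys.executable

# Version of the pandas package that this interpreter imported.
pandas_version = pd.__version__

print(interpreter_path)
print(pandas_version)
```

If the two runs print different values, the difference in behaviour could come from the environment rather than from the code itself.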

Call str.strip first, and you can do this in one str.replace call.

df.stringColumn = df.stringColumn.fillna('')\
        .str.strip().str.replace(r'((?<=^)\.+)|(\.+(?=$))', '')

0                     
1    Random string one
2       another string
3       a third string
Name: stringColumn, dtype: object
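
On more recent pandas releases, str.replace accepts a regex keyword, and from pandas 2.0 onward the pattern is treated as a literal string unless regex=True is passed. The following is a sketch of the same call under that assumption; it will not run on the 0.20.x versions mentioned in this thread, which predate the keyword.

```python
import pandas as pd

# Sample data taken from the question.
df = pd.DataFrame({
    'otherData': [1, 2, 3, 4],
    'stringColumn': [float('nan'), 'Random string one... ',
                     'another string..  ', 'a third string    '],
})

# Leading or trailing runs of dots.
pattern = r'((?<=^)\.+)|(\.+(?=$))'

cleaned = (
    df['stringColumn']
    .fillna('')                             # NaN -> empty string
    .str.strip()                            # drop surrounding whitespace
    .str.replace(pattern, '', regex=True)   # drop leading/trailing dots
)

print(cleaned.tolist())
```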

If nan is not an actual NaN value but the string 'nan', just modify your regex:

((?<=^)\.+)|(\.+(?=$))|nan
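
One caveat with the unanchored nan alternative: it matches the letters "nan" anywhere in a value, not just values that are exactly 'nan'. The sketch below uses Python's re module directly to illustrate this; the anchored variant ^nan$ is my own suggestion, not part of the original pattern.

```python
import re

# Pattern as given above: 'nan' may match anywhere in the string.
unanchored = re.compile(r'((?<=^)\.+)|(\.+(?=$))|nan')

# Hypothetical stricter variant: only a value that is exactly 'nan' matches.
anchored = re.compile(r'((?<=^)\.+)|(\.+(?=$))|^nan$')

unanchored_word = unanchored.sub('', 'banana')   # inner 'nan' is removed too
anchored_word = anchored.sub('', 'banana')       # left untouched
anchored_nan = anchored.sub('', 'nan')           # exact 'nan' becomes ''

print(repr(unanchored_word))
print(repr(anchored_word))
print(repr(anchored_nan))
```

Whether the stricter form is needed depends on whether real values in the column can contain "nan" as part of a longer word.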

Regex Details

(
(?<=^)    # lookbehind for start of string
\.+       # one or more '.'
)
|         # regex OR
(
\.+       # one or more '.'
(?=$)     # lookahead for end of string
)

The regex looks for leading or trailing dots (one or more) and removes them.
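
As a quick check of that description, here is a small sketch that applies the same pattern with Python's re module; the helper name strip_edge_dots and the sample strings are made up for illustration.

```python
import re

# Same pattern as above: dots at the very start or very end of the string.
EDGE_DOTS = re.compile(r'((?<=^)\.+)|(\.+(?=$))')


def strip_edge_dots(text):
    """Remove runs of '.' from the start and end of text (hypothetical helper)."""
    return EDGE_DOTS.sub('', text)


leading = strip_edge_dots('...leading')
trailing = strip_edge_dots('trailing..')
interior = strip_edge_dots('v1.2.3')      # dots inside the string are kept
only_dots = strip_edge_dots('...')

print(repr(leading), repr(trailing), repr(interior), repr(only_dots))
```

Because the pattern does not match whitespace, a value such as 'text... ' would keep its dots, which is why the answer calls str.strip first.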

Upvotes: 1
