Reputation: 649
I have a pandas DataFrame in which I'm trying to run some operations on a column of string values. The column includes some missing data, which is interpreted as float('nan'), equivalent to:
df = pd.DataFrame({'otherData':[1,2,3,4],'stringColumn':[float('nan'),'Random string one... ','another string.. ','a third string ']})
DataFrame contents:
otherData stringColumn
1 nan
2 'Random string one... '
3 'another string.. '
4 ' a third string '
I want to clean the stringColumn data of the various trailing ellipses and whitespace, and impute empty strings, i.e. '', for nan values.
To do this, I'm using code equivalent to:
df['stringColumn'] = df['stringColumn'].fillna('')
df['stringColumn'] = df['stringColumn'].str.strip()
df['stringColumn'] = df['stringColumn'].str.strip('...')
df['stringColumn'] = df['stringColumn'].str.strip('..')
The problem I'm encountering is that when I run this code in the script I've written, it doesn't work. There are still nan values in my 'stringColumn' column, and there are still some, but not all, ellipses. There are no warning messages. However, when I run the exact same code in the Python shell, it works, imputing '' for nan and cleaning up as desired. I've tried running it in IDLE 3.5.0 and Spyder 3.2.4, with the same result.
Upvotes: 0
Views: 223
Reputation: 2553
Your code works for me as well with pandas==0.20.1.
You can also do this as a one-liner without regexes. The strip() method supports a chars argument: a set of characters to remove from both ends of the string.
df['stringColumn'] = df['stringColumn'].fillna('').str.strip('. ')
Docstring for strip():
S.strip([chars]) -> str
Return a copy of the string S with leading and trailing
whitespace removed.
If chars is given and not None, remove characters in chars instead.
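As a small illustration of the docstring above, here is a plain-Python sketch (no pandas needed) showing that chars is treated as a set of characters rather than a literal substring. The sample strings and variable names are made up for the example.

```python
# chars is a set of characters: any mix of '.' and ' ' is stripped
# from both ends, in whatever order they appear.
sample = ' ..Random string one... '
stripped = sample.strip('. ')
print(repr(stripped))

# Because chars is a set, strip('...') behaves the same as strip('.').
same_result = 'a third string..'.strip('...') == 'a third string..'.strip('.')
print(same_result)

# Characters in the middle of the string are never touched.
inner = 'one. two...'.strip('. ')
print(repr(inner))
```

Note that this also removes a single trailing period, so it only fits if no trailing dots need to be kept.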
Upvotes: 0
Reputation: 402902
This works nicely for me on pandas v0.20.2, so you might want to try upgrading with:
pip install --upgrade pandas
Call str.strip first, and then you can remove the dots in one str.replace call.
df.stringColumn = df.stringColumn.fillna('')\
.str.strip().str.replace(r'((?<=^)\.+)|(\.+(?=$))', '')
0
1 Random string one
2 another string
3 a third string
Name: stringColumn, dtype: object
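On more recent pandas releases the same idea may need an explicit regex=True. As far as I know, the regex keyword of str.replace was added in pandas 0.23 and its default changed to False in 2.0, so treat the sketch below as an assumption about your installed version. It also swaps in a simpler pattern, ^\.+|\.+$, which should match the same leading and trailing dots as the lookaround version.

```python
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame({
    'otherData': [1, 2, 3, 4],
    'stringColumn': [float('nan'), 'Random string one... ',
                     'another string.. ', 'a third string '],
})

# regex=True is spelled out because newer pandas treats the pattern
# as a literal string unless told otherwise.
cleaned = (
    df['stringColumn']
    .fillna('')                                  # NaN -> ''
    .str.strip()                                 # drop surrounding whitespace
    .str.replace(r'^\.+|\.+$', '', regex=True)   # drop leading/trailing dots
)
print(cleaned.tolist())
```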
If the nan values are not actual NaN floats but the string 'nan', just modify your regex:
((?<=^)\.+)|(\.+(?=$))|nan
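One caveat worth checking before using this: an unanchored nan alternative matches anywhere in the string, not only in cells that are exactly 'nan'. The sketch below uses Python's re module directly, with simplified patterns and made-up sample strings, to show the difference. The anchored variant ^nan$ is my suggestion, not part of the original answer.

```python
import re

# Simplified forms of the pattern: leading dots, trailing dots, or 'nan'.
loose = re.compile(r'^\.+|\.+$|nan')      # 'nan' may match inside a word
strict = re.compile(r'^\.+|\.+$|^nan$')   # 'nan' only as the whole string

loose_word = loose.sub('', 'banana...')
strict_word = strict.sub('', 'banana...')
strict_nan = strict.sub('', 'nan')

print(repr(loose_word))   # the inner 'nan' of 'banana' is removed too
print(repr(strict_word))  # only the trailing dots are removed
print(repr(strict_nan))   # a cell that is exactly 'nan' becomes empty
```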
Regex Details
(
    (?<=^)   # lookbehind for start of string
    \.+      # one or more '.'
)
|            # regex OR
(
    \.+      # one or more '.'
    (?=$)    # lookahead for end of string
)
The regex looks for leading or trailing dots (one or more) and removes them.
Upvotes: 1