metersk

Reputation: 12529

Issue with string cleanup in Pandas

I have a pandas column whose values are words that are wrapped in double quotes, wrapped in brackets, or not wrapped at all, like this:

"cxxx"
[asdfasd]
asdfasdf
[asdf]
"asdf"

My problem is that the code below strips the first and last characters from the elements that don't have quotes or brackets, and I'm not sure why.

def keyword_cleanup(x):
    if "\"" or "[" in x:
        return x[1:-1]
    else:
        return x


csv["Keyword"] = csv["Keyword"].apply(keyword_cleanup)

Upvotes: 1

Views: 574

Answers (1)

unutbu

Reputation: 880547

if "\"" or "[" in x:

should be

if "\"" in x or "[" in x:    # x must contain a left bracket or double-quote.

or

if x.startswith(('"', '[')): # x must start with a left bracket or double-quote

since Python parses the original condition as

if ("\"") or ("[" in x):

due to the in operator binding more tightly than or. (See Python operator precedence.)

Since any non-empty string such as "\"" is truthy, the or expression short-circuits on it and never evaluates "[" in x. The if-statement's condition is therefore always truthy, and that is why keyword_cleanup was always returning x[1:-1].
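As a quick check of the explanation above, here is a minimal sketch that runs the original condition and a corrected one side by side. The function names keyword_cleanup_buggy and keyword_cleanup_fixed and the sample list are made up for this illustration.

```python
def keyword_cleanup_buggy(x):
    # Parsed as ("\"") or ("[" in x). The left operand is a non-empty
    # string, so the condition is truthy for every input.
    if "\"" or "[" in x:
        return x[1:-1]
    return x


def keyword_cleanup_fixed(x):
    # Only trim when x actually starts with a double quote or a left bracket.
    if x.startswith(('"', '[')):
        return x[1:-1]
    return x


samples = ['"cxxx"', '[asdfasd]', 'asdfasdf']
buggy = [keyword_cleanup_buggy(s) for s in samples]
fixed = [keyword_cleanup_fixed(s) for s in samples]

print(buggy)  # the unwrapped word loses its first and last characters
print(fixed)  # the unwrapped word is left alone
```

The buggy version turns 'asdfasdf' into 'sdfasd', which matches the behaviour described in the question, while the fixed version returns it unchanged.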


Also note that pandas has built-in string methods, available through the .str accessor. Using them is more concise, and often faster, than using apply to call a custom Python function for each item in the Series.

In [136]: s = pd.Series(['"cxxx"', '[asdfasd]', 'asdfasdf', '[asdf]', '"asdf"'])

In [137]: s.str.replace(r'^["[](.*)[]"]$', r'\1', regex=True)  # regex=True is required in pandas 2.0+, where the default is a literal match
Out[137]: 
0        cxxx
1     asdfasd
2    asdfasdf
3        asdf
4        asdf
dtype: object

If you want to strip all brackets or double quotes from both ends of each string, you could instead use

In [144]: s.str.strip('["]')
Out[144]: 
0        cxxx
1     asdfasd
2    asdfasdf
3        asdf
4        asdf
dtype: object
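The two approaches give the same result on the sample data but differ on other inputs. The sketch below shows this in plain Python, on the assumption that Series.str.replace with regex=True behaves like re.sub on each element and Series.str.strip behaves like str.strip on each element. The helper names and sample strings are hypothetical.

```python
import re

# Same pattern as the str.replace call: one opening character, the body,
# and one closing character, anchored to both ends of the string.
PATTERN = r'^["[](.*)[]"]$'


def strip_one_pair(text):
    # Removes exactly one leading and one trailing delimiter, and only
    # when both are present.
    return re.sub(PATTERN, r'\1', text)


def strip_all(text):
    # Removes every leading and trailing '[', '"' or ']' character,
    # whether or not they come in matching pairs.
    return text.strip('["]')


samples = ['[[nested]]', '"open', 'plain']
one_pair = [strip_one_pair(s) for s in samples]
all_ends = [strip_all(s) for s in samples]

print(one_pair)
print(all_ends)
```

With these inputs the regex version keeps the inner brackets of '[[nested]]' and leaves the unbalanced '"open' untouched, whereas strip removes every delimiter character it finds at either end.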

Upvotes: 3
