pandas invalid escape sequence after update

Question

I am parsing a CSV with multi-character delimiters in pandas as follows:

big_df = pd.read_csv(os.path.expanduser('~/path/to/csv/with/special/delimiters.csv'), 
                     encoding='utf8', 
                     sep='\$\$><\$\$', 
                     decimal=',', 
                     engine='python')
big_df.iloc[:, -1] = big_df.iloc[:, -1].str.replace('\$\$>$', '')
big_df = big_df.replace(['^<', '>$'], ['', ''], regex=True)

big_df.columns = big_df.columns.to_series().replace(['^<', '>$', '>\$\$'], ['', '', ''], regex=True)

This worked fine until I recently upgraded my pandas installation. Now I see a lot of deprecation warnings:

:3: DeprecationWarning: invalid escape sequence \$
:3: DeprecationWarning: invalid escape sequence \$
:3: DeprecationWarning: invalid escape sequence \$
:3: DeprecationWarning: invalid escape sequence \$
:3: DeprecationWarning: invalid escape sequence \$
:3: DeprecationWarning: invalid escape sequence \$
  sep='\$\$><\$\$',
:7: DeprecationWarning: invalid escape sequence \$
  big_df.iloc[:, -1] = big_df.iloc[:, -1].str.replace('\$\$>$', '')

As I need the special delimiters with the $ symbols, I am unsure how to properly handle these warnings.

Andras Deak -- Слава Україні · Accepted Answer

The problem is that escaping in strings can interfere with escaping in regular expressions. While '\s' is a valid regex token, to Python it looks like an escape sequence that doesn't exist (the string literal '\s' automatically gets converted to '\\s', i.e. r'\s', and I suspect that this process is what's been deprecated, apparently since Python 3.6).

The point is to always use raw string literals when constructing regular expressions, in order to make sure that python doesn't get confused by the backslashes. While most frameworks used to handle this ambiguity just fine (I assume by ignoring invalid escape sequences), apparently newer versions of certain libraries are trying to force programmers to be explicit and unambiguous (which I fully support).
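To see where the warning comes from, here is a minimal sketch using only the standard library. It compiles a one-line piece of source text containing the pattern from the question and records whatever warnings are emitted. The variable names and the "<example>" filename are just illustrative. As far as I know the category is DeprecationWarning on Python 3.6 to 3.11 and SyntaxWarning from 3.12 on, so the sketch prints the category instead of assuming one.

```python
import warnings

# Source text of a tiny module whose string literal contains "\$",
# which is not a recognized Python escape sequence.
source = r"sep = '\$\$><\$\$'"

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure default filters do not hide it
    code = compile(source, "<example>", "exec")

# Run the compiled code to inspect the resulting string.
namespace = {}
exec(code, namespace)

for w in caught:
    print(w.category.__name__, "-", w.message)

# The backslashes survive, so the value equals the raw-string version.
print(namespace["sep"] == r"\$\$><\$\$")
```

The warning is raised while Python compiles the string literal, before pandas ever sees the pattern, which is why switching to a raw string silences it.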

In your specific case, your patterns should be changed from, say, '\$\$><\$\$' to r'\$\$><\$\$':

big_df.iloc[:, -1] = big_df.iloc[:, -1].str.replace(r'\$\$>$', '')

What actually happens is that the backslashes themselves have to be escaped for Python, in order to have a literal length-2 '\$' string in your regex pattern:

>>> r'\$\$><\$\$'
'\\$\\$><\\$\\$'
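As a quick check of the equivalence, the following sketch compares the raw literal with the manually escaped one and tries the pattern on a made-up line. The sample line is hypothetical (I don't know what the real file looks like), and re.escape is shown only as an alternative way to build the pattern from the literal delimiter.

```python
import re

raw = r'\$\$><\$\$'          # raw string literal: backslashes kept as typed
escaped = '\\$\\$><\\$\\$'   # same pattern with each backslash escaped by hand
same_pattern = (raw == escaped)

# Alternative: let re.escape build the pattern from the literal delimiter.
auto = re.escape('$$><$$')

# Hypothetical sample line, only for illustration.
line = '<a>$$><$$<b>$$><$$<c>$$>'

fields_raw = re.split(raw, line)
fields_auto = re.split(auto, line)

# Strip the trailing "$$>" from the last field, as in the question.
last = re.sub(r'\$\$>$', '', fields_raw[-1])

print(same_pattern)   # both literals produce the same 10-character string
print(fields_raw)
print(last)
```

Either spelling gives the regex engine the same pattern; the raw string is just easier to read and avoids the invalid escape warning.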
