Sam B.
Sam B.

Reputation: 3043

python/pandas: using regular expressions remove anything in square brackets in string

Working from a pandas dataframe trying to sanitize a column from something like $12,342 to 12342 and make the column into an int or float. Found one row though with 736[4] so I have to remove everything within the square brackets, brackets included.

Code so far

df2['Average Monthly Wage $'] = df2['Average Monthly Wage $'].str.replace('$','')
df2['Average Monthly Wage $'] = df2['Average Monthly Wage $'].str.replace(',','')
df2['Average Monthly Wage $'] = df2['Average Monthly Wage $'].str.replace(' ','')

The line below is what's supposed to handle and remove the square brackets and intentionally with it's content too.

df2['Average Monthly Wage $'] = df2['Average Monthly Wage $'].str.replace(r'[[^]]*\)','')

To some dev's this is trivial but I've not really used regular expressions often enough to know this and I've also checked around and from one such stack example formulated the above.

Upvotes: 4

Views: 1379

Answers (2)

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 627607

You need to use

df2['Average Monthly Wage $'] = df2['Average Monthly Wage $'].str.replace(r'\[[^][]*]','',regex=True)

NOTE: you must use regex=True if you are using a regex replacement with Series.str.replace.

If you plan to also remove any whitespace before the initial [, you may add \s* at the start of the pattern and use

df2['Average Monthly Wage $'] = df2['Average Monthly Wage $'].str.replace(r'\s*\[[^][]*]','',regex=True)

See the regex demo.

Details:

  • \s* - zero or more whitespaces
  • \[ - a [ char
  • [^][]* - zero or more chars other than [ and ]
  • ] - a ] char

TIP: If you want to keep texts inside brackets that are not digits only, i.e. if you only want to remove [123], [1], etc. but keep [Fig 1], [notes], etc., you should use \[\d+] or \s*\[\d+].

Upvotes: 0

jezrael
jezrael

Reputation: 863801

I think you need:

df2 = pd.DataFrame({'Average Monthly Wage $': ['736[4]','7336[445]', '[4]345[5]']})
print (df2)
  Average Monthly Wage $
0                 736[4]
1              7336[445]
2              [4]345[5]

df2['Average Monthly Wage $'] = df2['Average Monthly Wage $'].str.replace(r'\[.*?\]','')
print (df2)
  Average Monthly Wage $
0                    736
1                   7336
2                    345

regex101.

Upvotes: 3

Related Questions