stephen.m
stephen.m

Reputation: 149

Cleanup with 'replace' not working as intended - Pandas

I've tried to apply the 'replace' method on this list in a dataframe:

df=pd.DataFrame({'DC': ['DOUBLE CHANCE\n1 OR X\n2.50\nX OR 2\n1.12\n1 OR 2\n1.20']})

Desired result: 2.50\n1.12\n1.20

Looking for suggestion to make the cleanup work either 'replace' method or 'regex'

import pandas as pd

df=pd.DataFrame({'DC': ['DOUBLE CHANCE\n1 OR X\n2.50\nX OR 2\n1.12\n1 OR 2\n1.20']})

df = df['DC']['Double_Chance'].str.replace(r'([^\d\.\n])','').str.replace(r'1\n','').str.replace(r'2\n','').str.replace(r'12\n','').str.strip()
df
0       2.50\n1.111.20
Name: Score, dtype: object

Upvotes: 1

Views: 43

Answers (1)

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 627082

You can use Series.str.replace:

df['Double_Chance'] = df['DC'].str.replace(r'(?m)^(?!\d+\.\d+$).*\n*', '')

Or, you may use Series.str.findall:

df['Double_Chance'] = df['DC'].str.findall(r'(?m)^\d+\.\d+$').str.join("\n")

Both produce 2.50\n1.12\n1.20.

See the regex demo. Details:

  • (?m) - the re.M option that makes ^ match start of each line
  • ^ - start of a line
  • (?!\d+\.\d+$) - fail the match if the line is a float number
  • .* - zero or more chars other than line break chars, as many as possible
  • \n* - zero or more line feed chars.

Upvotes: 1

Related Questions