Prolle

Reputation: 358

Cannot parse strings correctly to remove special characters

I have one column of a df, which contains strings, which I wish to parse:

import pandas as pd

df = pd.DataFrame({'name':'apple banana orange'.split(), 'size':"2'20 12:00 456".split()})

which gives a size column holding the strings 2'20, 12:00 and 456.

I wish to remove all ' characters, remove the :\d\d part and keep the pure integers, so that the size column ends up as 220, 12 and 456.

I have tried to extract the integers before the ':' and to fill the resulting NaN values with the original data. This works for the first row (the original data is preserved) and for the second row (the :00 is correctly removed), but for the last row it somehow inserts the data of the first row. My code is

df['size'] = df['size'].str.extract('(\d*):').fillna(df['size'])


Upvotes: 1

Views: 153

Answers (3)

MDR

Reputation: 2670

Try this...

df['size'] = df['size'].str.replace(r"'", '').str.replace(r'((\d{2}):\d{2})', r'\2', regex=True)

Outputs:

    name    size
0   apple   220
1   banana  12
2   orange  456
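
The two chained replacements above could probably be collapsed into one pass. The sketch below is not part of the original answer: it assumes the unwanted suffix is always a colon followed by exactly two digits, and it uses a single alternation pattern, '|:\d{2}, to drop both the apostrophes and that suffix.

```python
import pandas as pd

df = pd.DataFrame({
    "name": "apple banana orange".split(),
    "size": "2'20 12:00 456".split(),
})

# One regex pass: delete every apostrophe and every ":dd" group.
cleaned = df["size"].str.replace(r"'|:\d{2}", "", regex=True)
print(cleaned.tolist())
```

The values remain strings; append .astype(int) if real integers are needed.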

Upvotes: 0

user15756255

Reputation:

Correct me if I am wrong, but can't you do .replace('character', '')?
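
One caveat worth illustrating, as a small sketch using the sample values from the question: on a pandas Series, plain .replace compares whole cell values, while .str.replace works on substrings. The variable names below are only illustrative.

```python
import pandas as pd

s = pd.Series(["2'20", "12:00", "456"])

# Series.replace matches entire cell values, so a lone apostrophe matches nothing.
whole_value = s.replace("'", "")

# Series.str.replace operates on substrings inside each string.
substring = s.str.replace("'", "", regex=False)

print(whole_value.tolist())
print(substring.tolist())
```

So the suggestion works once it is spelled .str.replace (or .replace with regex=True).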

Upvotes: 0

Henrik Bo

Reputation: 433

If you only need to handle the ' and the : in the timestamp, this will do the job:

df["size"] = df["size"].str.replace("'", "").str.split(":").map(lambda x: x[0])

Output:

     name size
0   apple  220
1  banana   12
2  orange  456
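
As for why the attempt in the question copies the first row into the last one, a likely explanation (my reading of pandas behaviour, not something stated in the thread) is that str.extract returns a DataFrame by default, and DataFrame.fillna given a Series looks up fill values by column label rather than by row. The single extracted column is labelled 0, so every NaN is filled with the value at index 0 of df['size']. A minimal sketch of the same approach with expand=False, which returns a Series so that fillna aligns row by row:

```python
import pandas as pd

df = pd.DataFrame({
    "name": "apple banana orange".split(),
    "size": "2'20 12:00 456".split(),
})

# expand=False makes extract return a Series, so fillna aligns on the row index.
before_colon = df["size"].str.extract(r"(\d*):", expand=False)

# Rows without a colon fall back to their own original value, then lose the apostrophe.
fixed = before_colon.fillna(df["size"]).str.replace("'", "", regex=False)
print(fixed.tolist())
```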

Upvotes: 1
