Toi
Toi

Reputation: 137

Regex not working properly for some cases (python)?

I have a data frame where one column has string values and the other has integers but those columns have special characters with it or the string data has integers with it. So to remove it I used regex my regex is working fine but for the integer column, if 'abc123' is then it is not removing the abc and same with string column if '123abc' is there then it is not removing it. I don't know if the pattern or is wrong or the code is wrong. Below is my code,

d = [['abc','123'],['gbc@*','23abc'],['abc123','abc200'],['124abc','1230&*'],['abcer£$%&*!','230!?*&']]
df1= pd.DataFrame(d, columns=['str','int'])
print(df1)

   str     int
0   abc    123
1   gbc@*   23abc
2   abc123  abc200
3   124abc  1230&*
4   abcer£$%&*! 230!?*&

num = r'\d+$'
alpha = r'[a-zA-Z]+$'
wrong = df1[~df1['int'].str.contains(num, na=True)]
correct_int = [re.sub(r'([^\d]+?)', '', item) for item in wrong['int']]
print(correct_int)
wrong_str = df1[~df1['str'].str.contains(alpha, na=True)]
correct_str = [re.sub(r'([^a-zA-Z ]+?)', '', item) for item in df1['str']]
print(correct_str)

    

Output:

correct_int: ['23', '1230', '230']

As you can see it removed for '23abc','1230&*','230!?*&' but not for 'abc200' as the string was coming first

correct_str: ['abc', 'gbc', 'abc', 'abc', 'abcer']

now it removed for all but sometimes it's not removing when the value is '124abc'

Is my pattern wrong? I have also tried giving different patterns but nothing worked

I am removing the integers and special characters in the column 'str' and removing string values and special characters in column 'int' Expected output: Once after cleaning and replacing with the old with the cleaned values the output would look like this.

    str     int
0   abc     123
1   gbc     23
2   abc     200
3   abc     1230
4   abcer   230

Upvotes: 0

Views: 68

Answers (2)

Abhi_J
Abhi_J

Reputation: 2129

Try the following:

  1. '\D' represents any non digit value, substitute those with empty string '' in int column
  2. [^a-zA-Z] represents any character not in the range a-z and A-Z, substitute those with empty string '' in str column
  3. Apply these transformations to both columns using .apply() and a lambda function
import pandas as pd
import re

d = [['abc','123'],['gbc@*','23abc'],['abc123','abc200'],['124abc','1230&*'],['abcer£$%&*!','230!?*&']]
df1= pd.DataFrame(d, columns=['str','int'])
df1['int'] = df1['int'].apply(lambda r: re.sub('\D', '', r))
df1['str'] = df1['str'].apply(lambda r: re.sub('[^a-zA-Z]', '', r))
print(df1)

Output:

     str   int
0    abc   123
1    gbc    23
2    abc   200
3    abc  1230
4  abcer   230

Upvotes: 1

m13op22
m13op22

Reputation: 2337

You can do it with

df1['str'] = df1['str'].str.replace(r"[\d\W+]", '') # replaces numbers (\d) and non-word characters (\W) with empty strings

df1['int'] = df1['int'].str.replace(r"\D+", '') # replaces any non-decimal digit character (like [^0-9])

Returns:

    str     int
0   abc     123
1   gbc     23
2   abc     200
3   abc     1230
4   abcer   230

Upvotes: 2

Related Questions