milowang

Reputation: 101

Use regex in Python to rule out strings

I'm using pandas to clean some data, as shown below:

import pandas as pd

s3 = pd.DataFrame({'title': ["intermediate", "Basmati/sadri", "temperate japonica", "Temperate japonica", "Japonica", "Tropical japonica", "Aromatic (basmati/sandri type", "indica", "Aus/boro", "Aus", "aus", "japonica", "tropical japnica", "", "Indica", "Intermediate type"]})

# Normalize every japonica variant (including the "japnica" misspelling) to "japonica"
s3.title.replace(r".*[Jj]ap(o)?nica$", "japonica", inplace=True, regex=True)

# Normalize capitalized "Indica" to "indica"
s3.title.replace(r"Indica", "indica", inplace=True, regex=True)

print(s3)

And I got:

                        title
0                    intermediate
1                   Basmati/sadri
2                        japonica
3                        japonica
4                        japonica
5                        japonica
6   Aromatic (basmati/sandri type
7                          indica
8                        Aus/boro
9                             Aus
10                            aus
11                       japonica
12                       japonica
13                               
14                         indica
15              Intermediate type

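Now I want to replace every value that is not exactly 'japonica' or 'indica' with 'others', something like:

if string not in ['japonica', 'indica']:
    string = 'others'

How can I do this with a regex, along the lines of: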

s3.title.replace(r"some regex", "others" ,inplace=True,regex=True)

Upvotes: 2

Views: 103

Answers (1)

2Cubed

Reputation: 3551

The following should work. It combines three alternatives with the alternation (|) operator:

  • a negative lookahead, which matches any non-empty title that does not start with japonica or indica.
  • a second alternative, which matches a title that does start with japonica or indica but has further characters after it, so the title is not exactly japonica or indica.
  • a third alternative, which matches the empty string.

    s3.title.replace(r'^(?!japonica|indica).+$|^(japonica|indica).+$|^$', "others", inplace=True, regex=True)
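
As a quick sanity check, the pattern can be exercised outside pandas with the standard re module. This is only a sketch: the normalize helper, the sample list and the shorter SIMPLER pattern are illustrative names added here, not part of the code above, and it assumes pandas substitutes on each string much as re.sub does. Note that the match is case-sensitive, so it relies on the earlier clean-up having already lowercased "Indica" and "Japonica".

```python
import re

# Pattern from the answer: matches any title except exactly
# "japonica" or "indica" (the empty string also matches).
PATTERN = re.compile(r'^(?!japonica|indica).+$|^(japonica|indica).+$|^$')

# Hypothetical shorter variant (an assumption, not from the answer):
# one negative lookahead that rejects only the two exact words.
SIMPLER = re.compile(r'^(?!(?:japonica|indica)$).*$')


def normalize(title):
    """Return 'others' unless title is exactly 'japonica' or 'indica'."""
    return PATTERN.sub("others", title)


samples = ["japonica", "indica", "intermediate", "Aus/boro", "",
           "indica type", "japonicas"]

for title in samples:
    # Print both results side by side so the two patterns can be compared.
    print(repr(title), "->", repr(normalize(title)),
          "/", repr(SIMPLER.sub("others", title)))
```

Only the first two samples should come through unchanged; every other title, including the empty one, becomes "others" under both patterns.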
    

Upvotes: 1
