Satya

Reputation: 5907

Python: strange str.contains behaviour

I have a DataFrame named df, loaded with df = pd.read_csv('my.csv'):

    CUSTOMER_MAILID                       EVENT_GENRE       EVENT_LANGUAGE  
0   [email protected]                    |ROMANCE|          Hindi   
1   [email protected]                      |DRAMA|          TAMIL   
2        [email protected]                    |ROMANCE|          Hindi   
3   [email protected]                      |DRAMA|          Hindi   
4          [email protected]    |ACTION|ADVENTURE|SCI-FI|        English   
5   [email protected]    |ACTION|ADVENTURE|COMEDY|        English   
6       [email protected]                     |ACTION|          Hindi   
7        [email protected]                      |DRAMA|          Hindi   
8       [email protected]     |FANTASY|HORROR|ROMANCE|        English   
9   [email protected]  |ACTION|ADVENTURE|THRILLER|        English   
10        [email protected]                      |DRAMA|          Hindi   
11  [email protected]           |ROMANCE|THRILLER|        KANNADA   
12  [email protected]                      |DRAMA|          Hindi   
13  [email protected]     |ACTION|ADVENTURE|DRAMA|        English   
14      [email protected]     |ACTION|ADVENTURE|DRAMA|         TELUGU   
15  [email protected]               |BIOPIC|DRAMA|          Hindi   
16    [email protected]            |HORROR|THRILLER|          Hindi   
17    [email protected]     |ACTION|COMEDY|THRILLER|           ODIA   
18  [email protected]    |ACTION|ADVENTURE|SCI-FI|        English   
19    [email protected]                    |ROMANCE|          Hindi   

But when querying it, I found a discrepancy: str.contains did not return the expected output.

 d = df.query((df['EVENT_GENRE'].str.contains('|ROMANCE|')) & (df['EVENT_LANGUAGE'] == 'Hindi'))
 d
 Out[53]: 
     CUSTOMER_MAILID        EVENT_GENRE EVENT_LANGUAGE
 0   [email protected]          |ROMANCE|          Hindi
 2        [email protected]          |ROMANCE|          Hindi
 3   [email protected]            |DRAMA|          Hindi
 6       [email protected]           |ACTION|          Hindi
 7        [email protected]            |DRAMA|          Hindi
 10        [email protected]            |DRAMA|          Hindi
 12  [email protected]            |DRAMA|          Hindi
 15  [email protected]     |BIOPIC|DRAMA|          Hindi
 16    [email protected]  |HORROR|THRILLER|          Hindi
 19    [email protected]          |ROMANCE|          Hindi

As you can see, the result includes rows whose EVENT_GENRE field contains no 'ROMANCE'. But when I search without the '|' characters (i.e. 'ROMANCE' instead of '|ROMANCE|'), I get the expected output.

d = df.query((df['EVENT_GENRE'].str.contains('ROMANCE')) & (df['EVENT_LANGUAGE'] == 'Hindi'))

d
Out[55]: 
     CUSTOMER_MAILID EVENT_GENRE EVENT_LANGUAGE
0   [email protected]   |ROMANCE|          Hindi
2        [email protected]   |ROMANCE|          Hindi
19    [email protected]   |ROMANCE|          Hindi

I then tried other cases: with '|' I got strange results, and without '|' I got the expected results.

I am curious whether the '|' symbol has some effect on the str.contains() method. I strongly suspect it behaves like an "or" operation, because when I tried

dd = df.query(df['EVENT_GENRE'].str.contains('FANTASY|HORROR'))

dd
Out[21]: 
       CUSTOMER_MAILID               EVENT_GENRE EVENT_LANGUAGE  
8     [email protected]  |FANTASY|HORROR|ROMANCE|        English   
16  [email protected]         |HORROR|THRILLER|          Hindi 

It seems to treat FANTASY and HORROR as an "or" operation, but I am not sure.

And dd = df.query(df['EVENT_GENRE'].str.contains('|FANTASY|HORROR|')) selects all rows.

As far as I know, everything inside '' or "" in a string is treated as a plain character (except escapes such as \t, \r, \n). I did not know that logical operators could work this way inside a string (I have often seen & inside a string).

Can anybody please clarify this? Thanks in advance.

Upvotes: 4

Views: 452

Answers (2)

BrenBarn

Reputation: 251355

By default, contains treats your string as a regex to match against the strings. So your "|ROMANCE|" is treated as a regex. Since the first and last alternates are empty (i.e., there is nothing before the first | or after the last), it can match the empty string, so it always matches.
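The empty-alternative behaviour can be checked with the standard re module alone. This is a minimal sketch, assuming str.contains performs the same kind of search as re.search; the sample strings are taken from the question, and the variable names are only illustrative.

```python
import re

# '|ROMANCE|' as a regex means: empty string OR 'ROMANCE' OR empty string.
pattern = re.compile('|ROMANCE|')

genres = ['|ROMANCE|', '|DRAMA|', '|HORROR|THRILLER|']

# The empty alternative matches at position 0 of any string,
# so every entry is reported as a match.
matches = [bool(pattern.search(g)) for g in genres]

# The text actually matched in '|DRAMA|' is the empty string.
empty_match = pattern.search('|DRAMA|').group(0)

print(matches)
print(repr(empty_match))
```

Every string matches because the first alternative is empty, which is why the filter in the question kept all the Hindi rows.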

You can pass the regex=False argument to contains to force it to match only your literal string.
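A small sketch of the regex=False approach, using a hypothetical four-row frame built from values in the question rather than the real my.csv. Plain boolean indexing is used here instead of df.query, which is an assumption about how you would want to filter.

```python
import pandas as pd

# Hypothetical sample built from a few rows of the question's data.
df = pd.DataFrame({
    'EVENT_GENRE': ['|ROMANCE|', '|DRAMA|',
                    '|FANTASY|HORROR|ROMANCE|', '|HORROR|THRILLER|'],
    'EVENT_LANGUAGE': ['Hindi', 'Hindi', 'English', 'Hindi'],
})

# regex=False makes '|ROMANCE|' a literal substring, pipes included.
is_romance = df['EVENT_GENRE'].str.contains('|ROMANCE|', regex=False)
is_hindi = df['EVENT_LANGUAGE'] == 'Hindi'

result = df[is_romance & is_hindi]
print(result)
```

With the literal match, only the Hindi row whose genre really contains |ROMANCE| is selected.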

Upvotes: 6

Anton Protopopov

Reputation: 31662

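Because | is a special character in regular expressions (it separates alternatives), you need to escape it with a backslash. A raw string keeps Python from interpreting the backslash itself:

In [255]: df[df.EVENT_GENRE.str.contains(r'\|ROMANCE\|')]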
Out[255]:
         CUSTOMER_MAILID               EVENT_GENRE EVENT_LANGUAGE
0   [email protected]                 |ROMANCE|          Hindi
2        [email protected]                 |ROMANCE|          Hindi
8       [email protected]  |FANTASY|HORROR|ROMANCE|        English
11  [email protected]        |ROMANCE|THRILLER|        KANNADA
19    [email protected]                 |ROMANCE|          Hindi
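If you would rather not write the backslashes by hand, re.escape from the standard library can build the escaped pattern. This is a sketch on a hypothetical Series of genre strings taken from the question, not the original DataFrame.

```python
import re
import pandas as pd

# Hypothetical sample of genre strings from the question.
genres = pd.Series(['|ROMANCE|', '|DRAMA|',
                    '|FANTASY|HORROR|ROMANCE|', '|ROMANCE|THRILLER|'])

# re.escape puts a backslash before each regex metacharacter, including '|'.
escaped = re.escape('|ROMANCE|')

# The escaped pattern now matches the pipes literally.
mask = genres.str.contains(escaped)
print(escaped)
print(mask.tolist())
```

This is handy when the search term comes from user input or another column and may contain other special characters.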

Upvotes: 2
