Reputation: 5907
I have a DataFrame named df, loaded with df = pd.read_csv('my.csv'):
CUSTOMER_MAILID EVENT_GENRE EVENT_LANGUAGE
0 [email protected] |ROMANCE| Hindi
1 [email protected] |DRAMA| TAMIL
2 [email protected] |ROMANCE| Hindi
3 [email protected] |DRAMA| Hindi
4 [email protected] |ACTION|ADVENTURE|SCI-FI| English
5 [email protected] |ACTION|ADVENTURE|COMEDY| English
6 [email protected] |ACTION| Hindi
7 [email protected] |DRAMA| Hindi
8 [email protected] |FANTASY|HORROR|ROMANCE| English
9 [email protected] |ACTION|ADVENTURE|THRILLER| English
10 [email protected] |DRAMA| Hindi
11 [email protected] |ROMANCE|THRILLER| KANNADA
12 [email protected] |DRAMA| Hindi
13 [email protected] |ACTION|ADVENTURE|DRAMA| English
14 [email protected] |ACTION|ADVENTURE|DRAMA| TELUGU
15 [email protected] |BIOPIC|DRAMA| Hindi
16 [email protected] |HORROR|THRILLER| Hindi
17 [email protected] |ACTION|COMEDY|THRILLER| ODIA
18 [email protected] |ACTION|ADVENTURE|SCI-FI| English
19 [email protected] |ROMANCE| Hindi
When querying it, I found a discrepancy: str.contains did not return the output I expected.
d = df[df['EVENT_GENRE'].str.contains('|ROMANCE|') & (df['EVENT_LANGUAGE'] == 'Hindi')]
d
Out[53]:
CUSTOMER_MAILID EVENT_GENRE EVENT_LANGUAGE
0 [email protected] |ROMANCE| Hindi
2 [email protected] |ROMANCE| Hindi
3 [email protected] |DRAMA| Hindi
6 [email protected] |ACTION| Hindi
7 [email protected] |DRAMA| Hindi
10 [email protected] |DRAMA| Hindi
12 [email protected] |DRAMA| Hindi
15 [email protected] |BIOPIC|DRAMA| Hindi
16 [email protected] |HORROR|THRILLER| Hindi
19 [email protected] |ROMANCE| Hindi
As you can see, most of the returned rows have no 'ROMANCE' in the EVENT_GENRE field. But when I drop the '|' characters, i.e. change '|ROMANCE|' to 'ROMANCE', I get the expected output.
d = df[df['EVENT_GENRE'].str.contains('ROMANCE') & (df['EVENT_LANGUAGE'] == 'Hindi')]
d
Out[55]:
CUSTOMER_MAILID EVENT_GENRE EVENT_LANGUAGE
0 [email protected] |ROMANCE| Hindi
2 [email protected] |ROMANCE| Hindi
19 [email protected] |ROMANCE| Hindi
I then tried different scenarios: with '|' I got strange results, and without '|' I got the expected results.
I am curious whether the '|' symbol has some effect on the str.contains() method. I strongly suspect it behaves like an "or" operation, because when I tried
dd = df[df['EVENT_GENRE'].str.contains('FANTASY|HORROR')]
dd
Out[21]:
CUSTOMER_MAILID EVENT_GENRE EVENT_LANGUAGE
8 [email protected] |FANTASY|HORROR|ROMANCE| English
16 [email protected] |HORROR|THRILLER| Hindi
It seems to treat FANTASY and HORROR as an "or" operation, though I am not sure.
And with dd = df[df['EVENT_GENRE'].str.contains('|FANTASY|HORROR|')] it selects all the data.
To my knowledge, everything inside a string enclosed in '' or "" is treated as plain characters (except escapes such as \t, \r, \n). I did not know that logical operators could work this way inside a string (I have often seen & inside a string).
Can anybody please clarify this? Thanks in advance.
Upvotes: 4
Views: 452
Reputation: 251355
By default, contains treats your string as a regex to match against the strings. So your "|ROMANCE|" is treated as a regex. Since the first and last alternates are empty (i.e., there is nothing before the first | or after the last), it can match the empty string, so it always matches.
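The empty-alternate behaviour can be checked with Python's own re module, without pandas. This is a minimal sketch; the sample strings are made up for illustration, and it assumes that a regex search inside contains behaves like re.search here.

```python
import re

# '|ROMANCE|' splits into three alternatives: '', 'ROMANCE', ''.
# The empty alternative matches at position 0 of any string.
pattern = re.compile('|ROMANCE|')

samples = ['|DRAMA|', '|ROMANCE|', '']  # illustrative values only
matches = [bool(pattern.search(s)) for s in samples]
print(matches)  # [True, True, True]

# Without the leading/trailing '|', the pattern is a plain "or".
either = re.compile('FANTASY|HORROR')
or_matches = [bool(either.search(s)) for s in ['|HORROR|THRILLER|', '|DRAMA|']]
print(or_matches)  # [True, False]
```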
You can pass the regex=False argument to contains to force it to match only your literal string.
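As a rough illustration of the difference, the sketch below runs both modes on a small hand-made Series (the values are copied from the EVENT_GENRE column in the question; the variable names are only for this example).

```python
import pandas as pd

# A few genre strings taken from the question's data.
genres = pd.Series([
    '|ROMANCE|',
    '|DRAMA|',
    '|FANTASY|HORROR|ROMANCE|',
    '|ROMANCE|THRILLER|',
])

# Default (regex=True): the empty alternatives match every row.
as_regex = genres.str.contains('|ROMANCE|')

# regex=False: plain substring test for the literal text '|ROMANCE|'.
as_literal = genres.str.contains('|ROMANCE|', regex=False)

print(as_regex.tolist())    # [True, True, True, True]
print(as_literal.tolist())  # [True, False, True, True]
```

The literal mask can then be combined with other conditions, for example `df[mask & (df['EVENT_LANGUAGE'] == 'Hindi')]`.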
Upvotes: 6
Reputation: 31662
Because | is a special character in a regex, you need to escape it with the \ symbol:
In [255]: df[df.EVENT_GENRE.str.contains(r'\|ROMANCE\|')]
Out[255]:
CUSTOMER_MAILID EVENT_GENRE EVENT_LANGUAGE
0 [email protected] |ROMANCE| Hindi
2 [email protected] |ROMANCE| Hindi
8 [email protected] |FANTASY|HORROR|ROMANCE| English
11 [email protected] |ROMANCE|THRILLER| KANNADA
19 [email protected] |ROMANCE| Hindi
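If the search text comes from a variable, escaping by hand is error-prone. One possible alternative (not part of the original answer) is to let re.escape add the backslashes; the sketch below uses made-up sample strings and only the standard library.

```python
import re

literal = '|ROMANCE|'
# re.escape backslash-escapes regex metacharacters such as '|'.
escaped = re.escape(literal)
pattern = re.compile(escaped)

genres = ['|ROMANCE|', '|DRAMA|', '|ROMANCE|THRILLER|']  # illustrative values
hits = [g for g in genres if pattern.search(g)]
print(hits)  # ['|ROMANCE|', '|ROMANCE|THRILLER|']
```

The escaped string should also be usable as the pattern passed to str.contains, since that method accepts a regex string by default.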
Upvotes: 2