Reputation: 7286
I expected both of these to give the same answer:
train = pd.read_csv('https://raw.github.com/mattdelhey/kaggle-titanic/master/Data/train.csv')
train.name.str.contains('Mr.').sum()
(train.name.str.find('Mr.')>0).sum()
but output is:
647
517
What is the reason for the different results?
Upvotes: 1
Views: 56
Reputation: 863741
The difference is that str.contains also matches Mrs., because it treats the pattern as a regular expression by default and . is a special regex character (it matches any character). You need to escape it or add the parameter regex=False:
print(train.name.str.contains(r'Mr\.').sum())
517
print(train.name.str.contains('Mr.', regex=False).sum())
517
print((train.name.str.find('Mr.')>0).sum())
517
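The same behaviour can be reproduced without downloading the dataset. The following is a minimal sketch on a small made-up Series (the three names are illustrative assumptions, not rows checked against the CSV); it uses re.escape as one way to escape the dot:

```python
import re
import pandas as pd

# Hypothetical sample names, chosen only to illustrate the behaviour.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley",
    "Heikkinen, Miss. Laina",
])

# Default: the pattern is a regex, so "." matches any character ("Mrs" matches too).
as_regex = names.str.contains("Mr.")
# Escaped dot: only a literal "Mr." matches.
escaped = names.str.contains(re.escape("Mr."))
# regex=False: plain substring test, no regex interpretation.
literal = names.str.contains("Mr.", regex=False)
# str.find returns the position of the substring, or -1 when it is absent.
found = names.str.find("Mr.") >= 0

print(as_regex.tolist())  # [True, True, False]
print(escaped.tolist())   # [True, False, False]
print(literal.tolist())   # [True, False, False]
print(found.tolist())     # [True, False, False]
```

Note that the sketch compares with >= 0 rather than > 0: str.find returns 0 for a match at the very start of the string, which > 0 would miss. In this dataset the names start with the surname, so it makes no difference to the counts above.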
Testing the difference:
a = train.loc[train.name.str.contains('Mr.'), 'name']
b = train.loc[(train.name.str.find('Mr.')>0), 'name']
c = pd.concat([a, b], axis=1, keys=('contains','find'))
c = c[c.isnull().any(axis=1)]
print(c)
contains find
1 Cumings, Mrs. John Bradley (Florence Briggs Th... NaN
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) NaN
8 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) NaN
9 Nasser, Mrs. Nicholas (Adele Achem) NaN
15 Hewlett, Mrs. (Mary D Kingcome) NaN
18 Vander Planke, Mrs. Julius (Emelia Maria Vande... NaN
19 Masselmani, Mrs. Fatima NaN
25 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... NaN
31 Spencer, Mrs. William Augustus (Marie Eugenie) NaN
40 Ahlin, Mrs. Johan (Johanna Persdotter Larsson) NaN
41 Turpin, Mrs. William John Robert (Dorothy Ann ... NaN
49 Arnold-Franchi, Mrs. Josef (Josefine Franchi) NaN
52 Harper, Mrs. Henry Sleeper (Myna Haxtun) NaN
53 Faunthorpe, Mrs. Lizzie (Elizabeth Anne Wilkin... NaN
66 Nye, Mrs. (Elizabeth Ramell) NaN
85 Backstrom, Mrs. Karl Alfred (Maria Mathilda Gu... NaN
...
...
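An alternative way to list the rows where the two methods disagree is to compare the boolean masks directly instead of concatenating and filtering on NaN. This is only a sketch on made-up names (the Series below is a hypothetical stand-in for train.name):

```python
import pandas as pd

# Hypothetical stand-in for train.name.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley",
    "Heikkinen, Miss. Laina",
    "Masselmani, Mrs. Fatima",
], name="name")

mask_contains = names.str.contains("Mr.")   # regex: also matches "Mrs"
mask_find = names.str.find("Mr.") > 0       # literal substring search

# Keep only the rows where the two masks give different answers.
disagree = names[mask_contains != mask_find]
print(disagree)
```

On this sample only the two "Mrs." rows are selected, which mirrors the pattern in the output above.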
Upvotes: 1