Reputation: 349
I am confused about when to use str.findall and when to use str.match.
For example, I have a df that has many lines of text where I need to extract dates.
Let us say I want to check the lines that contain the word Mar (the abbreviation of March).
If I filter the df to the rows where there is a match:
df[df.original.str.match(r'(Mar)')==True]
I got the following output:
204 Mar 10 1976 CPT Code: 90791: No medical servic...
299 March 1974 Primary ...
However, if I try the same regex with str.findall, I get nothing:
0 []
1 []
2 []
3 []
4 []
5 []
6 []
7 []
...
495 []
496 []
497 []
498 []
499 []
Name: original, Length: 500, dtype: object
Why is that? I am sure I am missing something about how match, find, findall, extract and extractall differ.
Upvotes: 0
Views: 1888
Reputation: 651
I'll try to use the example from the documentation to explain this:
s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"])
s
output:
A a1a2
B b1
C c1
dtype: object
We first create the Series like this, and then apply extract, extractall, find and findall to it.
# extract takes a pattern (a regex with capture groups) and returns only the first match in each string
s.str.extract(r"([ab])(\d)", expand=True)
     0    1
A    a    1
B    b    1
C  NaN  NaN
# extractall returns every match, one row per match
s.str.extractall(r"([ab])(\d)")
         0  1
  match
A 0      a  1
  1      a  2
B 0      b  1
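To make the two result shapes easier to tell apart, here is a minimal sketch that uses named groups. The group names letter and digit are my own choice for illustration and are not part of the documentation example.

```python
import pandas as pd

s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"])

# Named groups (hypothetical names) become the column labels.
pattern = r"(?P<letter>[ab])(?P<digit>\d)"

# extract: one row per input string, first match only, NaN when nothing matches.
first = s.str.extract(pattern)

# extractall: one row per match, indexed by (original label, match number).
every = s.str.extractall(pattern)

print(first)
print(every)
```

Note that "C" keeps a row of NaN in the extract result but disappears from the extractall result, because it has no match at all.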
s.str.find(r"([ab])(\d)")  # every value is -1, because find looks for a literal substring, not a regex
s.str.find('a')
A 0
B -1
C -1
dtype: int64
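A small sketch of the same point, rebuilding the Series from above: str.find treats its argument as plain text, while str.contains accepts a regex by default. Bringing in str.contains is my addition for contrast, not something the documentation example uses.

```python
import pandas as pd

s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"])

# The regex is searched for as literal text, so nothing is found.
literal_positions = s.str.find(r"([ab])(\d)")

# A plain substring works: the lowest index of "a", or -1 if it is absent.
a_positions = s.str.find("a")

# str.contains interprets the pattern as a regex (regex=True by default).
has_letter_digit = s.str.contains(r"[ab]\d")

print(literal_positions.tolist(), a_positions.tolist(), has_letter_digit.tolist())
```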
s.str.findall(r"([ab])(\d)")  # takes a string or regex and returns all matches in each string as a list
A [(a, 1), (a, 2)]
B [(b, 1)]
C []
dtype: object
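Coming back to the question, here is a minimal sketch of how str.match and str.findall differ on date-like text. The sample rows are made up (loosely modelled on the question's output), and str.contains is added only for comparison. str.match only succeeds when the pattern matches at the start of the string, whereas str.findall returns a list for every row, which is empty where nothing matched.

```python
import pandas as pd

# Hypothetical sample rows, not the asker's real data.
s = pd.Series(["Mar 10 1976 CPT Code", "seen in March 1974", "no date here"])

# str.match anchors at the start of each string (like re.match).
starts_with_mar = s.str.match(r"Mar")

# str.contains searches anywhere in the string (like re.search).
has_mar = s.str.contains(r"Mar")

# str.findall returns every non-overlapping match as a list (like re.findall).
all_mar = s.str.findall(r"Mar")

# Keep only the rows where findall actually found something.
non_empty = all_mar[all_mar.map(len) > 0]

print(starts_with_mar.tolist())
print(has_mar.tolist())
print(non_empty)
```

If the same thing happened in the question, the rows with matches (204 and 299) may simply be hidden behind the "..." in the printed output, since a long Series is displayed truncated. Filtering out the empty lists, as in the last step, would make them visible.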
Upvotes: 1