Reputation: 19395
Consider this simple regex aimed at extracting headlines:
(\w[\w-]+){2,}
Running it in Python (Pandas) vs. R (stringr) gives totally different results!
In stringr the extraction works correctly: see how 'this-is-a-very-nice-test' is extracted as a single match:
library(stringr)
> str_extract_all('stackoverflow.stack.com/read/this-is-a-very-nice-test',
+ regex('(\\w[-\\w]+){2,}'))
[[1]]
[1] "stackoverflow" "stack" "read" "this-is-a-very-nice-test"
In Pandas, well, the output is a bit puzzling:
import pandas as pd
myseries = pd.Series({'text' : 'stackoverflow.stack.com/read/this-is-a-very-nice-test'})
myseries.str.extractall(r'(\w[-\w]+){2,}')
Out[51]:
             0
     match
text 0      ow
     1      ck
     2      ad
     3      st
What is wrong here?
Thanks!
Upvotes: 1
Views: 206
Reputation: 627103
The (\w[-\w]+){2,} regex represents a repeated capturing group:
The repeated capturing group will capture only the last iteration
See the regex demo: the highlighted substrings are the values you get in Pandas with .extractall, as this method expects a "regular expression pattern with capturing groups" and returns "a DataFrame with one row for each match, and one column for each group".
Unlike Pandas extractall, the R stringr::str_extract_all omits all captured substrings from its result and only "extracts all matches and returns a list of character vectors".
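To make the difference concrete, here is a minimal sketch using only the standard re module (not Pandas itself). It compares the whole match with the value of the repeated group, and then shows one possible workaround: making the inner group non-capturing and wrapping the whole pattern in a single outer capturing group. The variable names are illustrative only.

```python
import re

s = 'stackoverflow.stack.com/read/this-is-a-very-nice-test'
pattern = re.compile(r'(\w[-\w]+){2,}')

# group(0) is the whole match (what stringr reports);
# group(1) holds only the last iteration of the repeated group.
whole = [m.group(0) for m in pattern.finditer(s)]
last = [m.group(1) for m in pattern.finditer(s)]

# Workaround sketch: non-capturing inner group, one outer capturing group.
wrapped = re.findall(r'((?:\w[-\w]+){2,})', s)

print(whole)    # ['stackoverflow', 'stack', 'read', 'this-is-a-very-nice-test']
print(last)     # ['ow', 'ck', 'ad', 'st']
print(wrapped)  # same as whole
```

The wrapped pattern should behave the same way when passed to Series.str.extractall, since that method returns one column per capturing group and the pattern now has exactly one group spanning the full match.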
Upvotes: 1
Reputation: 1959
This works as expected after changing "{2,}" to "{1,}":
import re
s = 'stackoverflow.stack.com/read/this-is-a-very-nice-test'
out = re.findall(r'(\w[-\w]+){1,}', s)
print(out)
output:
['stackoverflow', 'stack', 'com', 'read', 'this-is-a-very-nice-test']
EDIT: Explanation from the Python perspective: the repeating qualifier {m,n}, where m and n are decimal integers, means there must be at least m repetitions and at most n.
In your original example, "{2,}" sets m=2 and leaves n unbounded, which means the group must be repeated at least 2 times. If you set m=1, as in "{1,}", a single occurrence is also accepted, so the greedy group consumes each whole token in one iteration and that is what gets captured. This is equivalent to "+", i.e. you can replace r'(\w[-\w]+){1,}' with r'(\w[-\w]+)+' and still get the same result.
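A quick check of that equivalence, again as a sketch with the standard re module. Note one side effect of lowering the minimum: tokens too short to split into two iterations of at least two characters each, such as 'com', now match as well, so the result is not identical to the stringr output for the original "{2,}" pattern.

```python
import re

s = 'stackoverflow.stack.com/read/this-is-a-very-nice-test'

braces = re.findall(r'(\w[-\w]+){1,}', s)  # quantifier written as {1,}
plus = re.findall(r'(\w[-\w]+)+', s)       # same quantifier written as +
plain = re.findall(r'\w[-\w]+', s)         # no group: findall returns whole matches

print(braces)
# ['stackoverflow', 'stack', 'com', 'read', 'this-is-a-very-nice-test']
print(braces == plus == plain)
# True
```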
Upvotes: 0