ℕʘʘḆḽḘ

Reputation: 19395

Same regex, different results in Pandas vs. R

Consider this simple regex aimed at extracting headlines

(\w[\w-]+){2,}

Running it in Python (Pandas) vs. R (stringr) gives totally different results!

In stringr the extraction works correctly: note how 'this-is-a-very-nice-test' is extracted as a single match

library(stringr)
> str_extract_all('stackoverflow.stack.com/read/this-is-a-very-nice-test', 
+                 regex('(\\w[-\\w]+){2,}'))
[[1]]
[1] "stackoverflow"            "stack"                    "read"                     "this-is-a-very-nice-test"

In Pandas, well, the output is a bit puzzling

import pandas as pd

myseries = pd.Series({'text' : 'stackoverflow.stack.com/read/this-is-a-very-nice-test'})

myseries.str.extractall(r'(\w[-\w]+){2,}')
Out[51]: 
             0
     match    
text 0      ow
     1      ck
     2      ad
     3      st

What is wrong here?

Thanks!

Upvotes: 1

Views: 206

Answers (2)

Wiktor Stribiżew

Reputation: 627103

The (\w[-\w]+){2,} regex represents a repeated capturing group:

The repeated capturing group will capture only the last iteration

See the regex demo: the highlighted substrings are the values you get in Pandas with .extractall, because this method expects a "regular expression pattern with capturing groups" and returns "a DataFrame with one row for each match, and one column for each group". Each whole match here contributes only the text of capturing group 1, i.e. its last iteration.

Unlike Pandas extractall, R's stringr::str_extract_all ignores the captured substrings in its result and only "extracts all matches and returns a list of character vectors".
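To make the difference concrete, here is a minimal sketch using Python's re module (the variable names are only illustrative). It compares the whole match with the value of the repeated group, and then tries one possible workaround that is my own suggestion rather than part of the original pattern: make the repeated group non-capturing and wrap everything in a single outer capturing group.

```python
import re

s = 'stackoverflow.stack.com/read/this-is-a-very-nice-test'

# Repeated capturing group: group(0) is the whole match,
# group(1) holds only the last iteration of the group.
repeated = re.compile(r'(\w[-\w]+){2,}')
whole = [m.group(0) for m in repeated.finditer(s)]
last_iteration = [m.group(1) for m in repeated.finditer(s)]

# Possible workaround (an assumption, not the asker's pattern):
# repeat a non-capturing group and capture the whole run once.
wrapped = re.compile(r'((?:\w[-\w]+){2,})')
captured = [m.group(1) for m in wrapped.finditer(s)]

print(whole)           # whole matches, like stringr's str_extract_all
print(last_iteration)  # last iteration only, like the Pandas output in the question
print(captured)        # whole matches again, now via the capturing group
```

Since extractall returns one column per capturing group, passing the wrapped pattern r'((?:\w[-\w]+){2,})' to myseries.str.extractall should give the full matches in column 0.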

Upvotes: 1

Mahmoud Elshahat

Reputation: 1959

This works as expected after changing "{2,}" to "{1,}":

import re
s = 'stackoverflow.stack.com/read/this-is-a-very-nice-test'
out = re.findall(r'(\w[-\w]+){1,}', s)
print(out)

output:

['stackoverflow', 'stack', 'com', 'read', 'this-is-a-very-nice-test']

EDIT: Explanation from the Python perspective: {m,n} is a repetition qualifier, where m and n are decimal integers. It means there must be at least m repetitions and at most n.

In your original pattern, "{2,}" sets m=2 and leaves n unbounded, so the group must repeat at least 2 times. With m=1, as in "{1,}", a single occurrence is accepted as well. This is equivalent to "+", i.e. you can replace r'(\w[-\w]+){1,}' with r'(\w[-\w]+)+' and still get the same result.
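A quick check of the equivalence, as a small sketch with illustrative variable names. It also shows one caveat worth keeping in mind: lowering the minimum to 1 changes what the pattern matches, so 'com' now appears even though the "{2,}" version skips it (it is too short for two iterations of \w[-\w]+). The last line uses a non-capturing group, which is my own addition for comparison, to keep "{2,}" while still returning whole matches from re.findall.

```python
import re

s = 'stackoverflow.stack.com/read/this-is-a-very-nice-test'

# "{1,}" and "+" are the same quantifier, so both calls return the same list.
at_least_one = re.findall(r'(\w[-\w]+){1,}', s)
plus = re.findall(r'(\w[-\w]+)+', s)

# With "{2,}" re.findall returns group 1, i.e. the last iteration only.
at_least_two = re.findall(r'(\w[-\w]+){2,}', s)

# With no capturing group at all, re.findall returns the whole matches.
whole_matches = re.findall(r'(?:\w[-\w]+){2,}', s)

print(at_least_one)
print(plus)
print(at_least_two)
print(whole_matches)
```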

Upvotes: 0
