same regex but different results in Pandas vs. R

Question

Consider this simple regex aimed at extracting headlines

(\w[\w-]+){2,}

Running it in Python (Pandas) vs. R (stringr) gives totally different results!

In stringr the extraction works as expected: note how 'this-is-a-very-nice-test' is extracted in full

> library(stringr)
> str_extract_all('stackoverflow.stack.com/read/this-is-a-very-nice-test', 
+                 regex('(\\w[-\\w]+){2,}'))
[[1]]
[1] "stackoverflow"            "stack"                    "read"                     "this-is-a-very-nice-test"

In Pandas, well, the output is a bit puzzling

import pandas as pd

myseries = pd.Series({'text' : 'stackoverflow.stack.com/read/this-is-a-very-nice-test'})

myseries.str.extractall(r'(\w[-\w]+){2,}')
Out[51]: 
             0
     match    
text 0      ow
     1      ck
     2      ad
     3      st

What is wrong here?

Thanks!

Wiktor Stribiżew · Accepted Answer

The (\w[-\w]+){2,} regex represents a repeated capturing group:

The repeated capturing group will capture only the last iteration

In the regex demo, the highlighted substrings are the values you get in Pandas with .extractall, because this method expects a "regular expression pattern with capturing groups" and returns "a DataFrame with one row for each match, and one column for each group". With a single capturing group, that one column holds only the last iteration of the group for each match.
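The last-iteration behavior can be reproduced without Pandas, using Python's built-in re module (which the Pandas string methods build on). The snippet below is only a sketch; the names text, pattern, full_matches and last_iterations are made up for illustration.

```python
import re

text = 'stackoverflow.stack.com/read/this-is-a-very-nice-test'
pattern = re.compile(r'(\w[-\w]+){2,}')

# group(0) is the whole match; group(1) keeps only the LAST
# iteration of the repeated capturing group.
full_matches = [m.group(0) for m in pattern.finditer(text)]
last_iterations = [m.group(1) for m in pattern.finditer(text)]

print(full_matches)     # the values stringr returns
print(last_iterations)  # the values .extractall returns
```

The first list corresponds to what str_extract_all shows in the question, the second to the Pandas output.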

Unlike Pandas extractall, R's stringr::str_extract_all omits all captured substrings from its result and only "extracts all matches and returns a list of character vectors".
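One possible way to get the whole matches as the captured value in Python is to make the repeated group non-capturing and wrap the entire pattern in a single capturing group. The sketch below uses the built-in re module; the names fixed_pattern and result are illustrative, and passing the same pattern to .str.extractall is an assumption that should be checked against your Pandas version.

```python
import re

text = 'stackoverflow.stack.com/read/this-is-a-very-nice-test'

# (?:...) repeats without capturing; the outer (...) captures
# the whole repeated run as group 1.
fixed_pattern = r'((?:\w[-\w]+){2,})'

# With exactly one capturing group, re.findall returns that group's text.
result = re.findall(fixed_pattern, text)
print(result)
```

Here the single capturing group spans the full match, so the captured value and the matched text are the same string.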
