Reputation: 19395

How to extract the longest match?

Consider this simple example

library(stringr)
library(dplyr)

dataframe <- data_frame(text = c('how is the biggest ??',
                                 'really amazing stuff'))

# A tibble: 2 x 1
  text                 
  <chr>                
1 how is the biggest ??
2 really amazing stuff

I need to extract terms based on a regular expression, but keep only the longest match in each string.

So far, I have only been able to extract the first match (not necessarily the longest) using str_extract.

> dataframe %>% mutate(mymatch = str_extract(text, regex('\\w+')))
# A tibble: 2 x 2
  text                  mymatch
  <chr>                 <chr>  
1 how is the biggest ?? how    
2 really amazing stuff  really

I tried to play with str_extract_all but I can't find an efficient syntax. The output should be:

# A tibble: 2 x 2
  text                  mymatch
  <chr>                 <chr>  
1 how is the biggest ?? biggest
2 really amazing stuff  amazing

Any ideas? Thanks!

Upvotes: 3

Answers (4)

Martin Morgan

Reputation: 46876

As a variant of other answers, I'd suggest writing a function that does the manipulation

longest_match <- function(x, pattern) {
    matches <- str_match_all(x, pattern)
    purrr::map_chr(matches, ~ .[which.max(nchar(.))])
}

Then use it

dataframe %>%
    mutate(mymatch = longest_match(text, "\\w+"))

By way of commentary, it seems better practice to isolate the function that does the new work, longest_match(), from the manipulations enabled by mutate(). For instance, the function is easy to test, can be used in other circumstances, and can be modified ('return the last rather than first longest match') independently of the data transformation step. There's no real value in sticking everything into one line, so it makes sense to write lines of code that each logically accomplish one thing: find all matches, map from all matches to longest, and so on.

purrr::map_chr() is better than sapply() because it is more robust: it guarantees that the result is a character vector, so that something like

> df1 = dataframe[FALSE,]
> df1 %>% mutate(mymatch = longest_match(text, "\\w+"))
# A tibble: 0 x 2
# ... with 2 variables: text <chr>, mymatch <chr>

'does the right thing', i.e., mymatch is a character vector (sapply() would return a list in this case).
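The "modify it independently" point can be sketched as follows. This is only an illustration: the name longest_match_ties() and its ties argument are hypothetical, and it uses str_extract_all() rather than str_match_all() so that each element is a plain character vector. It also returns NA when a string has no match at all, which is an assumption about the desired behaviour.

```r
library(stringr)
library(purrr)

# Hypothetical variant of longest_match(): `ties` chooses which of several
# equally long matches to keep. Not part of the original answer.
longest_match_ties <- function(x, pattern, ties = c("first", "last")) {
    ties <- match.arg(ties)
    matches <- str_extract_all(x, pattern)
    map_chr(matches, function(m) {
        # Assumption: a string with no match yields NA
        if (length(m) == 0) return(NA_character_)
        longest <- m[nchar(m) == max(nchar(m))]
        if (ties == "first") longest[1] else longest[length(longest)]
    })
}

examples <- c("how is the biggest ??", "really amazing biggest stuff", "??")
first_longest <- longest_match_ties(examples, "\\w+")
last_longest  <- longest_match_ties(examples, "\\w+", ties = "last")
```

Because the function stands alone, the mutate() call does not change: mutate(mymatch = longest_match_ties(text, "\\w+", ties = "last")).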

Upvotes: 2

acylam

Reputation: 18691

You can do something like this:

library(stringr)
library(dplyr)

dataframe %>%
  mutate(mymatch = sapply(str_extract_all(text, '\\w+'), 
                          function(x) x[nchar(x) == max(nchar(x))][1]))

With purrr:

library(purrr)

dataframe %>%
  mutate(mymatch = map_chr(str_extract_all(text, '\\w+'), 
                           ~ .[nchar(.) == max(nchar(.))][1]))

Result:

# A tibble: 2 x 2
                   text mymatch
                  <chr>   <chr>
1 how is the biggest ?? biggest
2  really amazing stuff amazing

Note:

If there is a tie, this takes the first one.

Data:

dataframe <- data_frame(text = c('how is the biggest ??',
                                 'really amazing biggest stuff'))
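
To see the tie handling on the second row of this data, here is a small step-by-step sketch of what the filtering expression does for a single string (the variable names words and tied are just for illustration):

```r
library(stringr)

# All words in the second row of the data above
words <- str_extract_all('really amazing biggest stuff', '\\w+')[[1]]

nchar(words)
# 6 7 7 5

# Keep every word whose length equals the maximum: two words tie here
tied <- words[nchar(words) == max(nchar(words))]
tied
# "amazing" "biggest"

# [1] keeps the first of the tied words
tied[1]
# "amazing"
```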

Upvotes: 7

Andrew Gustar

Reputation: 18425

Or, using purrr...

library(dplyr)
library(purrr)
library(stringr)

dataframe %>% mutate(mymatch=map_chr(str_extract_all(text,"\\w+"),
                                     ~.[which.max(nchar(.))]))

# A tibble: 2 x 2
  text                  mymatch
  <chr>                 <chr>  
1 how is the biggest ?? biggest
2 really amazing stuff  amazing

Upvotes: 1

Dave2e

Reputation: 24089

An easy way is to break the process down into two steps: first build a list that holds, for each row, a vector of all its words; then find and return the longest word from each element of that list:

library(dplyr)    # provides data_frame() (tibble() in current versions)
library(stringr)

df <- data_frame(text = c('how is the biggest ??',
                          'really amazing stuff'))

# create a list of all words per row
splits <- str_extract_all(df$text, '\\w+', simplify = FALSE)
# find the longest word in each row and return it
sapply(splits, function(x) x[which.max(nchar(x))])
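
One caveat with the sapply() step: if a row contains no words at all, that element is character(0) and sapply() returns a list instead of a character vector. A possible base R sketch that always returns a character vector uses vapply(); returning NA for rows without a match is an assumption here, not something the question specifies.

```r
library(stringr)

# Second string has no \w+ match, so its element is character(0)
text <- c('how is the biggest ??', '??')
splits <- str_extract_all(text, '\\w+')

# vapply() enforces one string per element; NA stands in for "no match"
longest <- vapply(splits, function(x) {
    if (length(x) == 0) NA_character_ else x[which.max(nchar(x))]
}, character(1))
```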

Upvotes: 2
