Reputation: 19395
Consider this simple example
library(stringr)
library(dplyr)
dataframe <- data_frame(text = c('how is the biggest ??',
                                 'really amazing stuff'))
# A tibble: 2 x 1
  text
  <chr>
1 how is the biggest ??
2 really amazing stuff
I need to extract terms that match a regex, but keep only the longest match.
So far, I have only been able to extract the first match (not necessarily the longest) using str_extract.
> dataframe %>% mutate(mymatch = str_extract(text, regex('\\w+')))
# A tibble: 2 x 2
  text                  mymatch
  <chr>                 <chr>
1 how is the biggest ?? how
2 really amazing stuff  really
I tried to play with str_extract_all but I can't find an efficient syntax.
Output should be:
# A tibble: 2 x 2
  text                  mymatch
  <chr>                 <chr>
1 how is the biggest ?? biggest
2 really amazing stuff  amazing
Any ideas? Thanks!
Upvotes: 3
Views: 801
Reputation: 46876
As a variant of other answers, I'd suggest writing a function that does the manipulation:
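longest_match <- function(x, pattern) {
  # str_extract_all() returns one character vector of full matches per string
  # (str_match_all() would also return capture groups, as matrix columns)
  matches <- str_extract_all(x, pattern)
  purrr::map_chr(matches, ~ .[which.max(nchar(.))])
}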
Then use it
dataframe %>%
  mutate(mymatch = longest_match(text, "\\w+"))
By way of commentary, it seems better practice to isolate the function that does the new stuff, longest_match(), from the manipulations enabled by mutate(). For instance, the function is easy to test, can be used in other circumstances, and can be modified ('return the last rather than first longest match') independently of the data transformation step. There's no real value in sticking everything into one line, so it makes sense to write lines of code that each logically accomplish one thing: find all matches, map from all matches to the longest, and so on.
purrr::map_chr() is better than sapply() because it is more robust: it guarantees that the result is a character vector, so that something like
> df1 = dataframe[FALSE,]
> df1 %>% mutate(mymatch = longest_match(text, "\\w+"))
# A tibble: 0 x 2
# ... with 2 variables: text <chr>, mymatch <chr>
'does the right thing', i.e., mymatch is a character vector (sapply() would return a list in this case).
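To illustrate the point about modifying the function independently of the data step, here is a minimal sketch of a variant. The name longest_match2(), the ties argument, and the NA fallback for strings with no match are all assumptions made for illustration, not part of the answer above. It uses str_extract_all(), as the other answers do.

```r
library(stringr)
library(purrr)

# Hypothetical variant of longest_match(): `ties` and the NA fallback are
# illustrative assumptions, not part of the original answer.
longest_match2 <- function(x, pattern, ties = c("first", "last")) {
  ties <- match.arg(ties)
  matches <- str_extract_all(x, pattern)
  map_chr(matches, function(m) {
    if (length(m) == 0) return(NA_character_)  # no match in this string
    longest <- m[nchar(m) == max(nchar(m))]    # all matches of maximal length
    if (ties == "first") longest[1] else longest[length(longest)]
  })
}

first_tie <- longest_match2(c("really amazing biggest stuff", "??"), "\\w+")
last_tie  <- longest_match2("really amazing biggest stuff", "\\w+", ties = "last")
```

Because the helper always returns a character vector with one element per input string, it can be dropped into mutate() in place of longest_match() without changing the pipeline.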
Upvotes: 2
Reputation: 18691
You can do something like this:
library(stringr)
library(dplyr)
dataframe %>%
  mutate(mymatch = sapply(str_extract_all(text, '\\w+'),
                          function(x) x[nchar(x) == max(nchar(x))][1]))
With purrr:
library(purrr)
dataframe %>%
  mutate(mymatch = map_chr(str_extract_all(text, '\\w+'),
                           ~ .[nchar(.) == max(nchar(.))][1]))
Result:
# A tibble: 2 x 2
text mymatch
<chr> <chr>
1 how is the biggest ?? biggest
2 really amazing stuff amazing
Note:
If there is a tie, this takes the first one.
Data (a variant of the question's data with a tie in the second row, between 'amazing' and 'biggest'):
dataframe <- data_frame(text = c('how is the biggest ??',
'really amazing biggest stuff'))
Upvotes: 7
Reputation: 18425
Or, using purrr...
library(dplyr)
library(purrr)
library(stringr)
dataframe %>% mutate(mymatch=map_chr(str_extract_all(text,"\\w+"),
~.[which.max(nchar(.))]))
# A tibble: 2 x 2
text mymatch
<chr> <chr>
1 how is the biggest ?? biggest
2 really amazing stuff amazing
Upvotes: 1
Reputation: 24089
An easy way is to break the process down into two steps: first build a list of all the words in each row, then find and return the longest word from each sublist:
library(dplyr)    # provides data_frame()
library(stringr)
df <- data_frame(text = c('how is the biggest ??',
                          'really amazing stuff'))
# create a list of all words per row
splits <- str_extract_all(df$text, '\\w+', simplify = FALSE)
# find the longest word in each row and return it
sapply(splits, function(x) x[which.max(nchar(x))])
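If you want to stay in base R for the second step but still get a guaranteed character vector, vapply() is one option. The following is only a sketch under that assumption. The variable names longest and empty are illustrative, and it still assumes every row contains at least one match.

```r
library(stringr)

text <- c('how is the biggest ??', 'really amazing stuff')
splits <- str_extract_all(text, '\\w+')

# vapply() checks that each result is a length-one character vector
longest <- vapply(splits, function(x) x[which.max(nchar(x))], character(1))

# on empty input it returns character(0), where sapply() would return list()
empty <- vapply(list(), function(x) x[which.max(nchar(x))], character(1))
```

A row with no match at all would make vapply() stop with an error rather than silently return a list, which is usually easier to debug.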
Upvotes: 2