Extracting substring by positions in pipe

Question

I would like to extract a substring from every row of the id column of a tibble. I am always interested in the region between the 1st and 3rd space of the original id. The resulting substrings, here Zoe Boston and Jane Rome, should go into a new column, name.

I tried to get the positions of the spaces in every id with str_locate_all and then pass those positions to str_sub. However, I cannot extract the positions correctly.

data <- tibble(id = c("#1265746 Zoe Boston 58962 st. Victory cont_1.0)", "#958463279246 Jane Rome 874593.01 musician band: XYZ 985147") ) %>% 
   mutate(coor =  str_locate_all(id, "\\s"),
   name = str_sub(id, start = coor[[1]], end = coor[[3]] ) )

Ronak Shah · Accepted Answer

You can use regex to extract what you want.

Assuming you have stored your tibble in data, you can use sub to extract the two words that follow the leading id.

sub('^#\\w+\\s(\\w+\\s\\w+).*', '\\1', data$id)
#[1] "Zoe Boston" "Jane Rome"

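^# - the string starts with a hash

\w+ - one or more word characters (the numeric id)

\s - a whitespace character

( - start of the capture group

\w+ - a word

\s - a whitespace character

\w+ - another word

) - end of the capture group

.* - the rest of the string

In an R string literal each backslash has to be doubled (\\w, \\s), and '\\1' in the replacement refers to the text matched by the capture group.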
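Since the question asks for a solution inside a pipe, here is a minimal sketch of the same regex used in mutate. It assumes dplyr is installed and rebuilds the example tibble from the question; the name result is only illustrative.

```r
library(dplyr)

data <- tibble(id = c("#1265746 Zoe Boston 58962 st. Victory cont_1.0)",
                      "#958463279246 Jane Rome 874593.01 musician band: XYZ 985147"))

# Keep only the two words that follow the leading "#id " and drop the rest
result <- data %>%
  mutate(name = sub("^#\\w+\\s(\\w+\\s\\w+).*", "\\1", id))

result$name
#[1] "Zoe Boston" "Jane Rome"
```

Note that \w does not match hyphens or apostrophes, so names containing those characters would need a different pattern.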

The str_locate_all approach is more involved: it returns the start and end positions of every whitespace match, so you need to pick the end of the 1st whitespace and the start of the 3rd, and then use str_sub to extract the text between those positions.

library(dplyr)
library(stringr)
library(purrr)

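data %>%
   mutate(coor = str_locate_all(id, "\\s"),     # one start/end matrix of whitespace positions per id
          start = map_dbl(coor, `[`, 1) + 1,    # first character after the 1st space
          end = map_dbl(coor, `[`, 3) - 1,      # last character before the 3rd space
          name = str_sub(id, start, end)) %>%
   select(id, name)                             # drop the helper columns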

# A tibble: 2 x 2
#  id                                                          name      
#  <chr>                                                       <chr>
#1 #1265746 Zoe Boston 58962 st. Victory cont_1.0)             Zoe Boston
#2 #958463279246 Jane Rome 874593.01 musician band: XYZ 985147 Jane Rome
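To see where those positions come from, here is a small walkthrough on a single id. It is only a sketch, assuming stringr is installed; the variable names pos, first_space and third_space are illustrative.

```r
library(stringr)

id <- "#1265746 Zoe Boston 58962 st. Victory cont_1.0)"

# str_locate_all returns a list; its first element is a matrix with
# columns "start" and "end" and one row per whitespace match
pos <- str_locate_all(id, "\\s")[[1]]

first_space <- pos[1, "end"]    # 9: the space after the id
third_space <- pos[3, "start"]  # 20: the space after "Boston"

# Extract everything strictly between the 1st and 3rd space
name <- str_sub(id, first_space + 1, third_space - 1)
name
#[1] "Zoe Boston"
```

Because each match is a single character, start and end are equal in every row, which is why indexing the matrix with 1 and 3 in the pipeline above picks up the 1st and 3rd space.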
