amk

Reputation: 369

Extracting substring by positions in pipe

I would like to extract a substring from every row of the id column of a tibble. I am always interested in the region between the 1st and 3rd space of the original id. The resulting substrings, here Zoe Boston and Jane Rome, should go into a new column, name.

I tried to get the positions of the spaces in every id with str_locate_all and then pass those positions to str_sub. However, I cannot extract the positions correctly.

data <- tibble(id = c("#1265746 Zoe Boston 58962 st. Victory cont_1.0)", "#958463279246 Jane Rome 874593.01 musician band: XYZ 985147") ) %>% 
   mutate(coor =  str_locate_all(id, "\\s"),
   name = str_sub(id, start = coor[[1]], end = coor[[3]] ) )

Upvotes: 1

Views: 1001

Answers (2)

Ric S

Reputation: 9267

Another possible solution, using the stringr and purrr packages:

library(stringr)
library(purrr)
library(dplyr)

data %>%
  mutate(name = map_chr(str_split(id, " "), ~paste(unlist(.)[2:3], collapse = " ")))

Explanation:

  • str_split(id, " ") splits each id at every space and returns a list with one character vector of terms per row
  • map_chr applies the following function to each element of that list: unlist it (a no-op here, since each element is already a character vector), take the elements in positions 2 and 3 (the name we want), and collapse them into one string separated by a space

Output

# A tibble: 2 x 2
#   id                                                          name      
#   <chr>                                                       <chr>     
# 1 #1265746 Zoe Boston 58962 st. Victory cont_1.0)             Zoe Boston
# 2 #958463279246 Jane Rome 874593.01 musician band: XYZ 985147 Jane Rome 
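
To make the two steps above concrete, here is a minimal sketch that walks through them for a single id taken from the question. The variable names parts and name are only illustrative.

```r
library(stringr)

id <- "#1265746 Zoe Boston 58962 st. Victory cont_1.0)"

# str_split() returns a list with one character vector per input string,
# so [[1]] picks the vector of terms for our single id
parts <- str_split(id, " ")[[1]]

# Elements 2 and 3 are the terms between the 1st and 3rd space
name <- paste(parts[2:3], collapse = " ")
print(name)
```

Inside mutate, map_chr simply repeats the last step for every row.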

Upvotes: 0

Ronak Shah

Reputation: 389175

You can use regex to extract what you want.

Assuming you have stored your tibble in data, you can use sub to extract the first two words that follow the leading #-prefixed id.

sub('^#\\w+\\s(\\w+\\s\\w+).*', '\\1', data$id)
#[1] "Zoe Boston" "Jane Rome" 

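^# - the string starts with a hash

\\w+ - one or more word characters (the id number)

\\s - a whitespace character

( - start of the capture group

\\w+ - a word

\\s - a whitespace character

\\w+ - another word

) - end of the capture group

.* - the rest of the string

The replacement '\\1' substitutes the whole match with the contents of the capture group.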
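
One thing worth keeping in mind with this approach: sub returns the input unchanged when the pattern does not match. The sketch below illustrates that with a made-up second id that lacks the leading hash (the ids and result names are only illustrative).

```r
ids <- c("#1265746 Zoe Boston 58962 st. Victory cont_1.0)",
         "no-hash Zoe Boston")  # hypothetical id without a leading "#"

pattern <- '^#\\w+\\s(\\w+\\s\\w+).*'

# The first id matches and is replaced by the capture group;
# the second does not match, so sub() leaves it as it is
result <- sub(pattern, '\\1', ids)
print(result)
```

If some ids may not follow the expected layout, it may be worth checking with grepl(pattern, ids) before trusting the result.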


The str_locate_all approach is more involved: it returns the start and end positions of every whitespace match, so you need to take the position just after the 1st whitespace and the position just before the 3rd, and then use str_sub to extract the text between those positions.

library(dplyr)
library(stringr)
library(purrr)

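data %>%
   mutate(coor = str_locate_all(id, "\\s"),      # one start/end matrix per id
          start = map_dbl(coor, `[`, 1) + 1,     # character after the 1st space
          end = map_dbl(coor, `[`, 3) - 1,       # character before the 3rd space
          name = str_sub(id, start, end)) %>%
   select(id, name)                              # drop the helper columns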

# A tibble: 2 x 2
#  id                                                          name      
#  <chr>                                                       <chr>     
#1 #1265746 Zoe Boston 58962 st. Victory cont_1.0)             Zoe Boston
#2 #958463279246 Jane Rome 874593.01 musician band: XYZ 985147 Jane Rome 
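
To see where the + 1 and - 1 offsets come from, here is a small sketch that inspects the positions for a single id from the question. The names pos, first_space and third_space are only illustrative.

```r
library(stringr)

id <- "#1265746 Zoe Boston 58962 st. Victory cont_1.0)"

# One matrix per input string; each row is a whitespace match
# with its "start" and "end" position
pos <- str_locate_all(id, "\\s")[[1]]

first_space <- pos[1, "start"]  # position of the 1st space
third_space <- pos[3, "start"]  # position of the 3rd space

# Step inside the two spaces so they are not part of the result
name <- str_sub(id, first_space + 1, third_space - 1)
print(name)
```

Because a single space has identical start and end positions, taking the start column for both boundaries is enough here.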

Upvotes: 1
