Reputation: 369
I would like to extract a substring from every row of the id column of a tibble. I am always interested in the region between the 1st and 3rd space of the original id. The resulting substrings, here Zoe Boston and Jane Rome, should go into a new column, name.
I tried to get the positions of the spaces in every id with str_locate_all and then pass those positions to str_sub. However, I cannot extract the positions correctly.
data <- tibble(id = c("#1265746 Zoe Boston 58962 st. Victory cont_1.0)", "#958463279246 Jane Rome 874593.01 musician band: XYZ 985147") ) %>%
mutate(coor = str_locate_all(id, "\\s"),
name = str_sub(id, start = coor[[1]], end = coor[[3]] ) )
Upvotes: 1
Views: 1001
Reputation: 9267
Another possible solution, using the stringr and purrr packages:
library(stringr)
library(purrr)
library(dplyr)
data %>%
mutate(name = map_chr(str_split(id, " "), ~paste(unlist(.)[2:3], collapse = " ")))
Explanation:
str_split(id, " ") - creates a list with one character vector per id, holding the terms of that id that were separated by a space.
map_chr - takes each of these vectors and applies the following function to it: unlist it, take the elements in positions 2 and 3 (which are the name we want), and collapse them with a single space between them.
Output:
# A tibble: 2 x 2
# id name
# <chr> <chr>
# 1 #1265746 Zoe Boston 58962 st. Victory cont_1.0) Zoe Boston
# 2 #958463279246 Jane Rome 874593.01 musician band: XYZ 985147 Jane Rome
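To make the intermediate step more concrete, here is a minimal sketch of what str_split returns for the two ids from the question and how elements 2 and 3 are picked out. It uses base R's vapply in place of map_chr only to keep the example free of purrr; the variable names pieces and name are just illustrative.

```r
library(stringr)

# Same two ids as in the question
id <- c("#1265746 Zoe Boston 58962 st. Victory cont_1.0)",
        "#958463279246 Jane Rome 874593.01 musician band: XYZ 985147")

# str_split() returns a list with one character vector per id
pieces <- str_split(id, " ")

# Elements 2 and 3 of each vector are the two name tokens;
# paste them back together with a single space
name <- vapply(pieces, function(p) paste(p[2:3], collapse = " "), character(1))
name
```

Note that this relies on the fields being separated by exactly one space; consecutive spaces would produce empty elements and shift the positions.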
Upvotes: 0
Reputation: 389175
You can use a regex to extract what you want.
Assuming you have stored your tibble in data, you can use sub to extract the two words that follow the leading #-prefixed id.
sub('^#\\w+\\s(\\w+\\s\\w+).*', '\\1', data$id)
#[1] "Zoe Boston" "Jane Rome"
^# - starts with a hash
\\w+ - a word
\\s - whitespace
( - start of capture group
\\w+ - a word
\\s - followed by whitespace
\\w+ - another word
) - end of capture group
.* - the remaining string
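One thing worth keeping in mind with this pattern: sub returns the input unchanged when the pattern does not match. The sketch below, which assumes a hypothetical third id that does not fit the format, shows one way to turn such cases into NA instead. The names pattern, matched and extracted are illustrative only.

```r
id <- c("#1265746 Zoe Boston 58962 st. Victory cont_1.0)",
        "#958463279246 Jane Rome 874593.01 musician band: XYZ 985147",
        "no leading hash here")   # hypothetical id that does not fit the pattern

pattern <- "^#\\w+\\s(\\w+\\s\\w+).*"

# Replace the whole string with the contents of the capture group
extracted <- sub(pattern, "\\1", id)

# sub() leaves non-matching strings untouched, so flag them explicitly
matched <- grepl(pattern, id)
name <- ifelse(matched, extracted, NA_character_)
name
```

Whether NA is the right fallback depends on the data; if every id is known to follow the format, the plain sub call is enough.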
The str_locate_all approach is more complex: it first returns the positions of every whitespace character, then you need to take the position just after the 1st whitespace and the position just before the 3rd, and finally use str_sub to extract the text between those two positions.
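library(dplyr)
library(stringr)
library(purrr)
data %>%
  mutate(coor = str_locate_all(id, "\\s"),     # start/end matrix of every whitespace, per id
         start = map_dbl(coor, `[`, 1) + 1,    # first character after the 1st whitespace
         end = map_dbl(coor, `[`, 3) - 1,      # last character before the 3rd whitespace
         name = str_sub(id, start, end)) %>%
  select(id, name)                             # drop the helper columns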
# A tibble: 2 x 2
# id name
# <chr> <chr>
#1 #1265746 Zoe Boston 58962 st. Victory cont_1.0) Zoe Boston
#2 #958463279246 Jane Rome 874593.01 musician band: XYZ 985147 Jane Rome
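To see why the positional indexing with `[` works, it can help to look at what str_locate_all returns for a single id. The following is a minimal sketch for the first id only; it indexes the matrix by row and column name rather than by position, which is an alternative spelling of the same idea, not the code used above.

```r
library(stringr)

id <- "#1265746 Zoe Boston 58962 st. Victory cont_1.0)"

# str_locate_all() returns a list; each element is an integer matrix
# with one row per match and the columns "start" and "end"
coor <- str_locate_all(id, "\\s")[[1]]

# Address row and column explicitly instead of relying on positional indexing
start <- coor[1, "end"] + 1     # first character after the 1st whitespace
end   <- coor[3, "start"] - 1   # last character before the 3rd whitespace

name <- str_sub(id, start, end)
name
```

Because a single whitespace character has the same start and end position, the positional version (elements 1 and 3 of the matrix, read column by column) picks the same values as long as the id contains at least three whitespace characters.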
Upvotes: 1