Reputation: 512
I was wondering if there was a clean solution using data.table to the following problem possibly using other packages such as stringr.
Suppose I have the following data table
DT <- data.table(name = c("Carlos", "Henry", "John"),
ID = c("US115115, CH123232, AB155, US4445", "CH112, BB53", "US57677777"))
This looks like:
name ID
1: Carlos US115115, CH123232, AB155, US4445
2: Henry CH112, BB53
3: John US57677777
What I want to do is create another column, ID2, say, that takes the column ID and extracts only the "US identities" and creates a new column so that the final data table should look like:
name ID ID2
1: Carlos US115115, CH123232, AB155, US4445 US115115, US4445
2: Henry CH112, BB53 NA
3: John US57677777 US57677777
and where each element is a string. I've been able to code a version where it takes the first "US identity" and discards the rest, but I haven't been able to find a solution that handles multiplicity.
Any help would be greatly appreciated!
Upvotes: 1
Views: 65
Reputation: 887118
We can either use str_extract
to extract the words that start with "US"
library(stringr)
DT[, ID2 := sapply(str_extract_all(ID, "\\bUS\\S*"), toString)]
DT
# name ID ID2
#1: Carlos US115115, CH123232, AB155, US4445 US115115,, US4445
#2: Henry CH112, BB53
#3: John US57677777 US57677777
Or using gsub
DT[, ID2 := gsub("(\\bUS\\S*)(*SKIP)(*F)|.", "", ID, perl = TRUE)]
Or using tidyverse
library(tidyverse)
DT %>%
mutate(ID2 = str_extract_all(ID, "\\bUS\\S*") %>%
map(toString))
Or with base R
using gregexpr
DT$ID2 <- sapply(regmatches(DT$ID, gregexpr("\\bUS\\S*", DT$ID)), toString)
Upvotes: 0
Reputation: 388982
Here are few suggestions inspired from @thelatemail and @chinsoon12
DT$ID1 <- sapply(strsplit(DT$ID, ",\\s*"), function(x)
toString(grep("^US", x, value = TRUE)))
DT
# name ID ID1
#1: Carlos US115115, CH123232, AB155, US4445 US115115, US4445
#2: Henry CH112, BB53
#3: John US57677777 US57677777
Above we are filtering using grep
, we can also use startsWith
to do the same
sapply(strsplit(DT$ID, ",\\s*"), function(x) toString(x[startsWith(x, "US")]))
You can incorporate both the above options in dplyr
chain however another option using dplyr
and tidyr
would be using separate_rows
which might be an overkill for this particular question. We can use str_subset
from stringr
which is the same as grep("^US", x, value = TRUE)
.
library(dplyr)
library(tidyr)
DT %>%
separate_rows(ID) %>%
group_by(name) %>%
summarise(ID1 = toString(ID),
ID2 = toString(stringr::str_subset(ID, "^US")))
Upvotes: 2
Reputation: 25225
A possible approach:
DT[, ID2 := sapply(strsplit(ID, ","),
function(s) paste(s[grepl("\\s*US", s)], collapse=","))]
output:
name ID ID2
1: Carlos US115115, CH123232, AB155, US4445 US115115, US4445
2: Henry CH112, BB53
3: John US57677777 US57677777
Upvotes: 4