plausibly_exogenous

Reputation: 512

Subsetting character string and returning string

I was wondering if there is a clean data.table solution to the following problem, possibly using other packages such as stringr.

Suppose I have the following data table

library(data.table)

DT <- data.table(name = c("Carlos", "Henry", "John"), 
    ID = c("US115115, CH123232, AB155, US4445", "CH112, BB53", "US57677777"))

This looks like:

     name                                ID
1: Carlos US115115, CH123232, AB155, US4445
2:  Henry                       CH112, BB53
3:   John                        US57677777

What I want to do is create another column, ID2 say, that extracts only the "US identities" from the ID column, so that the final data table looks like:

     name                                ID              ID2
1: Carlos US115115, CH123232, AB155, US4445 US115115, US4445
2:  Henry                       CH112, BB53               NA
3:   John                        US57677777       US57677777                     

where each element of ID2 is a single string. I've been able to write a version that takes the first "US identity" and discards the rest, but I haven't found a solution that handles multiple matches.
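For reference, the kind of single-match version I mean is roughly this (a sketch using stringr::str_extract, which returns only the first match per string):

library(stringr)

# keeps just the first US token in each ID and drops any later ones
DT[, ID2 := str_extract(ID, "\\bUS\\w+")]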

Any help would be greatly appreciated!

Upvotes: 1

Views: 65

Answers (3)

akrun

Reputation: 887118

We can use str_extract_all to extract all the words that start with "US" and then collapse them with toString:

library(stringr)
DT[, ID2 := sapply(str_extract_all(ID, "\\bUS\\w+"), toString)]
DT
#     name                                ID              ID2
#1: Carlos US115115, CH123232, AB155, US4445 US115115, US4445
#2:  Henry                       CH112, BB53                 
#3:   John                        US57677777       US57677777

Or use gsub with the PCRE (*SKIP)(*F) control verbs, which skip over (and so keep) the US tokens while every other character is matched by . and removed:

DT[, ID2 := gsub("(\\bUS\\S*)(*SKIP)(*F)|.", "", ID, perl = TRUE)]

Or, using the tidyverse (map_chr collapses each set of matches into a plain character column rather than a list column):

library(tidyverse)
DT %>%
    mutate(ID2 = str_extract_all(ID, "\\bUS\\w+") %>%
                   map_chr(toString))

Or with base R, using gregexpr and regmatches:

DT$ID2 <- sapply(regmatches(DT$ID, gregexpr("\\bUS\\w+", DT$ID)), toString)
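Note that all of the above give Henry an empty string rather than the NA shown in the desired output. If NA is really wanted, a small follow-up step (a sketch, assuming ID2 has already been created as above) converts the empty strings:

# turn empty results into NA_character_ to match the expected output
DT[ID2 == "", ID2 := NA_character_]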

Upvotes: 0

Ronak Shah

Reputation: 388982

Here are a few suggestions inspired by @thelatemail and @chinsoon12.

DT$ID1 <- sapply(strsplit(DT$ID, ",\\s*"), function(x) 
                            toString(grep("^US", x, value = TRUE)))
DT
#     name                                ID              ID1
#1: Carlos US115115, CH123232, AB155, US4445 US115115, US4445
#2:  Henry                       CH112, BB53                 
#3:   John                        US57677777       US57677777

Above we filter with grep; we can also use startsWith to do the same:

sapply(strsplit(DT$ID, ",\\s*"), function(x) toString(x[startsWith(x, "US")]))
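Both of these also drop straight into a data.table assignment, since the question asks for one (a sketch of the first option, using the column name ID2 from the question):

library(data.table)

# same strsplit + grep logic, assigning by reference inside the data.table
DT[, ID2 := sapply(strsplit(ID, ",\\s*"),
                   function(x) toString(grep("^US", x, value = TRUE)))]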

You can incorporate both of the above options into a dplyr chain. Another option, using dplyr and tidyr, is separate_rows, which might be overkill for this particular question. Here str_subset from stringr does the same job as grep("^US", x, value = TRUE).

library(dplyr)
library(tidyr)

DT %>%
  separate_rows(ID) %>%
  group_by(name) %>%
  summarise(ID1 = toString(ID), 
            ID2 = toString(stringr::str_subset(ID, "^US")))
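By default separate_rows splits on runs of non-alphanumeric characters, which happens to work here; if the IDs could ever contain other punctuation, passing the separator explicitly keeps the split predictable (a sketch of the same chain with an explicit sep):

DT %>%
  separate_rows(ID, sep = ",\\s*") %>%      # split only on commas plus optional spaces
  group_by(name) %>%
  summarise(ID1 = toString(ID), 
            ID2 = toString(stringr::str_subset(ID, "^US")))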

Upvotes: 2

chinsoon12

Reputation: 25225

A possible approach:

DT[, ID2 := sapply(strsplit(ID, ","), 
    function(s) paste(s[grepl("^\\s*US", s)], collapse=","))]

output:

     name                                ID              ID2
1: Carlos US115115, CH123232, AB155, US4445 US115115, US4445
2:  Henry                       CH112, BB53                 
3:   John                        US57677777       US57677777
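The spacing in ID2 works out because strsplit leaves the space after each comma attached to the following element. A slightly more explicit variant (a sketch) trims each piece first and rejoins with ", ":

DT[, ID2 := sapply(strsplit(ID, ","), function(s) {
    s <- trimws(s)                                # drop the space left after each comma
    paste(s[startsWith(s, "US")], collapse = ", ")
})]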

Upvotes: 4
