user2340706

Reputation: 371

String splitting on any character in a list

Is there a more elegant solution than the code below? Basically, I want to strsplit on any of a vector of characters and keep the part before the match. Is there a better approach, for example using %in% or something else?

library(stringr)  # provides str_detect()

data_d <- data.frame(id = c('A', 'B', 'C'),
                     sentence = c('1. this is A sentence',
                                  '2. this is B sentence',
                                  '3. this is C sentence'),
                     stringsAsFactors = FALSE)
listasd <- c('A', 'B', 'C')
data_d$first <- NA
# For each separator, keep the text before it in the rows that contain it
for (i in listasd)
  data_d$first <-  ifelse(str_detect(data_d$sentence, i),
                          sapply(strsplit(data_d$sentence, i), "[", 1),
                          data_d$first)

Upvotes: 1

Views: 71

Answers (3)

bartoszukm

Reputation: 703

Consider using the stringi package, which allows a slightly more elegant solution:

library(stringi)
listasd <- c('C', 'A', 'B')
stri_split_regex(data_d$sentence, stri_paste(listasd, collapse="|"), n=2, simplify = TRUE)[,1]

It returns a vector holding the part of each sentence before the first match, without using sapply:

[1] "1. this is " "2. this is " "3. this is "

So you can fill the column without an explicit loop, which tends to be slow in R:

data_d$first <- stri_split_regex(data_d$sentence, stri_paste(listasd, collapse="|"), n=2, simplify = TRUE)[,1]
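If installing stringi is not an option, the same idea (one regex that alternates over all separators) can be sketched in base R. This is only an illustration under the assumption that the separators contain no regex metacharacters; the variable name `pattern` is introduced here and is not part of the original answer.

```r
# Same data as in the question
data_d <- data.frame(id = c('A', 'B', 'C'),
                     sentence = c('1. this is A sentence',
                                  '2. this is B sentence',
                                  '3. this is C sentence'),
                     stringsAsFactors = FALSE)
listasd <- c('C', 'A', 'B')

# Build a single alternation pattern: "C|A|B"
pattern <- paste(listasd, collapse = "|")

# Split every sentence on that pattern and keep the first piece
data_d$first <- vapply(strsplit(data_d$sentence, pattern), `[`, character(1), 1)
data_d$first
# [1] "1. this is " "2. this is " "3. this is "
```

Because every separator is tried against every sentence, the order of listasd does not matter here.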

Upvotes: 1

fishtank

Reputation: 3728

This gives the same output:

sapply(strsplit(data_d$sentence, c('A','B','C')),'[',1)
# [1] "1. this is " "2. this is " "3. this is "

According to ?strsplit, the split argument can be a character vector, which is recycled along x, so each string is split on its own paired separator.

If you try:

sapply(strsplit(data_d$sentence, c('C','B','A')),'[',1)
# [1] "1. this is A sentence" "2. this is "           "3. this is C sentence"

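it still runs without error, but the 1st and 3rd strings come back unsplit because their paired separators ('C' and 'A') do not occur in them. So this only gives the desired result when the separators are in the same order as the rows.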
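The element-wise pairing can be seen more clearly on a toy example. This is a minimal sketch; the vector `x` and the result names `paired` and `swapped` are made up for illustration.

```r
x <- c("a-b", "c_d")

# split is recycled along x: "a-b" is split on "-", "c_d" on "_"
paired <- strsplit(x, c("-", "_"))
# paired[[1]] is c("a", "b"), paired[[2]] is c("c", "d")

# With the separators swapped, neither string contains its paired separator,
# so both come back whole
swapped <- strsplit(x, c("_", "-"))
# swapped[[1]] is "a-b", swapped[[2]] is "c_d"
```

In other words, a vector passed as split means "one separator per string", not "split on any of these".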

Upvotes: 0

alistaire

Reputation: 43334

You can just use gsub. The regex matches from the first capital letter to the end of the string, and that match is replaced with an empty string. If you have other capitals in your sentence, you'll need to adjust it.

data_d$first <- gsub('[A-Z].*$', '', data_d$sentence)

> data_d
  id              sentence       first
1  A 1. this is A sentence 1. this is 
2  B 2. this is B sentence 2. this is 
3  C 3. this is C sentence 3. this is 
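One possible adjustment for sentences with other capitals is to build the character class from the list of separators instead of using [A-Z]. The sketch below assumes every separator is a single character that has no special meaning inside a regex character class (so not `]`, `^`, `-` or a backslash); the vector `sentences`, with its extra fourth sentence, and the name `pattern` are invented for illustration.

```r
listasd <- c('A', 'B', 'C')
sentences <- c('1. this is A sentence',
               '2. this is B sentence',
               '3. this is C sentence',
               '4. This is B sentence')  # contains an extra capital "T"

# Character class made only of the listed separators: "[ABC].*$"
pattern <- paste0("[", paste(listasd, collapse = ""), "].*$")

# sub() is enough, since at most one match per string is replaced
sub(pattern, '', sentences)
# [1] "1. this is " "2. this is " "3. this is " "4. This is "

# For comparison, the generic class cuts the last sentence at the "T"
sub('[A-Z].*$', '', sentences)
# [1] "1. this is " "2. this is " "3. this is " "4. "
```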

Upvotes: 0
