Reputation: 1112

Extract the first part of each string in a data frame in R

I have a data frame M. I would like to extract the first part of each string, that is, the text before the first ":". I tried strsplit, but the result is a large character vector rather than a data frame. Could someone please help with this?

M <- read.table(text=
"1/1:205,54,0:18:0:57 1/1:141,39,0:13:0:42   0/0:0,54,255:18:0:45 1/1:174,48,0:16:0:51 0/0:0,84,255:28:0:75 
 0/0:0,78,255:26:0:99 0/0:0,63,255:21:0:86   0/0:0,45,255:15:0:68 0/0:0,48,255:16:0:71 0/0:0,132,255:44:0:99
 0/0:0,78,255:26:0:89 0/0:0,78,255:26:0:89   0/0:0,36,255:12:0:47 0/0:0,33,255:11:0:44 0/0:0,108,255:36:0:99
 0/0:0,75,255:25:0:99 0/0:0,54,255:18:0:78   0/0:0,69,255:23:0:93 0/0:0,33,255:11:0:57 0/0:0,96,255:32:0:99 
 0/0:0,60,75:21:0:74  0/0:0,51,84:17:0:65    0/0:0,48,64:17:0:62  0/0:0,42,65:15:0:56  0/0:0,84,99:28:0:98 ",
header=FALSE, stringsAsFactors=FALSE)
S <- sapply(strsplit(M, ":"), "[", 1)

Upvotes: 2

Answers (3)

akrun

Reputation: 887971

strsplit may not be the best tool here, as we are only interested in the first substring. Still, assuming the OP wants to understand how strsplit can be used on this example dataset, the OP's code can be modified to use a nested lapply/sapply loop.

 M[] <- lapply(M, function(x) sapply(strsplit(as.character(x), ':'),'[',1))
 M
 #   V1  V2  V3  V4  V5
 #1 1/1 1/1 0/0 1/1 0/0
 #2 0/0 0/0 0/0 0/0 0/0
 #3 0/0 0/0 0/0 0/0 0/0
 #4 0/0 0/0 0/0 0/0 0/0
 #5 0/0 0/0 0/0 0/0 0/0
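
To see why the "[", 1 step works, it may help to look at what strsplit returns for a single column. The following is a minimal sketch on a hypothetical three-element vector x (values borrowed from the example data, but x itself is not part of the original code):

```r
# Hypothetical vector standing in for one column of M
x <- c("1/1:205,54,0:18:0:57", "0/0:0,78,255:26:0:99", "0/0:0,60,75:21:0:74")

# strsplit() returns a list holding one character vector of pieces per input string
pieces <- strsplit(x, ":", fixed = TRUE)
lengths(pieces)
# [1] 5 5 5

# sapply(pieces, "[", 1) evaluates pieces[[i]][1] for each element,
# keeping only the first piece of every string
first <- sapply(pieces, "[", 1)
first
# [1] "1/1" "0/0" "0/0"
```

The outer lapply in the answer simply repeats this per column, and M[] <- keeps the data frame shape.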

Or, as the columns are all similar, we can unlist the data frame, call strsplit once on the resulting vector, and assign the output back to M[] so that the original data frame structure is preserved.

  M[] <- sapply(strsplit(unlist(M), ':'),'[',1)
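
The flattened result drops back into the right cells because unlist() walks the data frame column by column and assignment to M[] refills it in the same order. A small sketch, assuming a hypothetical 2 x 2 data frame d that is not part of the original question:

```r
# Hypothetical 2 x 2 data frame of character columns
d <- data.frame(V1 = c("1/1:a", "0/0:b"),
                V2 = c("0/1:c", "1/1:d"),
                stringsAsFactors = FALSE)

# unlist() concatenates the columns in order: all of V1, then all of V2
flat <- unlist(d, use.names = FALSE)
flat
# [1] "1/1:a" "0/0:b" "0/1:c" "1/1:d"

# Assigning a vector to d[] refills the columns in that same order,
# so each value returns to the cell it came from
d[] <- sapply(strsplit(flat, ":", fixed = TRUE), "[", 1)
d
#    V1  V2
# 1 1/1 0/1
# 2 0/0 1/1
```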

Or a faster option would be to use stri_extract_first from stringi to extract the leading run of characters that are not ":".

  library(stringi)
  M[] <- stri_extract_first(unlist(M), regex='[^:]+')
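
If installing stringi is not an option, the same pattern can be applied in base R. This is a sketch of a swapped-in alternative (regexpr() plus regmatches()), not the stringi call itself, and the vector x is hypothetical. One caveat: regmatches() drops elements that have no match, so the sketch assumes every string contains at least one character other than ":".

```r
# Hypothetical input: two cells from the example data
x <- c("1/1:205,54,0:18:0:57", "0/0:0,78,255:26:0:99")

# regexpr() locates the first run of non-colon characters in each string;
# regmatches() extracts the matched text
first <- regmatches(x, regexpr("[^:]+", x))
first
# [1] "1/1" "0/0"
```

Applied to the data frame, this would presumably take the form v <- unlist(M) followed by M[] <- regmatches(v, regexpr("[^:]+", v)).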

Upvotes: 5

Steven Beaupré

Reputation: 21641

Try:

dplyr::mutate_each(M, dplyr::funs(sub("(.*?)(:.*)", "\\1", .)))

Which gives:

#   V1  V2  V3  V4  V5
#1 1/1 1/1 0/0 1/1 0/0
#2 0/0 0/0 0/0 0/0 0/0
#3 0/0 0/0 0/0 0/0 0/0
#4 0/0 0/0 0/0 0/0 0/0
#5 0/0 0/0 0/0 0/0 0/0
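
The pattern relies on the lazy quantifier in (.*?), which the sketch below contrasts with a greedy one on a single hypothetical string x. Since mutate_each() and funs() are deprecated in current dplyr, the second half shows what an across()-based equivalent might look like; it assumes dplyr 1.0.0 or later and is guarded so the block still runs without it. The data frame d is made up for illustration.

```r
x <- "1/1:205,54,0:18:0:57"  # one cell from the example data

# Lazy (.*?) stops at the first ":", so group 1 is just the leading field
sub("(.*?)(:.*)", "\\1", x)
# [1] "1/1"

# Greedy (.*) runs to the last ":", so only the final field is removed
sub("(.*)(:.*)", "\\1", x)
# [1] "1/1:205,54,0:18:0"

# Assumed modern equivalent of the mutate_each()/funs() call, using across();
# skipped when dplyr (>= 1.0.0) is not installed
if (requireNamespace("dplyr", quietly = TRUE) &&
    utils::packageVersion("dplyr") >= "1.0.0") {
  d <- data.frame(V1 = x, V2 = "0/0:0,54,255:18:0:45", stringsAsFactors = FALSE)
  res <- dplyr::mutate(d, dplyr::across(dplyr::everything(), ~ sub(":.*", "", .x)))
  print(res)
}
```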

Upvotes: 4

Rich Scriven

Reputation: 99391

You can use sub():

M[] <- lapply(M, sub, pattern = ":.*", replacement = "")
M
#    V1  V2  V3  V4  V5
# 1 1/1 1/1 0/0 1/1 0/0
# 2 0/0 0/0 0/0 0/0 0/0
# 3 0/0 0/0 0/0 0/0 0/0
# 4 0/0 0/0 0/0 0/0 0/0
# 5 0/0 0/0 0/0 0/0 0/0

The above will overwrite the original M data. If you do not wish to overwrite M, copy it to a new variable first and modify the copy, or just wrap lapply() in as.data.frame():

as.data.frame(lapply(M, sub, pattern = ":.*", replacement = ""))
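
That the second form leaves the input untouched can be checked directly. A sketch with a hypothetical small data frame d; stringsAsFactors = FALSE is passed explicitly here because as.data.frame() converted strings to factors by default in R versions before 4.0.0:

```r
# Hypothetical 2 x 2 data frame built from values in the example data
d <- data.frame(V1 = c("1/1:205,54,0", "0/0:0,78,255"),
                V2 = c("0/0:0,54,255", "0/0:0,63,255"),
                stringsAsFactors = FALSE)

# Build a new data frame; d itself is not modified
out <- as.data.frame(lapply(d, sub, pattern = ":.*", replacement = ""),
                     stringsAsFactors = FALSE)
out
#    V1  V2
# 1 1/1 0/0
# 2 0/0 0/0

d$V1  # still holds the full strings
# [1] "1/1:205,54,0" "0/0:0,78,255"
```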

Upvotes: 4
