user3354212

Reputation: 1112

Extract the first part of each string in a data frame in R

I have a data frame M. I would like to extract the first part of each string, i.e. everything before the first ":". I tried strsplit, but the result is a large character vector, not a data frame. Could someone please help with this?

M <- read.table(text=
"1/1:205,54,0:18:0:57 1/1:141,39,0:13:0:42   0/0:0,54,255:18:0:45 1/1:174,48,0:16:0:51 0/0:0,84,255:28:0:75 
 0/0:0,78,255:26:0:99 0/0:0,63,255:21:0:86   0/0:0,45,255:15:0:68 0/0:0,48,255:16:0:71 0/0:0,132,255:44:0:99
 0/0:0,78,255:26:0:89 0/0:0,78,255:26:0:89   0/0:0,36,255:12:0:47 0/0:0,33,255:11:0:44 0/0:0,108,255:36:0:99
 0/0:0,75,255:25:0:99 0/0:0,54,255:18:0:78   0/0:0,69,255:23:0:93 0/0:0,33,255:11:0:57 0/0:0,96,255:32:0:99 
 0/0:0,60,75:21:0:74  0/0:0,51,84:17:0:65    0/0:0,48,64:17:0:62  0/0:0,42,65:15:0:56  0/0:0,84,99:28:0:98 ",
header = FALSE, stringsAsFactors = FALSE)
S <- sapply(strsplit(M, ":"), "[", 1)

Upvotes: 2

Views: 2174

Answers (3)

akrun

Reputation: 887128

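strsplit may not be the best tool here, as we are only interested in the leading substring. Note also that strsplit expects a character vector, not a data frame, so it has to be applied column by column. Assuming the OP wants to understand how strsplit can be used on this example dataset, one modification of the OP's code is a nested lapply/sapply loop: lapply iterates over the columns, and sapply picks the first element of each split.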

 M[] <- lapply(M, function(x) sapply(strsplit(as.character(x), ':'),'[',1))
 M
 #   V1  V2  V3  V4  V5
 #1 1/1 1/1 0/0 1/1 0/0
 #2 0/0 0/0 0/0 0/0 0/0
 #3 0/0 0/0 0/0 0/0 0/0
 #4 0/0 0/0 0/0 0/0 0/0
 #5 0/0 0/0 0/0 0/0 0/0

Or, as the columns are all of the same kind, we can unlist the data frame, call strsplit once on the resulting vector, and assign the output back with M[] <- so that the original data frame structure is kept intact.

  M[] <- sapply(strsplit(unlist(M), ':'),'[',1)
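
The reason the unlist approach keeps every value in its original cell is that unlist() flattens a data frame column by column, and assigning a plain vector with [] <- refills it in the same column-wise order. A minimal sketch of this round trip, using a small hypothetical data frame df (not part of the original question) and vapply() in place of sapply() purely as a type-safe variant:

```r
# Hypothetical two-column data frame, used only for illustration
df <- data.frame(a = c("1/1:x", "0/0:y"),
                 b = c("0/1:z", "1/1:w"),
                 stringsAsFactors = FALSE)

# unlist() flattens the data frame column by column into one character vector
flat <- unlist(df, use.names = FALSE)

# Keep only the first piece of each split string
first <- vapply(strsplit(flat, ":", fixed = TRUE), `[`, character(1), 1)

# out[] <- value keeps the data.frame class and refills it column by column
out <- df
out[] <- first
```

Here out should still be a data frame with columns a and b, each holding only the part before the first ":".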

Or a faster option would be to use stri_extract_first from stringi to extract the leading run of characters that are not ":".

  library(stringi)
  M[] <- stri_extract_first(unlist(M), regex='[^:]+')

Upvotes: 5

Steven Beaupré

Reputation: 21621

Try:

dplyr::mutate_each(M, funs(sub("(.*?)(:.*)", "\\1" , .)))

Which gives:

#   V1  V2  V3  V4  V5
#1 1/1 1/1 0/0 1/1 0/0
#2 0/0 0/0 0/0 0/0 0/0
#3 0/0 0/0 0/0 0/0 0/0
#4 0/0 0/0 0/0 0/0 0/0
#5 0/0 0/0 0/0 0/0 0/0
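
In recent dplyr releases mutate_each() and funs() are deprecated in favour of across(). A minimal sketch of an equivalent call, assuming dplyr 1.0.0 or later is installed; the small data frame df is hypothetical and only stands in for M, and the simpler pattern ":.*" is swapped in for the capture-group regex above:

```r
library(dplyr)

# Hypothetical stand-in for M, used only for illustration
df <- data.frame(V1 = c("1/1:205,54,0:18:0:57", "0/0:0,78,255:26:0:99"),
                 V2 = c("1/1:141,39,0:13:0:42", "0/0:0,63,255:21:0:86"),
                 stringsAsFactors = FALSE)

# Apply sub() to every column, dropping the first ":" and everything after it
res <- mutate(df, across(everything(), ~ sub(":.*", "", .x)))
```

Like the original call, this returns a new data frame and leaves df unchanged.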

Upvotes: 4

Rich Scriven

Reputation: 99331

You can use sub() to remove everything from the first ":" onward:

M[] <- lapply(M, sub, pattern = ":.*", replacement = "")
M
#    V1  V2  V3  V4  V5
# 1 1/1 1/1 0/0 1/1 0/0
# 2 0/0 0/0 0/0 0/0 0/0
# 3 0/0 0/0 0/0 0/0 0/0
# 4 0/0 0/0 0/0 0/0 0/0
# 5 0/0 0/0 0/0 0/0 0/0

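The above will overwrite the original M data. If you do not wish to overwrite M, copy it to a new variable first, or just wrap lapply() in as.data.frame(). Passing stringsAsFactors = FALSE keeps the columns as character on R versions before 4.0, where the default was TRUE:

as.data.frame(lapply(M, sub, pattern = ":.*", replacement = ""), stringsAsFactors = FALSE)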
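
The "new variable" route mentioned above can be sketched as follows. The data frame df and the name df_first are hypothetical and only stand in for M and its cleaned copy:

```r
# Hypothetical stand-in for M, used only for illustration
df <- data.frame(V1 = c("1/1:205,54,0:18:0:57", "0/0:0,78,255:26:0:99"),
                 V2 = c("1/1:141,39,0:13:0:42", "0/0:0,63,255:21:0:86"),
                 stringsAsFactors = FALSE)

# Copy first, then overwrite only the copy; df itself is left untouched
df_first <- df
df_first[] <- lapply(df, sub, pattern = ":.*", replacement = "")
```

Because the assignment goes through df_first[], the copy keeps its data frame class and column names, while df still holds the full strings.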

Upvotes: 4
