Reputation: 49

Subset character vector by pattern

I have a character vector made up of filenames like:

vector <- c("LR1_0001_a", "LR1_0002_b", "LR02_0001_b", "LR02_0002_x", "LR3_001_c")

My goal is to subset this vector by the leading characters of each element, up to the first "_". The prefix length varies, so the match has to be dynamic. The output would look something like this:

solution1 <- c("LR1_0001_a", "LR1_0002_b")
solution2 <- c("LR02_0001_b", "LR02_0002_x")
solution3 <- c("LR3_001_c")

I have experimented with combinations of unique and grep but have not had any luck so far.

Upvotes: 0

Answers (3)

akrun

Reputation: 887213

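We can use trimws (its whitespace argument needs R 3.6.0 or later) to strip everything from the first "_" onward, and split on the result

out <- split(vector, trimws(vector, whitespace = "_.*"))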

and then use list2env

list2env(out, .GlobalEnv)

Upvotes: 0

hello_friend

Reputation: 5788

Base R solution (coerce vector to data.frame):

# Split the vector into a list keyed by prefix (as in Ronak's answer):

vect_list <- split(vect, sub("_.*", "", vect))

# Pad each vector in the list with NA to the length of the longest vector:

max_len <- max(lengths(vect_list))
padded_vect_list <- lapply(vect_list, function(x) {
  length(x) <- max_len
  x
})

# Coerce the list of vectors into a data frame, one column per prefix:

df <- data.frame(do.call("cbind", padded_vect_list))

Data:

vect <- c("LR1_0001_a", "LR1_0002_b", "LR02_0001_b", "LR02_0002_x", "LR3_001_c")
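
The padding step relies on the fact that assigning a larger length to a vector fills the new positions with NA. A minimal sketch of that behaviour on its own, using a made-up vector x that is not part of the answer:

```r
# Hypothetical two-element vector, for illustration only
x <- c("a", "b")

# Growing the length pads the tail with NA
length(x) <- 4
x
```

This is why the shorter groups end up with NA in the trailing rows of the data frame.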

Upvotes: 0

Ronak Shah

Reputation: 389047

We can use sub to remove everything from the first underscore "_" onward and split the vector on what remains.

output <- split(vector, sub('_.*', '', vector))
output

#$LR02
#[1] "LR02_0001_b" "LR02_0002_x"

#$LR1
#[1] "LR1_0001_a" "LR1_0002_b"

#$LR3
#[1] "LR3_001_c"

This returns a list of vectors, which is usually a better way to manage data than creating a number of objects in the global environment. However, if you want them as separate vectors, you can use list2env.

list2env(output, .GlobalEnv)

This will create vectors named LR02, LR1 and LR3 respectively.
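
If you keep the list instead, each group can still be reached by its prefix. A short sketch of that, assuming the same data as in the question:

```r
# Same data as in the question
vector <- c("LR1_0001_a", "LR1_0002_b", "LR02_0001_b", "LR02_0002_x", "LR3_001_c")

# Group by everything before the first underscore
output <- split(vector, sub("_.*", "", vector))

# Look up one group by its prefix
output[["LR1"]]

# Count the files in each group
lengths(output)
```

Looking groups up by name like this avoids having to know in advance which prefixes occur in the data.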

Upvotes: 3
