Reputation: 125
I have found answers for this for other languages but not R so I apologize if this is a duplicate.
I am using the separate()
function in R and am curious if it is possible to split a column into two columns whenever a certain character appears. For example:
data
ID Value
0011ER00 91346
0011ER46 59961
0011ER4610 53968
0011ER4620 55083
0011ER4630 51112
0011ER4640 67480
0011ER4650 45612
0011ER4660 53968
And I would like to split the ID column so that the data looks like this:
data
Code Region Value
0011 ER00 91346
0011 ER46 59961
0011 ER4610 53968
0011 ER4620 55083
0011 ER4630 51112
0011 ER4640 67480
0011 ER4650 45612
0011 ER4660 53968
I tried the code:
data %>%
separate(ID, into = c("Code", "Region"), sep = "E")
but then I am not able to preserve the E in the result and get:
Code Region Value
0011 ER00 91346
0011 R46 59961
0011 R4610 53968
0011 R4620 55083
0011 R4630 51112
0011 R4640 67480
0011 R4650 45612
0011 R4660 53968
Upvotes: 1
Views: 64
Reputation: 18681
We can use extract
from tidyr
. The regex
argument matches the capture groups (in parentheses) and separate them into the variables specified in the into
argument. Note that the number of capture groups must equal the number of into
variable names:
library(dplyr)
library(tidyr)
data %>%
extract(ID, c("Code", "Region"), regex = "(^\\d+)([[:alpha:]]+\\d+)")
Output:
Code Region Value
1 0011 ER00 91346
2 0011 ER46 59961
3 0011 ER4610 53968
4 0011 ER4620 55083
5 0011 ER4630 51112
6 0011 ER4640 67480
7 0011 ER4650 45612
8 0011 ER4660 53968
Data:
data <- structure(list(ID = structure(1:8, .Label = c("0011ER00", "0011ER46",
"0011ER4610", "0011ER4620", "0011ER4630", "0011ER4640", "0011ER4650",
"0011ER4660"), class = "factor"), Value = c(91346L, 59961L, 53968L,
55083L, 51112L, 67480L, 45612L, 53968L)), class = "data.frame", row.names = c(NA,
-8L))
extract
vs separate
:In contrast to using separate
(in @akrun's answer), extract
matches specifically the contents of your output variables, while separate
matches the splitting character (or a zero-length position in the case of lookarounds). Both are valid and have their pros and cons.
Depending on your use case, separate
is generally more useful if you want to split your variable into multiple columns based on some delimiter, without needing to know what the contents of your output variables should be.
extract
is more useful when you have to be sure that you are grabbing the correct characters from your input for each output variable.
In OP's case it doesn't really matter since he/she is only splitting the input into two variables with one "delimiter" and the concatenation of the output exactly matches the input.
Upvotes: 1
Reputation: 887128
An option would be to a regex lookaround to split at the junction between a digit and character. Here, we are also making it more generalizable by splitting at the junction of any digit followed by letter 'E'
library(dplyr)
library(tidyr)
data %>%
separate(ID, into = c("Code", "Region"), sep = "(?<=[0-9])(?=E)")
# Code Region Value
#1 0011 ER00 91346
#2 0011 ER46 59961
#3 0011 ER4610 53968
#4 0011 ER4620 55083
#5 0011 ER4630 51112
#6 0011 ER4640 67480
#7 0011 ER4650 45612
#8 0011 ER4660 53968
Or another option is to insert a delimiter before the ,
with str_replace
and then do the separate
data %>%
mutate(ID = str_replace(ID, "E", ",E")) %>%
separate(ID, into = c("Code", "Region"))
data <- structure(list(ID = c("0011ER00", "0011ER46", "0011ER4610", "0011ER4620",
"0011ER4630", "0011ER4640", "0011ER4650", "0011ER4660"), Value = c(91346L,
59961L, 53968L, 55083L, 51112L, 67480L, 45612L, 53968L)),
class = "data.frame", row.names = c(NA,
-8L))
Upvotes: 3