quantumofnolace
quantumofnolace

Reputation: 125

Splitting character string at a particular character and preserving that character

I have found answers for this for other languages but not R so I apologize if this is a duplicate.

I am using the separate() function in R and am curious if it is possible to split a column into two columns whenever a certain character appears. For example:

data

    ID          Value
0011ER00      91346
0011ER46      59961
0011ER4610    53968
0011ER4620    55083
0011ER4630    51112
0011ER4640    67480
0011ER4650    45612
0011ER4660    53968

And I would like to split the ID column so that the data looks like this:

data

Code    Region  Value
0011    ER00    91346
0011    ER46    59961
0011    ER4610  53968
0011    ER4620  55083
0011    ER4630  51112
0011    ER4640  67480
0011    ER4650  45612
0011    ER4660  53968

I tried the code:

data %>% 
  separate(ID, into = c("Code", "Region"), sep = "E")

but then I am not able to preserve the E in the result and get:

Code    Region  Value
0011    ER00    91346
0011    R46     59961
0011    R4610   53968
0011    R4620   55083
0011    R4630   51112
0011    R4640   67480
0011    R4650   45612
0011    R4660   53968

Upvotes: 1

Views: 64

Answers (2)

acylam
acylam

Reputation: 18681

We can use extract from tidyr. The regex argument matches the capture groups (in parentheses) and separate them into the variables specified in the into argument. Note that the number of capture groups must equal the number of into variable names:

library(dplyr)
library(tidyr)

data %>%
  extract(ID, c("Code", "Region"), regex = "(^\\d+)([[:alpha:]]+\\d+)")

Output:

  Code Region Value
1 0011   ER00 91346
2 0011   ER46 59961
3 0011 ER4610 53968
4 0011 ER4620 55083
5 0011 ER4630 51112
6 0011 ER4640 67480
7 0011 ER4650 45612
8 0011 ER4660 53968

Data:

data <- structure(list(ID = structure(1:8, .Label = c("0011ER00", "0011ER46", 
"0011ER4610", "0011ER4620", "0011ER4630", "0011ER4640", "0011ER4650", 
"0011ER4660"), class = "factor"), Value = c(91346L, 59961L, 53968L, 
55083L, 51112L, 67480L, 45612L, 53968L)), class = "data.frame", row.names = c(NA, 
-8L))

extract vs separate:

In contrast to using separate (in @akrun's answer), extract matches specifically the contents of your output variables, while separate matches the splitting character (or a zero-length position in the case of lookarounds). Both are valid and have their pros and cons.

Depending on your use case, separate is generally more useful if you want to split your variable into multiple columns based on some delimiter, without needing to know what the contents of your output variables should be.

extract is more useful when you have to be sure that you are grabbing the correct characters from your input for each output variable.

In OP's case it doesn't really matter since he/she is only splitting the input into two variables with one "delimiter" and the concatenation of the output exactly matches the input.

Upvotes: 1

akrun
akrun

Reputation: 887128

An option would be to a regex lookaround to split at the junction between a digit and character. Here, we are also making it more generalizable by splitting at the junction of any digit followed by letter 'E'

library(dplyr)
library(tidyr)
data %>% 
   separate(ID, into = c("Code", "Region"), sep = "(?<=[0-9])(?=E)")
#  Code Region Value
#1 0011   ER00 91346
#2 0011   ER46 59961
#3 0011 ER4610 53968
#4 0011 ER4620 55083
#5 0011 ER4630 51112
#6 0011 ER4640 67480
#7 0011 ER4650 45612
#8 0011 ER4660 53968

Or another option is to insert a delimiter before the , with str_replace and then do the separate

data %>%
   mutate(ID = str_replace(ID, "E", ",E")) %>% 
   separate(ID, into = c("Code", "Region"))

data

data <- structure(list(ID = c("0011ER00", "0011ER46", "0011ER4610", "0011ER4620", 
"0011ER4630", "0011ER4640", "0011ER4650", "0011ER4660"), Value = c(91346L, 
59961L, 53968L, 55083L, 51112L, 67480L, 45612L, 53968L)), 
 class = "data.frame", row.names = c(NA, 
-8L))

Upvotes: 3

Related Questions