Reputation: 23
I want to separate one column of data that contains a code and a place name into two separate columns. I tried using tidyr
's separate command, but ran into some difficulties with the regex (I haven't used them before and am not able to figure out what I did wrong/how regex really works).
The data are pretty consistent in terms of format. Most observations start with a code and are followed by a location. Occasionally there is an observation that is just a location (no code). Here is a sample of the data:
df <- read.table(text = c("
obs name
1 01-220 location 1
2 05-23 town 3
3 District 2"), header = T)
I use the following code:
df <- df %>% separate(name, into = c("location_code", "location_name"), sep = "([0-9] [A-z])")
Resulting in (notice that the last number from location_code and first letter from location_name are missing):
obs location_code location_name
1 01-22 ocation 1
2 05-2 own 3
3 District 2 NA
My desired output is:
# obs location_code location_name
# 1 01-220 location 1
# 2 05-23 town 3
# 3 NA District 2
Thanks in advance!
Upvotes: 1
Views: 68
Reputation: 887501
We can specify the sep
with a regex lookaround.
separate(df, name, into = c("location_code", "location_time"),
"(?<=([0-9] )|\\b)(?=[A-Za-z])")
# obs location_code location_time
#1 1 01-220 location 1
#2 2 05-23 town 3
#3 3 District 2
Or with extract
extract(df, name, into = c("location_code", "location_time"), "([0-9-]*)\\s*(.*)")
# obs location_code location_time
#1 1 01-220 location 1
#2 2 05-23 town 3
#3 3 District 2
df <- structure(list(obs = 1:3, name = c("01-220 location 1", "05-23 town 3",
"District 2")), .Names = c("obs", "name"), class = "data.frame", row.names = c(NA,
-3L))
Upvotes: 4