Reputation: 51
##id## ##initiativen##
1 abc 2a 2 cde 2b 3 efd a 4 geh c 5 jytd 5v 6 jydjytd e
Hello, I have something similar to this, just a lot bigger and I was wondering which is the most efficient way to divide the column initiativen into two columns, one containing the numbers (2,2,5,4) and one containing the letters or the blank space. it has to be a general formula as the data frame I need to apply it too is quite big. The letters correspond to a particular initiative number but the first initiative number is not indicated and "a" correspond to initiative number 2.
I would love it to look like something like that with the letters substituted by numbers (blank=1, a=2, b=3 etc..)
id initiativen question
abc 2 2
cde 3 2
efd 2 N/A
geh 4 N/A
jytd 23 5
jydjytd 6 N/A
bfdhslbf 1 3
I have tried to use "separate" but it doesn't really work and doesn't solve the problem of the first initiative having no corresponding letter. Any help or suggestion would be extremely welcomed and helpful.
Thank you so much:)
Upvotes: 1
Views: 64
Reputation: 1488
For the second column you can use regular expression to only keep numeric values:
df$initiativen <- gsub("[^0-9]", "", df$initiativen)
Upvotes: 0
Reputation: 33498
Using data.table
# Step one
setDT(df)
df[, ":="(
question = gsub("[a-z]", "", initiativen),
initiativen = match(gsub("[0-9]", "", initiativen), letters, nomatch = 0) + 1L
)
]
df
id initiativen question
1: abc 2 2
2: cde 3 2
3: efd 2
4: geh 4
5: jytd 23 5
6: jydjytd 6
7: vbdjfkb 1 4
# Then some tidying
df[, question := ifelse(nzchar(question), question, NA)]
df
id initiativen question
1: abc 2 2
2: cde 3 2
3: efd 2 <NA>
4: geh 4 <NA>
5: jytd 23 5
6: jydjytd 6 <NA>
7: vbdjfkb 1 4
Data
df <- data.frame(
id = c("abc", "cde", "efd", "geh", "jytd", "jydjytd", "vbdjfkb"),
initiativen = c("2a", "2b", "a", "c", "5v", "e", "4"),
stringsAsFactors = FALSE
)
Edit
Can also be done in one step:
df[, question := gsub("[a-z]", "", initiativen)
][, ":="(
question = ifelse(nzchar(question), question, NA),
initiativen = match(gsub("[0-9]", "", initiativen), letters, nomatch = 0) + 1L
)
]
Upvotes: 0
Reputation: 50668
How about the following tidyverse
solution?
library(tidyverse);
df %>%
separate(initiativen, into = c("p1", "p2"), sep = "(?<=[0-9])(?=[a-z])") %>%
mutate(
initiativen = case_when(
str_detect(p1, "[a-z]") ~ p1,
str_detect(p2, "[a-z]") ~ p2),
question = case_when(
str_detect(p1, "[0-9]") ~ p1,
str_detect(p2, "[0-9]") ~ p2)) %>%
mutate(initiativen = ifelse(is.na(initiativen), 1, match(initiativen, letters) + 1)) %>%
select(-p1, -p2)
# id initiativen question
#1 abc 2 2
#2 cde 3 2
#3 efd 2 <NA>
#4 geh 4 <NA>
#5 jytd 23 5
#6 jydjytd 6 <NA>
#7 vbdjfkb 1 4
Note that the warning can be safely ignored as it stems from the missing fields when separate
ing.
Explanation: We use a positive look-behind and look-ahead to split entries in initiativen
into two parts p1
and p2
; we then fill initiativen
and question
with entries from p1
or p2
depending on whether they contain a number "[0-9]"
or a character "[a-z]"
; convert characters to numbers with match(initiativen, letters)
and finally clean the data.frame
.
df <- read.table(text =
" id initiativen
1 abc 2a
2 cde 2b
3 efd a
4 geh c
5 jytd 5v
6 jydjytd e
7 vbdjfkb 4", row.names = 1)
Upvotes: 1