user3029412
user3029412

Reputation: 119

Creating new numeric variable in R based on specific strings from character variable

I have a character variable in a data frame, and would like to create a new numeric variable in that data frame based on the existence of specific strings in that character variable.

I've figured out one way to do this, but was wondering if there was a quicker, or more elegant approach than mine here.

Here is my solution, using ifelse and substr:

First, the values of each observation of my character variable is a random series of numbers and letters (from 0-9, and A-Z). I will create an example variable below:

library(stringi)
set.seed(100)
my.df<-data.frame("V1"=sprintf("%s%s%s", stri_rand_strings(10, 5, c('[A-B]','[0-2]')),
    stri_rand_strings(10, 4, c('[0-9]','[A-J]')), stri_rand_strings(10, 1, '[A-Z]')))

In my actual data, as I noted above, the first element of each observation can be any of the number 0-9, OR letters A-Z.

Now, what I've done is create a new variable [value] that takes on specific values (no pattern) depending upon what the first element of each observation in the V1 variable is. So, if the first element is the letter "A", the value of corresponding observation (row) in my new variable--V2--is 3, and if the first element of the V1 variable is "B" the value of the corresponding observation in my new variable V2 is 12, etc.

Here is how I have chosen to do this. It's unwieldy as my real data would need ifelse clauses for all numerals 0-9 and all letters A-Z, which would be about 35 lines or so. Here is my code:

my.df$value<-ifelse(substr(my.df$V1,1,1)=="A",3,
                    ifelse(substr(my.df$V1,1,1)=="B",12,
                           ifelse(substr(my.df$V1,1,1)=="0",44,
                                           ifelse(substr(my.df$V1,1,1)=="1",6,27))))

This yields:

        V1 value
'AABAA3122X'     1
'12110FCBCF'     6
'BBAAB5246J'    12
'20112JGEDL'    27
'BBBBA4426X'    12
'02210EDFJK'    44
'ABABB6687N'     1
'20120IAEFD'    27
'ABBBB9905A'     1
'12200HCDHU'     6

Is there a way to do this without using so many lines of code?

Thanks!

Upvotes: 1

Views: 1259

Answers (2)

jared_mamrot
jared_mamrot

Reputation: 26515

Two potential approaches based on: Canonical tidyverse method to update some values of a vector from a look-up table

library(tidyverse)
library(stringi)
library(data.table)

set.seed(100)
my.df <- data.frame("V1" = sprintf("%s%s%s",
                                   stri_rand_strings(10, 5, c('[A-B]','[0-2]')),
                                   stri_rand_strings(10, 4, c('[0-9]','[A-J]')),
                                   stri_rand_strings(10, 1, '[A-Z]')))

df <- my.df %>%
  mutate(Value = substr(V1, 1, 1))

unique(df$Value)
#> [1] "A" "1" "B" "2" "0"
lookup <- data.frame(old = c("A", 1, "B", 2, 0),
                     new = c(3, 6, 12, 27, 44))

for (i in seq_len(nrow(lookup))) {
    df$Value[df$Value == lookup$old[i]] = lookup$new[i]
}
df
#>            V1 Value
#> 1  AABAA3122X     3
#> 2  12110FCBCF     6
#> 3  BBAAB5246J    12
#> 4  20112JGEDL    27
#> 5  BBBBA4426X    12
#> 6  02210EDFJK    44
#> 7  ABABB6687N     3
#> 8  20120IAEFD    27
#> 9  ABBBB9905A     3
#> 10 12200HCDHU     6


# data.table method (fastest but doesn't retain original order of V1)
df <- my.df %>%
  mutate(Value = substr(V1, 1, 1))

setDT(df)
setDT(lookup)
setkey(df, Value)
setkey(lookup, old)

df[lookup, Value:=new, on=.(Value=old)]
df
#>             V1 Value
#>  1: 02210EDFJK    44
#>  2: 12110FCBCF     6
#>  3: 12200HCDHU     6
#>  4: 20112JGEDL    27
#>  5: 20120IAEFD    27
#>  6: AABAA3122X     3
#>  7: ABABB6687N     3
#>  8: ABBBB9905A     3
#>  9: BBAAB5246J    12
#> 10: BBBBA4426X    12

Created on 2021-10-06 by the reprex package (v2.0.1)

Upvotes: 1

akrun
akrun

Reputation: 887118

Consider either a join with key/val data or a named vector

my.df$value <- with(my.df, setNames(c(3, 12, 44, 6),
          c("A", "B", "0", "1"))[substr(V1, 1, 1)])
my.df$value[is.na(my.df$value)] <- "27"

-output

> my.df
           V1 value
1  AABAA3122X     3
2  12110FCBCF     6
3  BBAAB5246J    12
4  20112JGEDL    27
5  BBBBA4426X    12
6  02210EDFJK    44
7  ABABB6687N     3
8  20120IAEFD    27
9  ABBBB9905A     3
10 12200HCDHU     6

Upvotes: 2

Related Questions