Creating ID numbers based on two string columns

Question

I have a data frame structured as follows:

df <- structure(list(NAME1    =  c("AAA","CCC","BBB","BBB"), 
                 NAME2        =  c("BBB", "AAA","DDD","AAA"),
                 AMT          =  c(10, 20, 30, 50)), .Names=c("NAME1","NAME2", "AMT"), 
            row.names         = c("1", "2", "3", "4"), class =("data.frame"))

I would like to create two ID variables (ID1 and ID2) based on two string columns, namely NAME1 and NAME2. The two columns may share values so the IDs have to be consistent. The desired data frame should look like this:

df <- structure(list(NAME1    =  c("AAA","CCC", "BBB", "BBB"), 
                 NAME2        =  c("BBB", "AAA","DDD", "AAA"),
                 ID1          =  c(1,3,2,2),
                 ID2          =  c(2,1,4,1),
                 AMT          =  c(10,20,30,50)),
            .Names            =  c("NAME1","NAME2","ID1","ID2"), 
            row.names         =  c("1", "2", "3", "4"), class =("data.frame"))

Your suggestions would be appreciated.

Cheers.

steveb · Accepted Answer

The following answer would work and could be modified to allow IDs to be assigned within more than data frame (e.g. df1, df2); this assumes the variable afact below is created with all of the desired factor levels. This also uses dplyr to create the new columns.

library(dplyr)

adf <- structure(list(NAME1    =  c("AAA","CCC","BBB","BBB"), 
                 NAME2        =  c("BBB", "AAA","DDD","AAA"),
                 AMT          =  c(10, 20, 30, 50)), .Names=c("NAME1","NAME2", "AMT"), 
            row.names         = c("1", "2", "3", "4"), class =("data.frame"))

## Create factor based on all unique values.
## include all variables (e.g. NAME1) needed in factor.
afact      <- as.factor(unique(sort(c(adf$NAME1, adf$NAME2))))
## Factor level to numeric value.
num.lookup <- function(x) { as.numeric(afact[afact == x])}

# Create the new ID columns using the factor 'afact' and 'num.lookup'
# to assign numeric values consistant across columns.
adf %>%
  mutate(ID1 = sapply(NAME1, num.lookup),
         ID2 = sapply(NAME2, num.lookup))

#   NAME1 NAME2 AMT ID1 ID2
# 1   AAA   BBB  10   1   2
# 2   CCC   AAA  20   3   1
# 3   BBB   DDD  30   2   4
# 4   BBB   AAA  50   2   1

Creating ID numbers based on two string columns

Answers (2)

Related Questions