korone
korone

Reputation: 125

Creating ID numbers based on two string columns

I have a data frame structured as follows:

df <- structure(list(NAME1    =  c("AAA","CCC","BBB","BBB"), 
                 NAME2        =  c("BBB", "AAA","DDD","AAA"),
                 AMT          =  c(10, 20, 30, 50)), .Names=c("NAME1","NAME2", "AMT"), 
            row.names         = c("1", "2", "3", "4"), class =("data.frame"))

I would like to create two ID variables (ID1 and ID2) based on two string columns, namely NAME1 and NAME2. The two columns may share values so the IDs have to be consistent. The desired data frame should look like this:

df <- structure(list(NAME1    =  c("AAA","CCC", "BBB", "BBB"), 
                 NAME2        =  c("BBB", "AAA","DDD", "AAA"),
                 ID1          =  c(1,3,2,2),
                 ID2          =  c(2,1,4,1),
                 AMT          =  c(10,20,30,50)),
            .Names            =  c("NAME1","NAME2","ID1","ID2"), 
            row.names         =  c("1", "2", "3", "4"), class =("data.frame"))

Your suggestions would be appreciated.

Cheers.

Upvotes: 1

Views: 253

Answers (2)

steveb
steveb

Reputation: 5532

The following answer would work and could be modified to allow IDs to be assigned within more than data frame (e.g. df1, df2); this assumes the variable afact below is created with all of the desired factor levels. This also uses dplyr to create the new columns.

library(dplyr)

adf <- structure(list(NAME1    =  c("AAA","CCC","BBB","BBB"), 
                 NAME2        =  c("BBB", "AAA","DDD","AAA"),
                 AMT          =  c(10, 20, 30, 50)), .Names=c("NAME1","NAME2", "AMT"), 
            row.names         = c("1", "2", "3", "4"), class =("data.frame"))

## Create factor based on all unique values.
## include all variables (e.g. NAME1) needed in factor.
afact      <- as.factor(unique(sort(c(adf$NAME1, adf$NAME2))))
## Factor level to numeric value.
num.lookup <- function(x) { as.numeric(afact[afact == x])}

# Create the new ID columns using the factor 'afact' and 'num.lookup'
# to assign numeric values consistant across columns.
adf %>%
  mutate(ID1 = sapply(NAME1, num.lookup),
         ID2 = sapply(NAME2, num.lookup))

#   NAME1 NAME2 AMT ID1 ID2
# 1   AAA   BBB  10   1   2
# 2   CCC   AAA  20   3   1
# 3   BBB   DDD  30   2   4
# 4   BBB   AAA  50   2   1

Upvotes: 0

Evan Friedland
Evan Friedland

Reputation: 3194

You should create a vector of both and turn it into a factor and then into numeric. Then you can properly subset it with the number of rows in the df and place them back:

newIDs <- as.numeric(as.factor(c(df$NAME1, df$NAME2)))

df$ID1 <- newIDs[1:nrow(df)]
df$ID2 <- newIDs[-c(1:nrow(df))]

Upvotes: 1

Related Questions