JerryN
JerryN

Reputation: 2506

r data.table how to do a lookup from a different data.table

The code below does what I want for a simple table. The mapping that takes place in the statement with on works perfectly. But I also have the situation with multiple countries that need to be assigned potentially to multiple regions and the result stored in the regions column is more challenging

library(data.table)
testDT <- data.table(country = c("Algeria", "Egypt", "United States", "Brazil"))
testDTcomplicated <- data.table(country = c("Algeria, Ghana, Sri Lanka", "Egypt", "United States, Argentina", "Brazil"))
regionLookup <- data.table(countrylookup = c("Algeria", "Argentina", "Egypt", "United States", "Brazil", "Ghana", "Sri Lanka"), regionVal = c("Africa", "South America", "Africa", "North America", "South America", "Africa", "Asia"))
testDT[regionLookup, region := regionVal, on = c(country = "countrylookup")]
> testDT
         country        region
1:       Algeria        Africa
2:         Egypt        Africa
3: United States North America
4:        Brazil South America

I'd like to have testDTcomplicated look like the following

> testDT
             country                          region
    1:       Algeria, Ghana, Sri Lanka        Africa, Africa, Asia
    2:       Egypt                            Africa
    3: United States, Argentina, Brazil       North America, South America, South America
    4:        Brazil                          South America

Upvotes: 1

Views: 66

Answers (1)

Ronak Shah
Ronak Shah

Reputation: 388817

You could split the data on comma and get each country in a separate row, join the data with regionLookup and collapse them again in one value in a comma-separated string.

library(data.table)

testDTcomplicated[, row := seq_len(.N)]
new <- splitstackshape::cSplit(testDTcomplicated, 'country', ',', 
        direction = 'long')[regionLookup, region := regionVal, 
         on = c(country = "countrylookup")]
new <- new[, lapply(.SD, toString), row][,row:=NULL]

new
#                     country                       region
#1: Algeria, Ghana, Sri Lanka         Africa, Africa, Asia
#2:                     Egypt                       Africa
#3:  United States, Argentina North America, South America
#4:                    Brazil                South America

Same logic in dplyr can be implemented as :

library(dplyr)

testDTcomplicated %>%
   mutate(row = row_number()) %>%
   tidyr::separate_rows(country, sep = ", ") %>%
   left_join(regionLookup, by =  c("country" = "countrylookup")) %>%
   group_by(row) %>%
   summarise(across(.fns = toString))

Upvotes: 3

Related Questions