Chris
Chris

Reputation: 313

Combining data.table columns that include NAs

I have a set of five columns in a data table.

dt <- data.table(
  city = c(rep(1,2), rep(2,2), rep(3,2), rep(4,2)),
  neighborhoods.1 = c(NA, "a", "b", "c", NA, NA, "d", "e"),
  neighborhoods.2 = c(NA, "f", "g", rep(NA,5)),
  neighborhoods.3 = c(NA, "h", rep(NA, 6)),
  irrelevantdata = c(1:8)
)

   city neighborhoods.1 neighborhoods.2 neighborhoods.3 irrelevantdata
1:    1              NA              NA              NA              1
2:    1               a               f               h              2
3:    2               b               g              NA              3
4:    2               c              NA              NA              4
5:    3              NA              NA              NA              5
6:    3              NA              NA              NA              6
7:    4               d              NA              NA              7
8:    4               e              NA              NA              8

I want to combine the first four columns into a single column.

   neighborhood
1:    1
2:    1-a-f-h
3:    2-b-g
4:    2-c
5:    3
6:    3
7:    4-d
8:    4-e

As you can see, I'm removing NA records and separating with a -.

I've tried this, which has obvious problems in processing j:

business[
    , 
    neighborhood = paste0(
      city,
      if(!is.na(neighborhoods.1)) paste0("-", neighborhoods.1),
      if(!is.na(neighborhoods.2)) paste0("-", neighborhoods.2),
      if(!is.na(neighborhoods.3)) paste0("-", neighborhoods.3),       
      ""
    )
]

How can I get this done?


Updated to reflect that there are additional columns I don't want to combine.

Upvotes: 4

Views: 297

Answers (1)

akrun
akrun

Reputation: 887118

One option would be to paste the elements in rows together using do.call, and then remove the NA elements along with the extra - in the output vector.

dt[,.(neighborhood = gsub('-NA|NA-', '', 
   do.call(paste, c(.SD, sep='-')))), .SDcols= city:neighborhoods.3]

Or another option is group by sequence of rows, unlist the Subset of Data.table (.SD), remove the NA elements (na.omit), paste the elements together. We can specify the columns to be used for this operation in .SDcols.

dt[, .(neighbourhood = paste(na.omit(unlist(.SD)),collapse='-')) , 
              by=1:nrow(dt), .SDcols= city:neighborhoods.3]

Or another option suggested by @Frank is to melt the subset of dataset (specified by the columns that are needed) to long format and then paste

 mycols <- setdiff(names(dt), 'irrelevantdata')
 na.omit(melt(dt[, ..mycols][, r := .I], 
      id.var="r"))[, paste(value, collapse="-"), by=r]

Upvotes: 5

Related Questions