This has probably been answered already but I honestly can't find it if it has.
I have tens of millions of observations of cost, along with a patient ID, gender, region, and age category.
I need the average cost per person, with one cost entry for each unique ID number, but sometimes the same individual pops up in more than one age category or region. I need a piece of code that assigns the most representative age category and region to all of the data associated with each ID.
As an example, ID user 1 has a birthday during the year and moves from age category 15-24 to 25-34, but I'd like to assign 15-24 to that user because it best describes him. Likewise, ID user 3 pops up in regions A, D and E, but I'd like to assign E to him because it's the most representative region for that user.
If someone could suggest code to assign the most common values in a user's rows to all of those rows, I'd be grateful. I can take it from there myself.
Thanks so much
Here's a table of what I have, and then a table of what I want. Notice that individual 1 was also entered as female for the final observation, so three changes were required in this sample. This happens numerous times in the data set.
What I have:

ID  Cost  Sex  Age Cat  Region
 1  5.68   M   15-24    A
 1  5.63   M   15-24    A
 1  5.87   M   15-24    A
 1  6.32   M   25-34    A
 1  6.45   F   25-34    B
 2  2.34   F   65-74    C
 2  3.52   F   65-74    C
 2  9.85   F   65-74    C
 3  7.52   M   35-44    A
 3  6.52   M   35-44    D
 3  5.32   M   35-44    E
 3  2.12   M   35-44    E
What I want:

ID  Cost  Sex  Age Cat  Region
 1  5.68   M   15-24    A
 1  5.63   M   15-24    A
 1  5.87   M   15-24    A
 1  6.32   M   15-24    A
 1  6.45   M   15-24    A
 2  2.34   F   65-74    C
 2  3.52   F   65-74    C
 2  9.85   F   65-74    C
 3  7.52   M   35-44    E
 3  6.52   M   35-44    E
 3  5.32   M   35-44    E
 3  2.12   M   35-44    E
I think the general answer is that you want the split-apply-combine pattern, using a function like aggregate or tapply. Here's a solution using tapply:
df <- data.frame(id = c(1, 1, 1, 2, 2, 3),
                 f = factor(c('foo', 'foo', 'bar', 'foo', 'bar', 'bar')))
df
#>   id   f
#> 1  1 foo
#> 2  1 foo
#> 3  1 bar
#> 4  2 foo
#> 5  2 bar
#> 6  3 bar
# initialize a data.frame with one record per id
df2 <- data.frame(id = unique(df$id))
# identify the most frequent factor level for each person
tmp <- tapply(df$f,  # the variable to be summarized
              df$id, # the variable on which to group the summaries
              function(x) {
                # tabulate the values of x
                tbl <- table(x)
                # return (the first of) the most frequent value(s)
                names(tbl)[which(tbl == max(tbl))[1]]
              })
df2$f <- tmp[match(df2$id, names(tmp))]
df2
#> id f
#> 1 1 foo
#> 2 2 bar
#> 3 3 bar
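Since the question ultimately wants the most common value written back onto all of a user's rows, here is a minimal follow-up sketch reusing df and df2 from above:

# overwrite f on every original row with that id's most frequent value
df$f <- df2$f[match(df$id, df2$id)]
df
#>   id   f
#> 1  1 foo
#> 2  1 foo
#> 3  1 foo
#> 4  2 bar
#> 5  2 bar
#> 6  3 bar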
Note that you have to be careful in breaking ties -- you'll notice the 'first value' returned in this example was the first in alphabetical order, since that's the ordering of the factor levels and hence the ordering in the table tbl.
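To make the tie-breaking concrete, here's a tiny made-up two-row illustration: with a 1-1 tie, 'bar' is reported only because it sorts first among the factor levels:

x <- factor(c('foo', 'bar'))  # one 'foo', one 'bar' -- a tie
tbl <- table(x)
tbl
#> x
#> bar foo
#>   1   1
names(tbl)[which(tbl == max(tbl))[1]]
#> [1] "bar"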
...
dplyr and data.table solutions will run faster because they create indexes on the grouping variables and then use those indexes to identify related sets of records. That said, two problems come to mind with a dataset of >1.6M records: running time and memory. You may want to benchmark the time it takes to create the index by using a trivial summary function like function(x) 0, and then, knowing that creating the index is order n*log(n), you can calculate the running time for the full set (provided you're not running out of memory).
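For instance, a rough timing sketch along those lines, wrapping system.time around the same tapply call (the toy df from above stands in for the full dataset; substitute a subset of your real data):

# time just the split/grouping overhead with a do-nothing summary
system.time(tapply(df$f, df$id, function(x) 0))
# then time the real summary for comparison
system.time(tapply(df$f, df$id, function(x) {
  tbl <- table(x)
  names(tbl)[which(tbl == max(tbl))[1]]
}))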
For this task, I suggest using data.table, using setkey(DT, id) to index the id column before aggregating the data. There are plenty of examples of this type of task with data.table, including in the introductory vignette (see vignette('datatable-intro')).
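A minimal sketch of that data.table approach, assuming the question's columns are named ID, Sex, AgeCat and Region (the Mode helper and the 'have' object are defined here for illustration; they are not part of data.table):

library(data.table)

# modal-value helper: first value wins on ties, as with tapply above
Mode <- function(x) {
  tbl <- table(x)
  names(tbl)[which.max(tbl)]
}

DT <- as.data.table(have)  # 'have' = the question's input table
setkey(DT, ID)             # index the id column

# overwrite each id's Sex, AgeCat and Region with its most common value
DT[, c('Sex', 'AgeCat', 'Region') := lapply(.SD, Mode),
   by = ID, .SDcols = c('Sex', 'AgeCat', 'Region')]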