SJDS

Reputation: 1289

Apply function across multiple columns

Below is a very small subset of a large data.table I am working with:

dput(dt)
structure(list(id = 1:15, pnum = c(4298390L, 4298390L, 4298390L, 
    4298558L, 4298558L, 4298559L, 4298559L, 4299026L, 4299026L, 4299026L, 
    4299026L, 4300436L, 4300436L, 4303566L, 4303566L), invid = c(15L, 
    101L, 102L, 103L, 104L, 103L, 104L, 106L, 107L, 108L, 109L, 87L, 
    111L, 2L, 60L), fid = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 
    4L, 4L, 4L, 4L, 3L, 3L, 2L, 2L), .Label = c("CORN", "DowCor", 
    "KIM", "Texas"), class = "factor"), dom_kn = c(1L, 0L, 0L, 0L, 
    1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 1L), prim_kn = c(1L, 
    0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L), pat_kn = c(1L, 
    0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L), net_kn = c(1L, 
    0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 1L), age_kn = c(1L, 
    0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L), legclaims = c(5L, 
    0L, 0L, 2L, 5L, 2L, 5L, 0L, 0L, 0L, 0L, 5L, 0L, 5L, 2L), n_inv = c(3L, 
    3L, 3L, 2L, 2L, 2L, 2L, 4L, 4L, 4L, 4L, 2L, 2L, 2L, 2L)), .Names = c("id", 
    "pnum", "invid", "fid", "dom_kn", "prim_kn", "pat_kn", "net_kn", 
    "age_kn", "legclaims", "n_inv"), class = "data.frame", row.names = c(NA, 
    -15L))

I am looking to apply a tweaked greater-than comparison across five different columns.

Within each pnum (patent), there are multiple invid (inventors). I want to compare the values of the columns dom_kn, prim_kn, pat_kn, net_kn, and age_kn per row, to the values in the other rows with the same pnum. The comparison is simply > and if the value is indeed bigger than the other, one "point" should be attributed.

So for the first row pnum == 4298390 and invid == 15, you can see the values in the five columns are all 1, while the values for invid == 101 | 102 are all zero. This means that if we individually compare (is greater than?) each value in the first row to each cell in the second and third row, the total sum would be 10 points. In every single comparison, the value in the first row is bigger and there are 10 comparisons. The number of comparisons is by design 5 * (n_inv -1). The result I am looking for for row 1 should then be 10 / 10 = 1.

For pnum == 4298558 the columns net_kn and age_kn both have value 1 in the two rows (for invid 103 and 104), so each inventor should get 0.5 points in those columns (if there were three inventors with value 1, each should get 0.33 points). The same goes for pnum == 4298559.

For the next pnum == 4299026 all values are zero so every comparison should result in 0 points.

Note the difference between the three possible dyadic comparisons:

1 > 0 --> assign 1
1 = 1 --> assign 1 / number of positive values in column subset
0 = 0 --> assign 0

Desired result: an extra column result in the data.table with the values 1 0 0 0.2 0.8 0.2 0.8 0 0 0 0 1 0 0.8 0.2
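One way to read the rules above is that each column hands out one point per pnum, split evenly among the rows holding a positive value, and each row's total is then divided by the number of columns. A minimal base-R sketch under that reading follows. It is an illustration, not a definitive implementation: share_point is a hypothetical helper, the _kn columns are assumed to be 0/1, and only pnum and the five _kn columns from the dput above are rebuilt.

```r
# pnum and the five *_kn columns, copied from the dput above
df <- data.frame(
  pnum    = c(4298390L, 4298390L, 4298390L, 4298558L, 4298558L,
              4298559L, 4298559L, 4299026L, 4299026L, 4299026L,
              4299026L, 4300436L, 4300436L, 4303566L, 4303566L),
  dom_kn  = c(1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1),
  prim_kn = c(1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0),
  pat_kn  = c(1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0),
  net_kn  = c(1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1),
  age_kn  = c(1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0)
)

# Hypothetical helper: within one pnum and one column, split a single point
# evenly among the rows with a positive value; nobody scores if all are zero.
share_point <- function(x) {
  winners <- x > 0
  if (any(winners)) winners / sum(winners) else rep(0, length(x))
}

kn_cols <- grep("_kn$", names(df), value = TRUE)

# One column of points per *_kn column, computed within each pnum
points <- sapply(kn_cols, function(col) ave(df[[col]], df$pnum, FUN = share_point))

# Assumption: normalise by the number of columns (five here)
df$result <- rowSums(points) / length(kn_cols)
```

Dividing by the number of columns, rather than by the group total, is what makes the all-zero pnum come out as 0 instead of 0/0.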

Any suggestions on how to compute this efficiently?

Thanks!

Upvotes: 1

Views: 1017

Answers (2)

eddi

Reputation: 49448

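library(data.table)
setDT(dt)  # the dput in the question yields a plain data.frame; convert it by reference

vars = grep('_kn', names(dt), value = TRUE)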

# all you need to do is simply assign the correct weight and sum the numbers up
dt[, res := 0]
for (var in vars)
  dt[, res := res + get(var) / .N, by = c('pnum', var)]

# normalize
dt[, res := res/sum(res), by = pnum]
#    id    pnum invid    fid dom_kn prim_kn pat_kn net_kn age_kn legclaims n_inv res
# 1:  1 4298390    15   CORN      1       1      1      1      1         5     3 1.0
# 2:  2 4298390   101   CORN      0       0      0      0      0         0     3 0.0
# 3:  3 4298390   102   CORN      0       0      0      0      0         0     3 0.0
# 4:  4 4298558   103 DowCor      0       0      0      1      1         2     2 0.2
# 5:  5 4298558   104 DowCor      1       1      1      1      1         5     2 0.8
# 6:  6 4298559   103 DowCor      0       0      0      1      1         2     2 0.2
# 7:  7 4298559   104 DowCor      1       1      1      1      1         5     2 0.8
# 8:  8 4299026   106  Texas      0       0      0      0      0         0     4 NaN
# 9:  9 4299026   107  Texas      0       0      0      0      0         0     4 NaN
#10: 10 4299026   108  Texas      0       0      0      0      0         0     4 NaN
#11: 11 4299026   109  Texas      0       0      0      0      0         0     4 NaN
#12: 12 4300436    87    KIM      1       1      1      1      1         5     2 1.0
#13: 13 4300436   111    KIM      0       0      0      0      0         0     2 0.0
#14: 14 4303566     2 DowCor      1       1      1      1      1         5     2 0.8
#15: 15 4303566    60 DowCor      1       0      0      1      0         2     2 0.2

Dealing with the NaN case above (arguably the correct answer) is left to the reader.
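If 0 is preferred over NaN for groups in which nobody scores, one possible way to handle it is to overwrite those rows after normalising. The sketch below is only an illustration on a hypothetical six-row toy table (two of the pnum groups and two of the _kn columns from the question); it repeats the loop above and then replaces the NaN values.

```r
library(data.table)

# Hypothetical toy subset: one pnum with winners, one pnum that is all zero
dt <- data.table(
  pnum   = c(4298558L, 4298558L, 4299026L, 4299026L, 4299026L, 4299026L),
  dom_kn = c(0L, 1L, 0L, 0L, 0L, 0L),
  net_kn = c(1L, 1L, 0L, 0L, 0L, 0L)
)

vars <- grep("_kn$", names(dt), value = TRUE)

# Same weighting and normalisation as above
dt[, res := 0]
for (v in vars) {
  dt[, res := res + get(v) / .N, by = c("pnum", v)]
}
dt[, res := res / sum(res), by = pnum]

# Groups with no positive values produce 0/0; set those rows to 0
dt[is.nan(res), res := 0]
```

Whether 0 or NaN is the right value for such groups depends on how the scores are used downstream, so this replacement is a choice rather than a correction.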

Upvotes: 6

jeremycg

Reputation: 24945

Here's a fastish solution using dplyr:

library(dplyr)
dt %>%
 group_by(pnum) %>% # group by pnum
 mutate_each(funs(. == max(.) & max(.) != 0), ends_with('kn')) %>%
 #give a 1 if the value is the max, and not 0. Only for the column with kn
 mutate_each(funs(. / sum(.)) , ends_with('kn')) %>%
 #correct for multiple maximums
 select(ends_with('kn')) %>%
 #remove all non kn columns
 do(data.frame(x = rowSums(.[-1]), y = sum(.[-1]))) %>%
 #make a new data frame with x = row sums for each individual
 # and y = the sum over all kn columns within the pnum
 mutate(out = x/y)
 #divide by y (we could just use /5 if we always have five columns)

giving your desired output in the column out:

Source: local data frame [15 x 4]
Groups: pnum [6]

      pnum     x     y   out
     (int) (dbl) (dbl) (dbl)
1  4298390     5     5   1.0
2  4298390     0     5   0.0
3  4298390     0     5   0.0
4  4298558     1     5   0.2
5  4298558     4     5   0.8
6  4298559     1     5   0.2
7  4298559     4     5   0.8
8  4299026   NaN   NaN   NaN
9  4299026   NaN   NaN   NaN
10 4299026   NaN   NaN   NaN
11 4299026   NaN   NaN   NaN
12 4300436     5     5   1.0
13 4300436     0     5   0.0
14 4303566     4     5   0.8
15 4303566     1     5   0.2

The NaNs come from the groups with no winners; convert them to 0 using, e.g.:

x[is.na(x)] <- 0
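mutate_each() and funs() have since been deprecated in dplyr. A rough equivalent using across() (dplyr 1.0.0 or later) might look like the sketch below. It is not a drop-in replacement for the pipeline above: share_point is a hypothetical helper, the result is divided by the fixed number of kn columns (the "/5" option mentioned above), and groups with no winners come out as 0 rather than NaN.

```r
library(dplyr)

# pnum and the five *_kn columns, copied from the question's dput
df <- data.frame(
  pnum    = c(4298390L, 4298390L, 4298390L, 4298558L, 4298558L,
              4298559L, 4298559L, 4299026L, 4299026L, 4299026L,
              4299026L, 4300436L, 4300436L, 4303566L, 4303566L),
  dom_kn  = c(1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1),
  prim_kn = c(1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0),
  pat_kn  = c(1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0),
  net_kn  = c(1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1),
  age_kn  = c(1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0)
)

# Hypothetical helper: 1 for rows holding the non-zero group maximum,
# divided by the number of such rows (at least 1, to avoid 0/0)
share_point <- function(x) {
  winner <- x == max(x) & max(x) != 0
  winner / max(sum(winner), 1)
}

n_kn <- sum(grepl("_kn$", names(df)))  # five columns here

res <- df %>%
  group_by(pnum) %>%
  mutate(across(ends_with("_kn"), share_point)) %>%  # per-pnum points
  ungroup() %>%
  mutate(out = rowSums(across(ends_with("_kn"))) / n_kn)
```

As in the original pipeline, the _kn columns in res are overwritten with the per-column points; the input df is left untouched.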

Upvotes: 1
