Reputation: 453
I have a dataset that has an ID variable and thousands of columns of averages. A reproducible example is below. For each ID, I would like to select the column name that contains the value closest to 0.50. If there is a tie, select the lowest value. Is there an efficient way to do this (preferably using dplyr or data.table)?
df = data.frame(ID = paste("ID", 1:1000, sep = ""),
matrix(rnorm(20000), nrow=10))
> df[1:5, 1:5]
ID X1 X2 X3 X4
1 ID1 -0.5532944 -1.20671805 0.75142048 0.56022595
2 ID2 -1.0083010 -0.01534611 1.53546691 -0.08762588
3 ID3 -0.1606776 -0.96947669 -0.38631278 -1.15647134
4 ID4 -0.5957471 -0.20918120 -0.05246698 -0.84235789
5 ID5 0.1569595 -0.62460245 -0.39454014 0.91089249
My goal is to have a dataframe with the ID variable and the column name that contains the value closest to 0.5 as well as the value.
ID T P
1 ID1 X10 0.5671
2 ID2 X100 0.4999
3 ID3 X34 0.5877
4 ID4 X21 0.5055
5 ID5 X15 0.4987
Upvotes: 2
Views: 733
Reputation: 42602
Here is a different approach which reshapes the dataset from wide to long format using melt()
.
# create sample data: ID has constant length, values are rounded to 3 digits
set.seed(2020)
df = data.frame(ID = sprintf("ID%04i", 1:1000),
matrix(round(rnorm(20000), 3), nrow=10))
target <- 0.5
library(data.table)
long <- melt(setDT(df), "ID")
long[, .SD[which.min(abs(value - target))], by = ID]
ID variable value 1: ID0001 X1924 0.501 2: ID0002 X1440 0.499 3: ID0003 X906 0.500 4: ID0004 X180 0.503 5: ID0005 X1757 0.498 --- 996: ID0996 X1568 0.500 997: ID0997 X565 0.501 998: ID0998 X613 0.502 999: ID0999 X1344 0.500 1000: ID1000 X1018 0.501
Now, the OP has requested to select the lower value in case of ties. This can be achieved by ordering:
long[order(ID, value), .SD[which.min(abs(value - target))], by = ID]
ID variable value 1: ID0001 X1924 0.501 2: ID0002 X1440 0.499 3: ID0003 X906 0.500 4: ID0004 X180 0.503 5: ID0005 X1757 0.498 --- 996: ID0996 X1568 0.500 997: ID0997 X565 0.501 998: ID0998 X613 0.502 999: ID0999 X1344 0.500 1000: ID1000 X1971 0.499
Note the difference in row 1000.
By chaining the data.table
expressions the statement can be written as a "one-liner":
melt(setDT(df), "ID")[order(ID, value), .SD[which.min(abs(value - target))], by = ID]
Also note that the sample dataset has been modified
set.seed()
is used to ensure that the generated random numbers are reproducible.sprintf("ID%04i", 1:1000)
instead of paste()
, ID
has a fixed length. This helps to maintain a consistent sort order.Upvotes: 3
Reputation: 5798
Base R solution which should always select the lower value in the event of a tie:
num_cols_idx <- which(sapply(df, is.numeric))
min_vec <- sapply(split(df, rownames(df)), function(x) {
sorted_named_vec <- sort(unlist(x[num_cols_idx]))
names(sorted_named_vec)[which.min(abs(sorted_named_vec - 0.5))]
},
simplify = TRUE)
Upvotes: 1
Reputation: 3923
I think this is what you want. Because your play data repeats itself I would check on some other more random data. Yes it uses the first "X" variable it finds which is what I assume you mean by "lowest.
library(dplyr)
set.seed(2020)
df <- data.frame(ID = paste("ID", 1:1000, sep = ""),
matrix(rnorm(20000), nrow=10))
results <- df %>%
rowwise %>%
summarise(ID = ID,
col_index = which.min(abs(c_across(X1:X2000) - 0.5)) + 1,
whichcolumn = colnames(.[col_index]),
value = nth(c_across(X1:X2000), which.min(abs(c_across(X1:X2000) - .5))),
.groups = "rowwise")
results
#> # A tibble: 1,000 x 4
#> # Rowwise:
#> ID col_index whichcolumn value
#> <chr> <dbl> <chr> <dbl>
#> 1 ID1 1925 X1924 0.501
#> 2 ID2 1441 X1440 0.499
#> 3 ID3 907 X906 0.500
#> 4 ID4 181 X180 0.503
#> 5 ID5 1758 X1757 0.498
#> 6 ID6 1569 X1568 0.500
#> 7 ID7 566 X565 0.501
#> 8 ID8 1448 X1447 0.502
#> 9 ID9 1345 X1344 0.500
#> 10 ID10 1019 X1018 0.501
#> # … with 990 more rows
Upvotes: 1
Reputation: 4358
I've simplified the example code to the smaller subset to spare my processor as worked through the code:
data.frame(
ID = df[1:5,1],
T = apply(df[1:5, 2:5],1, function(x) colnames(df)[which.min(abs(x - 0.5))]),
P = apply(df[1:5, 2:5],1, function(x) x[which.min(abs(x - 0.5))])
)
Upvotes: 0