Reputation: 77
I have a dataframe called result with 5 columns (x, y, label, NN.idx and dist), representing respectively the position of an observation in the plane, a label for distinguishing (x,y) duplicates (see my remark below), the index of its nearest neighbour in another dataframe, and the distance to it.
Remark: Each (x,y) combination may appear one to three times, and if it appears more than once, the occurrences are distinguished by different labels (e.g. rows 1, 4 and 5 in the example below). Also note that two different points may have the same label, which is a quantity I calculated from previous data manipulation, e.g. rows 1 and 3 have the same label although they clearly do not represent the same point (x,y).
Here is an example :
result <- data.frame(x=c(0.147674, 0.235356 ,0.095337, 0.147674, 0.147674, 1.000000, 2.000000), y=c(0.132956, 0.150813, 0.087345, 0.132956, 0.132956, 2.000000, 1.000000), label = c(5,6,5,6,7,3,9), NN.idx =c(4325,2703,21282,3460,12,4,10), dist=c(0.02391247,0.03171236,0.01760940,0.03136304, 0.02315468, 0.01567365, 0.02314860))
result
         x        y label NN.idx       dist
1 0.147674 0.132956     5   4325 0.02391247
2 0.235356 0.150813     6   2703 0.03171236
3 0.095337 0.087345     5  21282 0.01760940
4 0.147674 0.132956     6   3460 0.03136304
5 0.147674 0.132956     7     12 0.02315468
6 1.000000 2.000000     3      4 0.01567365
7 2.000000 1.000000     9     10 0.02314860
What I would like to do is reshape this dataframe efficiently (the actual dataframe is much larger) to a wide format where each row corresponds to a unique (x,y) combination, with columns NN.idx_1, NN.idx_2, NN.idx_3, dist_1, dist_2, dist_3 giving the NN.idx and dist for each occurrence of that (x,y) combination in the original dataframe (filled with NA if the combination appears only once or twice).
I am relatively new to R and only know the basics, but I think I might have a solution using data.table and dcast as follows:
library(data.table)
df <- setDT(result)   # converts result to a data.table by reference
df[, NN.counter := 1:.N, by = c("x", "y")]   # number the occurrences of each (x,y)
df <- dcast(df, x + y ~ NN.counter, value.var = c("NN.idx", "dist"))
head(df)
          x        y NN.idx_1 NN.idx_2 NN.idx_3     dist_1     dist_2     dist_3
1: 0.095337 0.087345    21282       NA       NA 0.01760940         NA         NA
2: 0.147674 0.132956     4325     3460       12 0.02391247 0.03136304 0.02315468
3: 0.235356 0.150813     2703       NA       NA 0.03171236         NA         NA
4: 1.000000 2.000000        4       NA       NA 0.01567365         NA         NA
5: 2.000000 1.000000       10       NA       NA 0.02314860         NA         NA
My question is the following: is my approach OK? I am not familiar with dcast, and the notation x+y ~ NN.counter makes me wonder whether two different points (x,y) that give the same sum x+y would be treated as different (e.g. rows 6 and 7 of my original dataframe, where x and y are swapped). It seems to work in this example.
Does anyone have a better approach to deal with this duplicate issue, or is mine OK? Also, I don't know whether this is reasonably fast, though I've read that data.table is pretty fast.
Upvotes: 0
Views: 71
Reputation: 160447
Since x and y are both numeric, you might run into problems based on floating-point precision (i.e., R FAQ 7.31 and IEEE-754). While it might work, I don't know that I would strictly rely on it (without a lot of verification). It might be useful (for the purpose of reshaping) to coerce to fixed-length strings (e.g., sprintf("%0.06f", x)) before grouping and dcasting.
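To illustrate the precision concern, here is a minimal sketch in base R. The values a and b are made up for the demonstration and do not come from the question's data; the point is only that two doubles can print the same yet compare unequal, while fixed-width string keys built from them agree.

```r
# Two doubles that look the same when printed but differ in their last bits
a <- 0.1 + 0.2
b <- 0.3

a == b                    # FALSE: exact comparison of doubles
isTRUE(all.equal(a, b))   # TRUE: comparison within a small tolerance

# Fixed-width string keys (6 decimals) built from the same values
key_a <- sprintf("%0.06f", a)
key_b <- sprintf("%0.06f", b)
identical(key_a, key_b)   # TRUE: both are "0.300000"
```

Whether this matters for your data depends on how x and y were computed; if they were read from a file with fixed decimals, equal values are likely to be bit-identical anyway.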
Here's a thought that implements that workaround. (Note: I'm using magrittr solely to break out the steps with the %>% pipe; it is not required for this to work.)
library(data.table)
library(magrittr)
result <- data.table(x=c(0.147674, 0.235356 ,0.095337, 0.147674, 0.147674, 1.000000, 2.000000), y=c(0.132956, 0.150813, 0.087345, 0.132956, 0.132956, 2.000000, 1.000000), label = c(5,6,5,6,7,3,9), NN.idx =c(4325,2703,21282,3460,12,4,10), dist=c(0.02391247,0.03171236,0.01760940,0.03136304, 0.02315468, 0.01567365, 0.02314860))
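# fixed-width string versions of x and y, used only as grouping keys
result[, c("x_s", "y_s") := lapply(.(x, y), sprintf, fmt = "%0.09f") ]
savexy <- unique(result[, .(x, y, x_s, y_s) ]) # merge back in later with "real" numbers
result2 <- copy(result) %>%
  .[, c("x", "y") := NULL ] %>%
  # number the occurrences of each (x_s, y_s) pair
  .[, NN.counter := seq_len(.N), by = c("x_s", "y_s") ] %>%
  # one row per pair, one column per occurrence and value
  dcast(x_s + y_s ~ NN.counter, value.var = c("NN.idx", "dist") ) %>%
  # restore the numeric x and y, then drop the string keys
  merge(., savexy, by = c("x_s", "y_s"), all.x = TRUE) %>%
  .[, c("x_s", "y_s") := NULL ] %>%
  setcolorder(., c("x", "y"))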
result2
#           x        y NN.idx_1 NN.idx_2 NN.idx_3     dist_1     dist_2     dist_3
# 1: 0.095337 0.087345    21282       NA       NA 0.01760940         NA         NA
# 2: 0.147674 0.132956     4325     3460       12 0.02391247 0.03136304 0.02315468
# 3: 0.235356 0.150813     2703       NA       NA 0.03171236         NA         NA
# 4: 1.000000 2.000000        4       NA       NA 0.01567365         NA         NA
# 5: 2.000000 1.000000       10       NA       NA 0.02314860         NA         NA
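On the formula question: in dcast, the left-hand side x + y lists the columns that identify a row; it does not add them together, so points with the same sum but different coordinates stay separate. A small sketch to check this, where pts, val and counter are names invented for the demonstration:

```r
library(data.table)

# (1,2) appears twice, (2,1) once; both points have the same sum x + y = 3
pts <- data.table(x = c(1, 2, 1), y = c(2, 1, 2), val = c(10, 20, 30))
pts[, counter := seq_len(.N), by = c("x", "y")]

# x + y on the left-hand side means "one output row per (x, y) pair"
wide <- dcast(pts, x + y ~ counter, value.var = "val")
wide        # two rows: (1,2) with values 10 and 30, (2,1) with 20 and NA
nrow(wide)  # 2
```

If the sum were being used as the key, the result would have a single row, so two rows confirm that rows 6 and 7 of the question's data are handled as distinct points.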
Upvotes: 1