Reputation: 4941
I have this line of code in R:
newDF <- oldDF[subsettingColumns]
subsettingColumns has a few column names that may or may not exist in oldDF. If a column doesn't exist, I want a column of NAs inserted in newDF at the same location. How do I get this to work in R?
Upvotes: 3
Views: 441
Reputation: 609
Recently, I was burned by data.table::rbindlist() screwing with my data, so while developing my own solution (preparing individual tables with different columns for merging), I found this question and its answers. So let's do some benchmarking!
First, I collected and reformatted solutions into nice portable functions.
# @lmo
subset_missing1 = function(x, select){
  y = lapply(select, \(y) if(y %in% names(x)) x[[y]] else NA)
  data.frame(y) |> setNames(select)
}
# by @Andre Elrico
subset_missing2 = function(x, select){
  missing = setdiff(select, names(x))
  temp = lapply(missing, \(y) data.frame(NA) |> setNames(y))
  temp = Reduce(cbind, c(list(x), temp))
  temp[select]
}
# by @Santosh M.
subset_missing3 = function(x, select){
  add_columns = function(y, cols){
    cols = setdiff(cols, names(y))
    y[cols] = NA
    y
  }
  y = add_columns(x, select)
  y[select]
}
# by @Ronak Shah
subset_missing4 = function(x, select){
  y = x[intersect(select, names(x))]
  y[setdiff(select, names(x))] = NA
  y[select]
}
Now, some data. Let's use iris plus some random column names.
select = c(names(iris), letters[1:5]) |> sample()
The sample here is important to test that all solutions give you data in the correct order.
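Before timing anything, a quick sanity check that the implementations behave as intended (this assumes the subset_missing* functions defined above are in scope; the specific assertions are my own, not from the thread):

```r
# All variants should return columns in exactly the order of `select`,
# with the fake columns filled with NA and real columns untouched.
out <- subset_missing4(iris, select)
stopifnot(identical(names(out), select))              # requested order kept
stopifnot(all(is.na(out[["a"]])))                     # fake column is all NA
stopifnot(identical(out[["Species"]], iris$Species))  # real data unchanged
```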
microbenchmark::microbenchmark(
  subset_missing1(iris, select),
  subset_missing2(iris, select),
  subset_missing3(iris, select),
  subset_missing4(iris, select)
)
# Unit: microseconds
# expr min lq mean median uq
# subset_missing1(iris, select) 475.550 494.2320 508.67725 506.5595 518.1525
# subset_missing2(iris, select) 494.336 517.9085 532.92533 529.9205 545.6695
# subset_missing3(iris, select) 76.336 82.9365 88.24799 88.7685 93.1335
# subset_missing4(iris, select) 89.816 95.9970 106.16170 107.1715 111.8510
Looks like subset_missing3 and subset_missing4 are the fastest.
What about memory consumption? We can test that using another benchmarking package, bench:
bench::mark(
  subset_missing1(iris, select),
  subset_missing2(iris, select),
  subset_missing3(iris, select),
  subset_missing4(iris, select)
)
# # A tibble: 4 × 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
# <bch:expr> <bch:t> <bch:t> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
# 1 subset_mi… 469.1µs 494.8µs 1986. 3.16KB 6.69 890 3 448ms
# 2 subset_mi… 506.6µs 544.8µs 1822. 3.16KB 9.36 779 4 427ms
# 3 subset_mi… 73.9µs 82.1µs 12059. 29.7KB 6.88 5255 3 436ms
# 4 subset_mi… 86.8µs 96µs 10192. 29.7KB 7.33 4173 3 409ms
# # ℹ 4 more variables: result <list>, memory <list>, time <list>, gc <list>
So they are faster, but burn almost 10x as much memory!
So, iris is fine, but it is also a very small dataset. My problem is quite a bit larger, so let's make some artificial data!
random_string = function(n, size = 5){
  replicate(n, sample(c(letters, LETTERS, 0:9), size = size, replace = TRUE) |> paste0(collapse = ""))
}
names = random_string(5000, size=10) |> unique() |> head(2000)
data = sapply(head(names, 1000), function(x){rnorm(100000)}) |> as.data.frame()
100 000 rows and 1 000 columns should do, plus 1 000 missing columns.
object.size(data) |> format("Gb")
# 0.7 Gb
That represents a largish traditional dataset.
Let's first run the bench package:
bench::mark(
  subset_missing1(data, names),
  subset_missing2(data, names),
  subset_missing3(data, names),
  subset_missing4(data, names)
)
# # A tibble: 4 × 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl>
# 1 subset_missing1(da… 171.13ms 191.24ms 5.32 397.3MB 3.54 3 2
# 2 subset_missing2(da… 217.63ms 235.9ms 4.31 416.26MB 2.87 3 2
# 3 subset_missing3(da… 2.62s 2.62s 0.382 3.62GB 0.382 1 1
# 4 subset_missing4(da… 3.01s 3.01s 0.332 3.62GB 0.664 1 2
# # ℹ 5 more variables: total_time <bch:tm>, result <list>, memory <list>,
# time <list>, gc <list>
# Warning message:
# Some expressions had a GC in every iteration; so filtering is disabled.
Huh, so the two previously fastest solutions are now the slowest (more than 10x as slow!) and consume the most memory, more than 5x our dataset size!
microbenchmark::microbenchmark(
  subset_missing1(data, names),
  subset_missing2(data, names),
  subset_missing3(data, names),
  subset_missing4(data, names),
  times = 10
)
# Unit: milliseconds
# expr min lq mean median uq
# subset_missing1(data, names) 172.6486 180.9187 221.0674 196.6227 243.5631
# subset_missing2(data, names) 217.8651 226.4897 252.0694 240.5091 245.6220
# subset_missing3(data, names) 2729.3352 2839.4274 2951.2346 2928.4190 3078.8228
# subset_missing4(data, names) 2798.1522 2964.6042 3040.4599 3065.5795 3093.8116
Microbenchmark confirms this. Solutions 3 and 4, by @Santosh M. and @Ronak Shah, might be faster on a small dataset, but they blow up when running on a large one.
The winner (by a small margin) is the solution by @lmo.
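A plausible explanation (my reading, not something verified in this thread): solutions 3 and 4 assign new columns into the existing large data frame (`y[cols] = NA`), and assignment into a data frame triggers R's copy-on-modify, so the big object gets duplicated; solutions 1 and 2 only ever build small new objects. tracemem() can make the copying visible:

```r
# Illustrative sketch: watch the data frame get duplicated when new columns
# are assigned into it (requires an R build with memory profiling enabled,
# which is the default for CRAN binaries).
big <- as.data.frame(matrix(rnorm(1e6), ncol = 100))
tracemem(big)        # prints a "tracemem[...]" line whenever `big` is copied
big["new_col"] <- NA # assignment into the data frame triggers a copy
untracemem(big)
```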
Upvotes: 0
Reputation: 388982
Let's take an example:
df <- data.frame(a = c(1, 2, 3, 4), b = c(4, 5, 6, 7))
df
# a b
#1 1 4
#2 2 5
#3 3 6
#4 4 7
#Columns to take subset of
subsettingColumns <- c('a', 'd', 'e')
#Columns which are already present
cols <- subsettingColumns[subsettingColumns %in% names(df)]
#Add them in the new dataframe
newdf <- df[cols]
#Assign NA to the columns which are not defined in the original dataframe
newdf[setdiff(subsettingColumns, cols)] <- NA
newdf
# a d e
#1 1 NA NA
#2 2 NA NA
#3 3 NA NA
#4 4 NA NA
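The steps above can be wrapped into a small reusable function (the name subset_with_na is my own, not from the answer; the final subset also restores the requested column order, which matters when existing and missing names are interleaved):

```r
subset_with_na <- function(df, cols) {
  out <- df[intersect(cols, names(df))]  # keep the columns that exist
  out[setdiff(cols, names(df))] <- NA    # add the missing ones as NA
  out[cols]                              # restore the requested order
}

df <- data.frame(a = c(1, 2, 3, 4), b = c(4, 5, 6, 7))
subset_with_na(df, c('d', 'a'))
#    d a
# 1 NA 1
# 2 NA 2
# 3 NA 3
# 4 NA 4
```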
Upvotes: 6
Reputation: 38500
Building on the basic structure of Andre Elrico's answer, you can do the following:
newDf <- data.frame(sapply(subsettingColumns,
function(x) if(x %in% names(oldDF)) oldDF[[x]] else NA))
The first 6 rows of which are:
head(newDf)
am IDontExist gear IAlsoDontExist
1 1 NA 4 NA
2 1 NA 4 NA
3 1 NA 4 NA
4 0 NA 3 NA
5 0 NA 3 NA
6 0 NA 3 NA
Data
oldDF <- mtcars
subsettingColumns <- c("am","IDontExist","gear","IAlsoDontExist")
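One subtlety worth noting (my observation, not part of the answer): because this rebuilds the data frame from bare column vectors, the row names of oldDF are not carried over — compare the `1 2 3 ...` row labels in the output above with the car names in the other mtcars-based answer.

```r
oldDF <- mtcars
subsettingColumns <- c("am", "IDontExist", "gear", "IAlsoDontExist")
newDf <- data.frame(sapply(subsettingColumns,
    function(x) if (x %in% names(oldDF)) oldDF[[x]] else NA))
rownames(newDf)[1:2]  # "1" "2" rather than "Mazda RX4" "Mazda RX4 Wag"
```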
Upvotes: 2
Reputation: 2454
You can write a function that adds columns to the data frame when they do not exist, as below:
AddColumn <- function(oldDF, subsettingColumns) {
  addCol <- subsettingColumns[!subsettingColumns %in% names(oldDF)]
  if(length(addCol) != 0) oldDF[addCol] <- NA
  oldDF
}
Testing this function on example data:
# Example data
oldDF <- data.frame(A = c(1, 2, 3, 4, 5), B = c(11, 12, 13, 14, 15))
AddColumn(oldDF, "testColumn")
# A B testColumn
#1 1 11 NA
#2 2 12 NA
#3 3 13 NA
#4 4 14 NA
#5 5 15 NA
AddColumn(oldDF, c("testColumn1", "testColumn2"))
# A B testColumn1 testColumn2
#1 1 11 NA NA
#2 2 12 NA NA
#3 3 13 NA NA
#4 4 14 NA NA
#5 5 15 NA NA
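Note that AddColumn only appends the missing columns at the end; to also restrict and reorder the result to the requested columns, as the question asks, you can subset afterwards. A sketch reusing the AddColumn function above:

```r
oldDF <- data.frame(A = c(1, 2), B = c(11, 12))
subsettingColumns <- c("testColumn", "A")
newDF <- AddColumn(oldDF, subsettingColumns)[subsettingColumns]
newDF
#   testColumn A
# 1         NA 1
# 2         NA 2
```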
Upvotes: 5
Reputation: 11480
Data
oldDF <- mtcars
subsettingColumns <- c("am","IDontExist","gear","IAlsoDontExist")
Get the unknown columns and bind them on as NA columns (note: the setdiff should use names(oldDF), not names(mtcars), so it works for any input):
unknownCol <- setdiff(subsettingColumns, names(oldDF))
tempDF <- lapply(unknownCol, function(x){ df = data.frame(A = NA); names(df) = x; df })
oldDF <- Reduce(cbind, c(list(oldDF), tempDF))
newDF <- oldDF[subsettingColumns]
head(newDF)
am IDontExist gear IAlsoDontExist
Mazda RX4 1 NA 4 NA
Mazda RX4 Wag 1 NA 4 NA
Datsun 710 1 NA 4 NA
Hornet 4 Drive 0 NA 3 NA
Hornet Sportabout 0 NA 3 NA
Valiant 0 NA 3 NA
Upvotes: 3