Reputation: 4941
I have this line of code in R:
newDF <- oldDF[subsettingColumns]
subsettingColumns has a few column names that may or may not exist in oldDF. If a column doesn't exist, I want a column of NAs inserted in newDF at the same location. How do I get this to work in R?
Upvotes: 3
Views: 441
Reputation: 609
Recently, I was burned by data.table::rbindlist() screwing with my data, so while developing my own solution (preparing individual tables with different columns for merging), I found this question and its answers. So let's do some benchmarking!
First, I collected and reformatted solutions into nice portable functions.
# @lmo
subset_missing1 = function(x, select){
  y = lapply(select, \(y) if(y %in% names(x)) x[[y]] else NA)
  data.frame(y) |> setNames(select)
}
# by @Andre Elrico
subset_missing2 = function(x, select){
  missing = setdiff(select, names(x))
  temp = lapply(missing, \(y) data.frame(NA) |> setNames(y))
  temp = Reduce(cbind, c(list(x), temp))
  temp[select]
}
# by @Santosh M.
subset_missing3 = function(x, select){
  add_columns = function(y, cols){
    cols = setdiff(cols, names(y))
    y[cols] = NA
    y
  }
  y = add_columns(x, select)
  y[select]
}
# by @Ronak Shah
subset_missing4 = function(x, select){
  y = x[intersect(select, names(x))]
  y[setdiff(select, names(x))] = NA
  y[select]
}
Now, some data. Let's use iris plus some random column names.
select = c(names(iris), letters[1:5]) |> sample()
The sample here is important to test that all solutions give you data in the correct order.
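Before timing anything, a quick sanity check that the implementations behave as intended (this assumes the subset_missing* functions defined above are in scope; the specific assertions are my own, not from the thread):

```r
# All variants should return columns in exactly the order of `select`,
# with the fake columns filled with NA and real columns untouched.
out <- subset_missing4(iris, select)
stopifnot(identical(names(out), select))              # requested order kept
stopifnot(all(is.na(out[["a"]])))                     # fake column is all NA
stopifnot(identical(out[["Species"]], iris$Species))  # real data unchanged
```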
microbenchmark::microbenchmark(
  subset_missing1(iris, select),
  subset_missing2(iris, select),
  subset_missing3(iris, select),
  subset_missing4(iris, select)
)
# Unit: microseconds
# expr min lq mean median uq
# subset_missing1(iris, select) 475.550 494.2320 508.67725 506.5595 518.1525
# subset_missing2(iris, select) 494.336 517.9085 532.92533 529.9205 545.6695
# subset_missing3(iris, select) 76.336 82.9365 88.24799 88.7685 93.1335
# subset_missing4(iris, select) 89.816 95.9970 106.16170 107.1715 111.8510
Looks like subset_missing3 and subset_missing4 are the fastest.
What about memory consumption? We can test that using another benchmarking package, bench:
bench::mark(
  subset_missing1(iris, select),
  subset_missing2(iris, select),
  subset_missing3(iris, select),
  subset_missing4(iris, select)
)
# # A tibble: 4 × 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
# <bch:expr> <bch:t> <bch:t> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
# 1 subset_mi… 469.1µs 494.8µs 1986. 3.16KB 6.69 890 3 448ms
# 2 subset_mi… 506.6µs 544.8µs 1822. 3.16KB 9.36 779 4 427ms
# 3 subset_mi… 73.9µs 82.1µs 12059. 29.7KB 6.88 5255 3 436ms
# 4 subset_mi… 86.8µs 96µs 10192. 29.7KB 7.33 4173 3 409ms
# # ℹ 4 more variables: result <list>, memory <list>, time <list>, gc <list>
So they are faster, but burn almost 10x as much memory!
So, iris is fine, but it is also a very small dataset. My problem is quite a bit larger, so let's make some artificial data!
random_string = function(n, size = 5){
  replicate(n, sample(c(letters, LETTERS, 0:9), size = size, replace = TRUE) |> paste0(collapse = ""))
}
names = random_string(5000, size=10) |> unique() |> head(2000)
data = sapply(head(names, 1000), function(x){rnorm(100000)}) |> as.data.frame()
100 000 rows and 1 000 columns should do, plus 1 000 missing columns.
object.size(data) |> format("Gb")
# 0.7 Gb
That represents a largish traditional dataset.
Let's first run the bench package:
bench::mark(
  subset_missing1(data, names),
  subset_missing2(data, names),
  subset_missing3(data, names),
  subset_missing4(data, names)
)
# # A tibble: 4 × 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl>
# 1 subset_missing1(da… 171.13ms 191.24ms 5.32 397.3MB 3.54 3 2
# 2 subset_missing2(da… 217.63ms 235.9ms 4.31 416.26MB 2.87 3 2
# 3 subset_missing3(da… 2.62s 2.62s 0.382 3.62GB 0.382 1 1
# 4 subset_missing4(da… 3.01s 3.01s 0.332 3.62GB 0.664 1 2
# # ℹ 5 more variables: total_time <bch:tm>, result <list>, memory <list>,
# time <list>, gc <list>
# Warning message:
# Some expressions had a GC in every iteration; so filtering is disabled.
Huh, so the two previously fastest solutions are now the slowest (more than 10x as slow!) and consume the most memory, more than 5x our dataset size!
microbenchmark::microbenchmark(
  subset_missing1(data, names),
  subset_missing2(data, names),
  subset_missing3(data, names),
  subset_missing4(data, names),
  times = 10
)
# Unit: milliseconds
# expr min lq mean median uq
# subset_missing1(data, names) 172.6486 180.9187 221.0674 196.6227 243.5631
# subset_missing2(data, names) 217.8651 226.4897 252.0694 240.5091 245.6220
# subset_missing3(data, names) 2729.3352 2839.4274 2951.2346 2928.4190 3078.8228
# subset_missing4(data, names) 2798.1522 2964.6042 3040.4599 3065.5795 3093.8116
Microbenchmark confirms this. Solutions 3 and 4, by @Santosh M. and @Ronak Shah, might be faster on a small dataset, but they blow up when running on a large one.
The winner (by a small margin) is the solution by @lmo.
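A plausible explanation (my reading, not something verified in this thread): solutions 3 and 4 assign new columns into the existing large data frame (`y[cols] = NA`), and assignment into a data frame triggers R's copy-on-modify, so the big object gets duplicated; solutions 1 and 2 only ever build small new objects. tracemem() can make the copying visible:

```r
# Illustrative sketch: watch the data frame get duplicated when new columns
# are assigned into it (requires an R build with memory profiling enabled,
# which is the default for CRAN binaries).
big <- as.data.frame(matrix(rnorm(1e6), ncol = 100))
tracemem(big)        # prints a "tracemem[...]" line whenever `big` is copied
big["new_col"] <- NA # assignment into the data frame triggers a copy
untracemem(big)
```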
Upvotes: 0
Reputation: 388982
Let's take an example:
df <- data.frame(a = c(1, 2, 3, 4), b = c(4, 5, 6, 7))
df
# a b
#1 1 4
#2 2 5
#3 3 6
#4 4 7
#Columns to take subset of
subsettingColumns <- c('a', 'd', 'e')
#Columns which are already present
cols <- subsettingColumns[subsettingColumns %in% names(df)]
#Add them in the new dataframe
newdf <- df[cols]
#Assign NA to the columns which are not defined in the original dataframe
newdf[setdiff(subsettingColumns, cols)] <- NA
newdf
# a d e
#1 1 NA NA
#2 2 NA NA
#3 3 NA NA
#4 4 NA NA
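The steps above can be wrapped into a small reusable function (the name subset_with_na is my own, not from the answer; the final subset also restores the requested column order, which matters when existing and missing names are interleaved):

```r
subset_with_na <- function(df, cols) {
  out <- df[intersect(cols, names(df))]  # keep the columns that exist
  out[setdiff(cols, names(df))] <- NA    # add the missing ones as NA
  out[cols]                              # restore the requested order
}

df <- data.frame(a = c(1, 2, 3, 4), b = c(4, 5, 6, 7))
subset_with_na(df, c('d', 'a'))
#    d a
# 1 NA 1
# 2 NA 2
# 3 NA 3
# 4 NA 4
```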
Upvotes: 6
Reputation: 38500
Building on the basic structure of Andre Elrico's answer, you can do the following:
newDf <- data.frame(sapply(subsettingColumns,
function(x) if(x %in% names(oldDF)) oldDF[[x]] else NA))
The first 6 rows of which are:
head(newDf)
am IDontExist gear IAlsoDontExist
1 1 NA 4 NA
2 1 NA 4 NA
3 1 NA 4 NA
4 0 NA 3 NA
5 0 NA 3 NA
6 0 NA 3 NA
Data
oldDF <- mtcars
subsettingColumns <- c("am","IDontExist","gear","IAlsoDontExist")
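One subtlety worth noting (my observation, not part of the answer): because this rebuilds the data frame from bare column vectors, the row names of oldDF are not carried over — compare the `1 2 3 ...` row labels in the output above with the car names in the other mtcars-based answer.

```r
oldDF <- mtcars
subsettingColumns <- c("am", "IDontExist", "gear", "IAlsoDontExist")
newDf <- data.frame(sapply(subsettingColumns,
    function(x) if (x %in% names(oldDF)) oldDF[[x]] else NA))
rownames(newDf)[1:2]  # "1" "2" rather than "Mazda RX4" "Mazda RX4 Wag"
```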
Upvotes: 2
Reputation: 2454
You can write a function that adds columns to the data frame when they do not exist, as below:
AddColumn <- function(oldDF, subsettingColumns) {
  addCol <- subsettingColumns[!subsettingColumns %in% names(oldDF)]
  if(length(addCol) != 0) oldDF[addCol] <- NA
  oldDF
}
Testing this function on example data:
# Example data
oldDF <- data.frame(A = c(1, 2, 3, 4, 5), B = c(11, 12, 13, 14, 15))
AddColumn(oldDF, "testColumn")
# A B testColumn
#1 1 11 NA
#2 2 12 NA
#3 3 13 NA
#4 4 14 NA
#5 5 15 NA
AddColumn(oldDF, c("testColumn1", "testColumn2"))
# A B testColumn1 testColumn2
#1 1 11 NA NA
#2 2 12 NA NA
#3 3 13 NA NA
#4 4 14 NA NA
#5 5 15 NA NA
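Note that AddColumn only appends the missing columns at the end; to also restrict and reorder the result to the requested columns, as the question asks, you can subset afterwards. A sketch reusing the AddColumn function above:

```r
oldDF <- data.frame(A = c(1, 2), B = c(11, 12))
subsettingColumns <- c("testColumn", "A")
newDF <- AddColumn(oldDF, subsettingColumns)[subsettingColumns]
newDF
#   testColumn A
# 1         NA 1
# 2         NA 2
```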
Upvotes: 5
Reputation: 11480
Data
oldDF <- mtcars
subsettingColumns <- c("am","IDontExist","gear","IAlsoDontExist")
Get the unknown columns and bind them on as NA columns (note: the setdiff should use names(oldDF), not names(mtcars), so it works for any input):
unknownCol <- setdiff(subsettingColumns, names(oldDF))
tempDF <- lapply(unknownCol, function(x){ df = data.frame(A = NA); names(df) = x; df })
oldDF <- Reduce(cbind, c(list(oldDF), tempDF))
newDF <- oldDF[subsettingColumns]
head(newDF)
am IDontExist gear IAlsoDontExist
Mazda RX4 1 NA 4 NA
Mazda RX4 Wag 1 NA 4 NA
Datsun 710 1 NA 4 NA
Hornet 4 Drive 0 NA 3 NA
Hornet Sportabout 0 NA 3 NA
Valiant 0 NA 3 NA
Upvotes: 3