Mark_1
Mark_1

Reputation: 371

Extract single value from dataframe based on logical condition, Tidyverse style

I often need to extract a single value from a dataframe, e.g. extract the value of "var1" for Gorilla gorilla:

example_data <- data.frame(species = c("Pan_troglodytes", "Gorilla_gorilla", "Pongo_pygmaeus"),
                           var1 = c(5.88, 6.07, 5.83),
                           var2 = c(10.6, 11.2, 10.5)
                           )
example_data

          species var1 var2
1 Pan_troglodytes 5.88 10.6
2 Gorilla_gorilla 6.07 11.2
3  Pongo_pygmaeus 5.83 10.5

The way I learnt to do this with base R is with:

example_data[[which(example_data$species == "Gorilla_gorilla"), "var1"]]

[1] 6.07

This method feels conceptually very efficient (identify location of value -> extract value) and is certainly computationally efficient (see benchmarking below). However, it doesn't seem very efficient from a coding standpoint: the name of the data variable has to be repeated, and this can lead to visually messy code, especially if the name of the data variable is long or if the method is used inside a function or within a block of other code, etc. Ideally, I'd prefer to use a tidyverse-style method that only requires you to state the data variable once. I know the same thing can be achieved with dplyr::filter():

library(dplyr)

filter(example_data, species == "Gorilla_gorilla")$var1

[1] 6.07

And this code is a lot cleaner, but it feels conceptually less efficient (perform operation on entire dataframe -> extract value) and is computationally slower (although we're talking microseconds, so this probably isn't really an issue).

library(microbenchmark)

Unit: microseconds
                                                                          expr     min       lq      mean   median       uq      max neval
 example_data[[which(example_data$species == "Gorilla_gorilla"),      "var1"]]   6.301   8.8505  15.48891  15.9505  19.1510   77.701   100
                       filter(example_data, species == "Gorilla_gorilla")$var1 638.801 801.8005 904.89208 859.3505 936.8505 1782.901   100

My question is: is there another way of conditionally extracting single values from a dataframe, one that has the benefits of the tidyverse's clean code but is more direct like the base R way? Or am I just being silly and there's no reason not to just use the dplyr::filter() method?

(Also, I know that I could set the "species" variable to rownames and then use example_data["Gorilla_gorilla", "var1"], which is both clean and efficient, but this would violate the tidyverse principle of not using row names).

It just seems like there's currently a gap in the tidyverse: it has dplyr::filter as the equivalent of base data[n, ]; dplyr::select as the equivalent of base data[, n]; but doesn't seem to have an equivalent of base data[n, n].

Upvotes: 2

Views: 4630

Answers (4)

Mark_1
Mark_1

Reputation: 371

Thanks @hello_friend, that's a really useful survey and reference for people. For further reference, here's a benchmarking of the different methods (which is why I have to post this as an answer rather than as a comment). Again though, we're talking microseconds, so the differences may not be practically meaningful, though I think that they're quite interesting.

I think I'm actually going to stick with my original filter(example_data, species == "Gorilla_gorilla")$var1. Having now been able to comparing it to all the other options it actually seems to hold up quite well, and it's just a hair more concise than what would be my second choice, subset(example_data, species == "Gorilla_gorilla", select = "var1")[[1]], which for me for practical purposes outweighs it being some microseconds slower.

Thanks for the tip about data.table though @sindri_baldur, I hadn't come across it. It's interesting to see that it isn't really any faster for this particular operation, though, even when the conversion of the dataframe is done beforehand (DT_data[species == "Gorilla_gorilla", "var1"])

   expr                                                                                          mean_time_microseconds package     
 1 "example_data$var1[match(\"Gorilla_gorilla\", example_data$species)]"                                         15413. base      
 2 "example_data[match(\"Gorilla_gorilla\", example_data$species),      \"var1\", drop = FALSE]"                561586. base      
 3 "subset(example_data, species == \"Gorilla_gorilla\", select = \"var1\")"                                    579747. base      
 4 "subset(example_data, species == \"Gorilla_gorilla\", select = \"var1\")[[1]]"                               604293. base      
 5 "filter(example_data, species == \"Gorilla_gorilla\")$var1"                                                 1757617. dplyr     
 6 "example_data %>% filter(species == \"Gorilla_gorilla\") %>% pull(var1)"                                    2699485. dplyr     
 7 "DT_data[species == \"Gorilla_gorilla\", \"var1\"]"                                                         2877631. data.table
 8 "setDT(example_data)[species == \"Gorilla_gorilla\", \"var1\"]"                                             3044128. data.table
 9 "setDT(example_data)[species == \"Gorilla_gorilla\", 2]"                                                    3094515. data.table
10 "as.matrix(setDT(example_data)[species == \"Gorilla_gorilla\", 2])[1]"                                      3168429. data.table
11 "setDT(example_data)[species == \"Gorilla_gorilla\", .SD, .SDcols = 2]"                                     3533138. data.table
12 "example_data %>% slice(which(species == \"Gorilla_gorilla\")) %>%      select(var1)"                       5835452. dplyr     
13 "example_data %>% filter(species == \"Gorilla_gorilla\") %>% select(var1)"                                  5882807. dplyr

enter image description here

Upvotes: 2

hello_friend
hello_friend

Reputation: 5788

There are many ways to skin a cat in R; here are lots (keep in mind you can continually mix and match alot of the functions used below) in general you have correctly observed the inefficiency of tidyverse functions:

# Base R method 1: => returns data.frame:
subset(example_data, species == "Gorilla_gorilla", select = "var1")

# Base R method 2: => returns vector length 1 (R's scalar): 
example_data$var1[match("Gorilla_gorilla", example_data$species)]

# Base R method 3: => (result as df): 
example_data[match("Gorilla_gorilla", example_data$species), "var1", drop = FALSE]

# Tidyverse method 1: => returns df
library(tidyverse)
example_data %>% 
  slice(which(species == "Gorilla_gorilla")) %>% 
  select(var1)

# Tidyverse method 2: => returns df
example_data %>% 
  filter(species == "Gorilla_gorilla") %>% 
  select(var1)

# Tidyverse method 3: => returns vector
example_data %>% 
  filter(species == "Gorilla_gorilla") %>% 
  pull(var1)

# data.table method 1: => returns data.table / data.frame object
library(data.table)
setDT(example_data)[species == "Gorilla_gorilla", "var1"]

# data.tabel method 2: => returns data.table / data.frame object: 
setDT(example_data)[species == "Gorilla_gorilla", 2]

# data.tabel method 3: => returns data.table / data.frame object: 
setDT(example_data)[species == "Gorilla_gorilla", .SD, .SDcols = 2]

# data.table method 4: => return vector: 
as.matrix(setDT(example_data)[species == "Gorilla_gorilla", 2])[1]

Upvotes: 5

s_baldur
s_baldur

Reputation: 33498

Are you familiar with data.table?

It meets your three criteria of being concise, (really) efficient, and similar to base R syntax:

library(data.table)
setDT(example_data) # Convert to data.table, convert back with setDF()
example_data[species == "Gorilla_gorilla", var1]
# [1] 6.07

If you do this repeatedly you can set a key to avoid typing the name of the filter variable:

setkey(example_data, species)

example_data["Gorilla_gorilla", var1]
# [1] 6.07
example_data["Pongo_pygmaeus", var1]
# [1] 5.83

Upvotes: 4

starja
starja

Reputation: 10365

The complete tidyverse way would be to use pull to only get a variable and not a complete data.frame:

example_data %>% 
  filter(species == "Gorilla_gorilla") %>% 
  pull(var1)
[1] 6.07

However, I haven't tested it for computational efficiency.

Upvotes: 3

Related Questions