Reputation: 371
I often need to extract a single value from a dataframe, e.g. extract the value of "var1" for Gorilla gorilla:
example_data <- data.frame(species = c("Pan_troglodytes", "Gorilla_gorilla", "Pongo_pygmaeus"),
var1 = c(5.88, 6.07, 5.83),
var2 = c(10.6, 11.2, 10.5)
)
example_data
species var1 var2
1 Pan_troglodytes 5.88 10.6
2 Gorilla_gorilla 6.07 11.2
3 Pongo_pygmaeus 5.83 10.5
The way I learnt to do this with base R is with:
example_data[[which(example_data$species == "Gorilla_gorilla"), "var1"]]
[1] 6.07
This method feels conceptually very efficient (identify location of value -> extract value) and is certainly computationally efficient (see benchmarking below). However, it doesn't seem very efficient from a coding standpoint: the name of the data variable has to be repeated, and this can lead to visually messy code, especially if the name of the data variable is long or if the method is used inside a function or within a block of other code, etc. Ideally, I'd prefer to use a tidyverse-style method that only requires you to state the data variable once. I know the same thing can be achieved with dplyr::filter()
:
library(dplyr)
filter(example_data, species == "Gorilla_gorilla")$var1
[1] 6.07
And this code is a lot cleaner, but it feels conceptually less efficient (perform operation on entire dataframe -> extract value) and is computationally slower (although we're talking microseconds, so this probably isn't really an issue).
library(microbenchmark)
Unit: microseconds
expr min lq mean median uq max neval
example_data[[which(example_data$species == "Gorilla_gorilla"), "var1"]] 6.301 8.8505 15.48891 15.9505 19.1510 77.701 100
filter(example_data, species == "Gorilla_gorilla")$var1 638.801 801.8005 904.89208 859.3505 936.8505 1782.901 100
My question is: is there another way of conditionally extracting single values from a dataframe, one that has the benefits of the tidyverse's clean code but is more direct like the base R way? Or am I just being silly and there's no reason not to just use the dplyr::filter()
method?
(Also, I know that I could set the "species" variable to rownames and then use example_data["Gorilla_gorilla", "var1"]
, which is both clean and efficient, but this would violate the tidyverse principle of not using row names).
It just seems like there's currently a gap in the tidyverse: it has dplyr::filter
as the equivalent of base data[n, ]
; dplyr::select
as the equivalent of base data[, n]
; but doesn't seem to have an equivalent of base data[n, n]
.
Upvotes: 2
Views: 4630
Reputation: 371
Thanks @hello_friend, that's a really useful survey and reference for people. For further reference, here's a benchmarking of the different methods (which is why I have to post this as an answer rather than as a comment). Again though, we're talking microseconds, so the differences may not be practically meaningful, though I think that they're quite interesting.
I think I'm actually going to stick with my original filter(example_data, species == "Gorilla_gorilla")$var1
. Having now been able to comparing it to all the other options it actually seems to hold up quite well, and it's just a hair more concise than what would be my second choice, subset(example_data, species == "Gorilla_gorilla", select = "var1")[[1]]
, which for me for practical purposes outweighs it being some microseconds slower.
Thanks for the tip about data.table
though @sindri_baldur, I hadn't come across it. It's interesting to see that it isn't really any faster for this particular operation, though, even when the conversion of the dataframe is done beforehand (DT_data[species == "Gorilla_gorilla", "var1"]
)
expr mean_time_microseconds package
1 "example_data$var1[match(\"Gorilla_gorilla\", example_data$species)]" 15413. base
2 "example_data[match(\"Gorilla_gorilla\", example_data$species), \"var1\", drop = FALSE]" 561586. base
3 "subset(example_data, species == \"Gorilla_gorilla\", select = \"var1\")" 579747. base
4 "subset(example_data, species == \"Gorilla_gorilla\", select = \"var1\")[[1]]" 604293. base
5 "filter(example_data, species == \"Gorilla_gorilla\")$var1" 1757617. dplyr
6 "example_data %>% filter(species == \"Gorilla_gorilla\") %>% pull(var1)" 2699485. dplyr
7 "DT_data[species == \"Gorilla_gorilla\", \"var1\"]" 2877631. data.table
8 "setDT(example_data)[species == \"Gorilla_gorilla\", \"var1\"]" 3044128. data.table
9 "setDT(example_data)[species == \"Gorilla_gorilla\", 2]" 3094515. data.table
10 "as.matrix(setDT(example_data)[species == \"Gorilla_gorilla\", 2])[1]" 3168429. data.table
11 "setDT(example_data)[species == \"Gorilla_gorilla\", .SD, .SDcols = 2]" 3533138. data.table
12 "example_data %>% slice(which(species == \"Gorilla_gorilla\")) %>% select(var1)" 5835452. dplyr
13 "example_data %>% filter(species == \"Gorilla_gorilla\") %>% select(var1)" 5882807. dplyr
Upvotes: 2
Reputation: 5788
There are many ways to skin a cat in R; here are lots (keep in mind you can continually mix and match alot of the functions used below) in general you have correctly observed the inefficiency of tidyverse functions:
# Base R method 1: => returns data.frame:
subset(example_data, species == "Gorilla_gorilla", select = "var1")
# Base R method 2: => returns vector length 1 (R's scalar):
example_data$var1[match("Gorilla_gorilla", example_data$species)]
# Base R method 3: => (result as df):
example_data[match("Gorilla_gorilla", example_data$species), "var1", drop = FALSE]
# Tidyverse method 1: => returns df
library(tidyverse)
example_data %>%
slice(which(species == "Gorilla_gorilla")) %>%
select(var1)
# Tidyverse method 2: => returns df
example_data %>%
filter(species == "Gorilla_gorilla") %>%
select(var1)
# Tidyverse method 3: => returns vector
example_data %>%
filter(species == "Gorilla_gorilla") %>%
pull(var1)
# data.table method 1: => returns data.table / data.frame object
library(data.table)
setDT(example_data)[species == "Gorilla_gorilla", "var1"]
# data.tabel method 2: => returns data.table / data.frame object:
setDT(example_data)[species == "Gorilla_gorilla", 2]
# data.tabel method 3: => returns data.table / data.frame object:
setDT(example_data)[species == "Gorilla_gorilla", .SD, .SDcols = 2]
# data.table method 4: => return vector:
as.matrix(setDT(example_data)[species == "Gorilla_gorilla", 2])[1]
Upvotes: 5
Reputation: 33498
Are you familiar with data.table
?
It meets your three criteria of being concise, (really) efficient, and similar to base R syntax:
library(data.table)
setDT(example_data) # Convert to data.table, convert back with setDF()
example_data[species == "Gorilla_gorilla", var1]
# [1] 6.07
If you do this repeatedly you can set a key to avoid typing the name of the filter variable:
setkey(example_data, species)
example_data["Gorilla_gorilla", var1]
# [1] 6.07
example_data["Pongo_pygmaeus", var1]
# [1] 5.83
Upvotes: 4
Reputation: 10365
The complete tidyverse
way would be to use pull
to only get a variable and not a complete data.frame:
example_data %>%
filter(species == "Gorilla_gorilla") %>%
pull(var1)
[1] 6.07
However, I haven't tested it for computational efficiency.
Upvotes: 3