Reputation: 41
What are the differences between the data.frame, tribble, and tibble functions? Which is easier and which is more useful for analyzing lots of data? I'm creating a data frame and I don't know which one to choose.
Upvotes: 4
Views: 2311
Reputation: 27060
A data frame is a table-like structure with rows and columns, where each column holds values of a single type but different columns can hold different types. Its purpose is similar to spreadsheets or SQL tables. Here is a simple example to illustrate.
Suppose you have data about people, including their name, age, and employment status. You can store this data in vectors like this:
> names <- c('John', 'Sylvia', 'Arthemis')
> age <- c(32, 16, 21)
> employed <- c(TRUE, FALSE, TRUE)
A data frame allows us to combine all the data related to a person in one row. To create it, we pass the vectors as arguments to data.frame():
> df <- data.frame(Name = names, Age = age, Working = employed)
> df
Name Age Working
1 John 32 TRUE
2 Sylvia 16 FALSE
3 Arthemis 21 TRUE
This format makes the data much clearer. Data frames also simplify many operations. For example, filtering:
> df[df$Age > 20, ]
Name Age Working
1 John 32 TRUE
3 Arthemis 21 TRUE
Data frames are versatile and useful for many data manipulation tasks, including filtering, aggregating, and plotting.
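As a rough sketch of the aggregating case, base R's aggregate() can compute a summary per group. The block rebuilds the same df so it runs on its own; the mean_age name is only for illustration.

```r
# Rebuild the example data frame from the vectors above
names <- c("John", "Sylvia", "Arthemis")
age <- c(32, 16, 21)
employed <- c(TRUE, FALSE, TRUE)
df <- data.frame(Name = names, Age = age, Working = employed)

# Mean age per employment status:
# Working = FALSE has one person (16), Working = TRUE has two (32 and 21)
mean_age <- aggregate(Age ~ Working, data = df, FUN = mean)
print(mean_age)
```

The formula Age ~ Working reads as "Age grouped by Working"; swapping FUN for another function such as max or length gives other summaries.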
Tibbles are a modern version of data frames, part of the tidyverse collection of packages. They offer some subtle improvements over traditional data frames.
One difference is that tibbles display more information:
> library(tibble)
> t <- tibble(Name = names, Age = age, Working = employed)
> t
# A tibble: 3 × 3
Name Age Working
<chr> <dbl> <lgl>
1 John 32 TRUE
2 Sylvia 16 FALSE
3 Arthemis 21 TRUE
Tibbles also avoid some confusing features found in data frames. For instance, with data frames, you can access a column using only the beginning of its name:
> df$N
[1] "John" "Sylvia" "Arthemis"
Although this may seem practical, it can lead to hard-to-understand code and potential bugs if multiple columns share a prefix. In contrast, tibbles will return NULL and display a warning if you try this:
> t$N
NULL
Warning message:
Unknown or uninitialized column: `N`.
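To make the shared-prefix problem concrete, here is a minimal sketch using only base R. The Nationality column and the variable names are made up for the example; the point is that the same df$N expression quietly changes meaning once a second column starting with "N" exists.

```r
# A small data frame with a single column starting with "N"
df <- data.frame(Name = c("John", "Sylvia"), Age = c(32, 16),
                 stringsAsFactors = FALSE)

# Unique prefix: `$` partially matches "N" to the Name column
partial_before <- df$N

# Add a second column starting with "N" (hypothetical column)
df$Nationality <- c("US", "BR")

# Now the prefix is ambiguous, so `$` returns NULL
partial_after <- df$N

# `[[` matches exactly by default, so it never guesses
exact <- df[["N"]]

print(partial_before)
print(partial_after)
print(exact)
```

If you stay with plain data frames, base R's warnPartialMatchDollar option (set with options(warnPartialMatchDollar = TRUE)) asks R to warn when `$` falls back to partial matching.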
For more differences, see the Tibble documentation.
tribble() Function
While tibble() creates tibbles from vectors, tribble() offers a different way to create them, using a more readable syntax for defining columns and rows directly.
> t2 <- tribble(
+ ~Name, ~Age, ~`Employment status`,
+ "John", 32, TRUE,
+ "Sylvia", 16, FALSE,
+ "Arthemis", 21, TRUE
+ )
The tribble() function is particularly useful for creating small datasets for examples and testing. However, the resulting object is the same as one created with tibble():
# A tibble: 3 × 3
Name Age `Employment status`
<chr> <dbl> <lgl>
1 John 32 TRUE
2 Sylvia 16 FALSE
3 Arthemis 21 TRUE
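One way to check this yourself is to build the same table both ways and compare the results. This is a sketch that assumes the tibble package is installed; by_column and by_row are illustrative names.

```r
library(tibble)

# Column-wise definition with tibble()
by_column <- tibble(
  Name = c("John", "Sylvia", "Arthemis"),
  Age = c(32, 16, 21),
  `Employment status` = c(TRUE, FALSE, TRUE)
)

# Row-wise definition with tribble()
by_row <- tribble(
  ~Name,      ~Age, ~`Employment status`,
  "John",     32,   TRUE,
  "Sylvia",   16,   FALSE,
  "Arthemis", 21,   TRUE
)

# Compare the classes and the contents of the two objects
same_class <- identical(class(by_column), class(by_row))
same_content <- isTRUE(all.equal(by_column, by_row))
print(same_class)
print(same_content)
```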
Both data frames and tibbles are useful, but some contexts may favor one over the other.
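Since you can convert between the two, the choice is not permanent. A minimal sketch, assuming the tibble package is installed (tb and back are illustrative names):

```r
library(tibble)

df <- data.frame(
  Name = c("John", "Sylvia", "Arthemis"),
  Age = c(32, 16, 21),
  Working = c(TRUE, FALSE, TRUE)
)

# Data frame -> tibble
tb <- as_tibble(df)

# Tibble -> plain data frame
back <- as.data.frame(tb)

# A tibble is still a data frame, with extra classes on top
print(class(tb))
print(class(back))
```

Because a tibble inherits from data.frame, most functions that expect a data frame should accept a tibble as well; converting back is mainly useful for code that relies on base data frame behavior.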
When creating tibbles, choose the function based on your data source:
- tibble() when reading data from files or vectors.
- tribble() when entering hardcoded values.
Although tibble() and tribble() return the same type of object, their argument structures differ drastically. Their similar names often cause confusion.
If you mistakenly use tibble() with tribble() arguments, you'll get an error like this:
> # ❌ WRONG!
> tibble(
+ ~Name, ~Age, ~`Employment status`,
+ "John", 32, TRUE
+ )
Error:
! All columns in a tibble must be vectors.
✖ Column `~Name` is a `formula` object.
Run `rlang::last_error()` to see where the error occurred.
Conversely, using tribble() with tibble() arguments will result in:
> # ❌ WRONG!
> tribble(Name = names, Age = age, Working = employed)
Error:
! Must specify at least one column using the `~name` syntax.
Run `rlang::last_error()` to see where the error occurred.
If you encounter errors like these, double-check that you're using the correct function and argument structure.
(I'm posting this addendum so people googling for these errors can find this Q&A. I spent an hour trying to understand why I was getting that error. This is a surprisingly ungoogleable topic!)
Upvotes: 9