Jordan Correa

Reputation: 41

What are the differences between the 'data.frame', 'tribble' and 'tibble' functions?

What are the differences between the data.frame, tribble, and tibble functions? Which is easier and which is more useful for analyzing lots of data? I'm creating a data frame and I don't know which one to choose.

Upvotes: 4

Views: 2311

Answers (1)

brandizzi

Reputation: 27060

Data Frames

A data frame is a table-like structure with rows and columns. Each column holds values of a single type, but different columns can have different types. Its purpose is similar to that of spreadsheets or SQL tables. Here is a simple example.

Example

Suppose you have data about people, including their name, age, and employment status. You can store this data in vectors like this:

> names <- c('John', 'Sylvia', 'Arthemis')
> age <- c(32, 16, 21)
> employed <- c(TRUE, FALSE, TRUE)

A data frame allows us to combine all the data related to a person in one row. To create it, we pass the vectors as arguments to data.frame():

> df <- data.frame(Name = names, Age = age, Working = employed)
> df
      Name Age Working
1     John  32    TRUE
2   Sylvia  16   FALSE
3 Arthemis  21    TRUE

This format makes the data much clearer. Data frames also simplify many operations. For example, filtering:

> df[df$Age > 20, ]
      Name Age Working
1     John  32    TRUE
3 Arthemis  21    TRUE

Data frames are versatile and useful for many data manipulation tasks, including filtering, aggregating, and plotting.
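As a rough illustration of aggregating, here is a minimal sketch using base R's aggregate() on the same example data. Computing the mean age per employment status is just an illustrative choice, not something the example above requires.

```r
# Rebuild the example data frame so this snippet runs on its own
df <- data.frame(
  Name = c("John", "Sylvia", "Arthemis"),
  Age = c(32, 16, 21),
  Working = c(TRUE, FALSE, TRUE)
)

# Mean age for each employment status (one row per value of Working)
mean_age <- aggregate(Age ~ Working, data = df, FUN = mean)
print(mean_age)
```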

Tibbles

Tibbles are a modern version of data frames, part of the tidyverse collection of packages. They offer some subtle improvements over traditional data frames.

Differences from Data Frames

One difference is that tibbles display more information:

> t <- tibble(Name = names, Age = age, Working = employed)
> t
# A tibble: 3 × 3
  Name      Age Working
  <chr>   <dbl> <lgl>  
1 John       32 TRUE   
2 Sylvia     16 FALSE  
3 Arthemis   21 TRUE   

Tibbles also avoid some confusing features found in data frames. For instance, with data frames, you can access a column using only the beginning of its name:

> df$N
[1] "John"     "Sylvia"   "Arthemis"

Although this may seem practical, it can lead to hard-to-understand code and potential bugs if multiple columns share a prefix. In contrast, tibbles will return NULL and display a warning if you try this:

> t$N
NULL
Warning message:
Unknown or uninitialized column: `N`. 

For more differences, see the Tibble documentation.
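One more difference that tends to matter in practice is what happens when you select a single column with [. The sketch below assumes the tibble package is installed and uses a shortened version of the example data. A data frame silently simplifies to a plain vector, while a tibble stays a tibble.

```r
library(tibble)

df <- data.frame(Name = c("John", "Sylvia"), Age = c(32, 16))
tb <- tibble(Name = c("John", "Sylvia"), Age = c(32, 16))

# Selecting one column from a data frame drops to a plain vector
df_col <- df[, "Age"]
is.data.frame(df_col)   # FALSE

# The same selection on a tibble still returns a tibble
tb_col <- tb[, "Age"]
is.data.frame(tb_col)   # TRUE
```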

The tribble() Function

While tibble() creates tibbles from vectors, tribble() offers a different way to create them, using a more readable syntax for defining columns and rows directly.

> t2 <- tribble(
+  ~Name,      ~Age, ~`Employment status`,
+  "John",      32,   TRUE,
+  "Sylvia",    16,   FALSE,
+  "Arthemis",  21,   TRUE
+ )

The tribble() function is particularly useful for creating small datasets for examples and testing. However, the resulting object is the same as one created with tibble():

> t2
# A tibble: 3 × 3
  Name      Age `Employment status`
  <chr>   <dbl> <lgl>              
1 John       32 TRUE               
2 Sylvia     16 FALSE              
3 Arthemis   21 TRUE               
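To check that claim yourself, you could build the same small table both ways and compare the results. This is only a sketch, assuming the tibble package is installed; the variable names are made up for the illustration.

```r
library(tibble)

# Column-wise definition
from_tibble <- tibble(Name = c("John", "Sylvia"), Age = c(32, 16))

# Row-wise definition of the same data
from_tribble <- tribble(
  ~Name,    ~Age,
  "John",   32,
  "Sylvia", 16
)

# Both objects have the same class, column names and values
same_class  <- identical(class(from_tibble), class(from_tribble))
same_names  <- identical(names(from_tibble), names(from_tribble))
same_values <- identical(from_tibble$Age, from_tribble$Age) &&
  identical(from_tibble$Name, from_tribble$Name)
```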

Choosing Between Data Frames and Tibbles

Both data frames and tibbles are useful, but some contexts may favor one over the other.

  • If you are not using tidyverse, traditional data frames are likely more convenient.
  • If you are using tidyverse, tibbles are preferable to avoid some of the confusing behaviors of data frames.

When creating tibbles, choose the function based on your data source:

  • Use tibble() when your data is already in vectors, for example columns loaded from files.
  • Use tribble() when entering hardcoded values.
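The choice is not permanent, since you can convert in either direction. A minimal sketch, assuming the tibble package is installed:

```r
library(tibble)

df <- data.frame(Name = c("John", "Sylvia"), Age = c(32, 16))

# Data frame -> tibble
tb <- as_tibble(df)
inherits(tb, "tbl_df")     # TRUE

# Tibble -> plain data frame
back <- as.data.frame(tb)
inherits(back, "tbl_df")   # FALSE
```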

Avoiding Common Mistakes

Although tibble() and tribble() return the same type of object, their argument structures differ drastically. Their similar names often cause confusion.

If you mistakenly use tibble() with tribble() arguments, you'll get an error like this:

> # ❌ WRONG!
> tibble(
+  ~Name, ~Age, ~`Employment status`,
+  "John", 32, TRUE
+ )
Error:
! All columns in a tibble must be vectors.
✖ Column `~Name` is a `formula` object.
Run `rlang::last_error()` to see where the error occurred.

Conversely, using tribble() with tibble() arguments will result in:

> # ❌ WRONG!
> tribble(Name = names, Age = age, Working = employed)
Error:
! Must specify at least one column using the `~name` syntax.
Run `rlang::last_error()` to see where the error occurred.

If you encounter errors like these, double-check that you're using the correct function and argument structure.

(I'm posting this addendum so people googling for these errors can find this Q&A. I spent an hour trying to understand why I was getting that error. This is a surprisingly ungoogleable topic!)

Upvotes: 9
