E_H

Reputation: 231

How to restructure data frame for a certain kind of boxplot

I work with R and RStudio. I got my hands on a longitudinal data frame that basically looks like this:

trait_A_time_1 <- c("2.2","2.9","1.4","3.6")
trait_A_time_2 <- c("4.2","3.2","2.1","4.0")
trait_A_time_3 <- c("2.2","2.5","3.4","1.9")
trait_A_time_4 <- c("3.2","3.9","4.5","4.7")
trait_A_time_5 <- c("2.8","3.3","4.0","1.1")

df <- data.frame(trait_A_time_1, trait_A_time_2, trait_A_time_3, trait_A_time_4, trait_A_time_5)

print (df)

  trait_A_time_1 trait_A_time_2 trait_A_time_3 trait_A_time_4 trait_A_time_5
1            2.2            4.2            2.2            3.2            2.8
2            2.9            3.2            2.5            3.9            3.3
3            1.4            2.1            3.4            4.5            4.0
4            3.6            4.0            1.9            4.7            1.1

It contains measurements of a certain psychological trait in several persons, taken on multiple occasions over a few weeks. Now I want to make a boxplot that looks like this:

(image: desired boxplot)

x axis (groups): the four occasions of measurement
y axis: levels of trait A in the sample

I tried this code:

p <- ggplot(data2, aes(x=, y=)) + 
  geom_violin()
p

But it does not work, since I have no dedicated variables for the occasions or for the level of A. How exactly can I get those? How do I have to restructure this dataset to get my desired boxplots?

Upvotes: 0

Views: 169

Answers (1)

PLY

Reputation: 571

I added some sample data. This should do it

library(tidyverse)

df <- tibble(`trait A time 1` = c(3.3, 2.1, rnorm(10)),
             `trait A time 2` = c(4.1, 2.2, rnorm(10)),
             `trait A time 5` = c(3.9, 1.9, rnorm(10)))

df %>% 
  rename_with(.fn = function(x) gsub('trait A time', "", x)) %>%
  pivot_longer(cols = everything()) %>%
  ggplot(data = .,
         aes(x = name, y = value)) +
  geom_violin() +
  labs(x = "time", y = "trait A")

This produces a violin plot with one violin per measurement time.

You don't necessarily have to rename the columns like I did here; the gist of the code is the pivoting with pivot_longer().

EDIT:

As requested, I will briefly explain what the first two steps of the pipeline do. rename_with() is a function from the dplyr package that renames columns. It can be used in several ways, but here I passed it a function that is applied to all column names. That function simply replaces 'trait A time' in each column name with an empty string "". Because the pattern does not include the space before the number, the new names keep a leading space (" 1", " 2", " 5"). It is not the cleanest thing to do, but it serves its purpose.
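
To see the renaming step in isolation, here is a minimal sketch on a toy data frame. The object names toy and renamed are made up for illustration, and the pattern here deliberately includes the trailing space, which differs from the code in the answer:

```r
library(dplyr)

# Hypothetical toy data with the same naming scheme as the sample data.
# check.names = FALSE keeps the spaces in the column names.
toy <- data.frame(`trait A time 1` = c(3.3, 2.1),
                  `trait A time 2` = c(4.1, 2.2),
                  check.names = FALSE)

# Including the trailing space in the pattern avoids names like " 1".
# fixed = TRUE treats the pattern as a literal string, not a regex.
renamed <- toy %>%
  rename_with(.fn = function(x) gsub("trait A time ", "", x, fixed = TRUE))

names(renamed)
```

With the trailing space in the pattern, the column names come out as plain "1" and "2", which is usually nicer for axis labels.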

pivot_longer() is a very handy function (from the tidyr package, not dplyr) which you will likely use often from now on if you continue to work with R. Essentially, it transforms your data frame into one with more rows and fewer columns, making it a longer data frame. Long data frames are usually the way to go for plotting with ggplot. By default it creates a name column and a value column, but these column names can be changed with the names_to and values_to arguments. Notice that every row of the long data frame holds exactly one observation: a name (the measurement time in your case) and its corresponding value. Before, each row of the wide data frame held several observations at once, which is harder to map to the axes of a plot.

df %>% 
  rename_with(.fn = function(x) gsub('trait A time', "", x)) %>%
  pivot_longer(cols = everything()) %>% 
  print()
#> # A tibble: 36 x 2
#>    name   value
#>    <chr>  <dbl>
#>  1 " 1"   3.3  
#>  2 " 2"   4.1  
#>  3 " 5"   3.9  
#>  4 " 1"   2.1  
#>  5 " 2"   2.2  
#>  6 " 5"   1.9  
#>  7 " 1"   0.293
#>  8 " 2"   0.274
#>  9 " 5"  -0.869
#> 10 " 1"   2.30 
#> # ... with 26 more rows
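
Applied to the data frame from the question, the same idea might look like the sketch below. It assumes the column names follow the trait_A_time_N pattern shown in the question and that the values are stored as character strings, as in the posted code. The column names time and trait_A and the object name long are my own choices, and geom_boxplot() is swapped in for geom_violin() because the question asks for boxplots:

```r
library(tidyr)
library(dplyr)
library(ggplot2)

# Data frame as posted in the question (note: values are strings)
df <- data.frame(
  trait_A_time_1 = c("2.2", "2.9", "1.4", "3.6"),
  trait_A_time_2 = c("4.2", "3.2", "2.1", "4.0"),
  trait_A_time_3 = c("2.2", "2.5", "3.4", "1.9"),
  trait_A_time_4 = c("3.2", "3.9", "4.5", "4.7"),
  trait_A_time_5 = c("2.8", "3.3", "4.0", "1.1")
)

long <- df %>%
  pivot_longer(cols = everything(),
               names_to = "time",               # column holding the occasion
               names_prefix = "trait_A_time_",  # strip the common prefix
               values_to = "trait_A") %>%       # column holding the measurement
  mutate(trait_A = as.numeric(trait_A))         # strings -> numbers for the y axis

# One box per measurement occasion
p <- ggplot(long, aes(x = time, y = trait_A)) +
  geom_boxplot() +
  labs(x = "time", y = "trait A")
p
```

Using names_prefix makes the separate rename_with() step unnecessary here, and the as.numeric() conversion is only needed because the question's vectors were created with quoted values.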

Upvotes: 2
