Brian Carmicle
Brian Carmicle

Reputation: 95

Reshape multivariable data with two rows as headers (wide to long)

I have an Excel sheet formatted like the following:

             2015    2015    2016    2016    2017    2017
Name         Height  Weight  Height  Weight  Height  Weight  
Alice        12      34      56      78      90      12      
Bob          55      55      55      55      55      55     
...

My goal is to produce a tidy data frame like:

Name    Year    Height    Weight
Alice   2015    12        34
Alice   2016    56        78
Alice   2017    90        12
Bob     2015    55        55
Bob     2016    55        55
Bob     2017    55        55
...

I can see how one might use read_excel and then gather from the tidyverse to create a data frame if the year row didn't exist, but I don't understand how to do this with two header rows. The main point where I'm stuck is that obviously a column can only have one name, but it seems like I want to at least temporarily have two names for each column. What's the best way of doing this?

Upvotes: 1

Views: 890

Answers (3)

camille
camille

Reputation: 16881

This is tricky but common in getting data out of Excel files. I pasted your data into an xlsx file and read in with readxl::read_excel, but for the sake of reproducibility, I'm also pasting the dput output here. I set col_names = F so it would just have dummy column names, giving me each of those two header levels in rows, as shown below:

library(dplyr)
library(tidyr)

# df <- readxl::read_excel("multicols.xlsx", col_names = F)
df <- structure(list(...1 = c(NA, "Name", "Alice", "Bob"), ...2 = c("2015", 
                                                              "Height", "12", "55"), ...3 = c("2015", "Weight", "34", "55"), 
               ...4 = c("2016", "Height", "56", "55"), ...5 = c("2016", 
                                                                "Weight", "78", "55"), ...6 = c("2017", "Height", "90", "55"
                                                                ), ...7 = c("2017", "Weight", "12", "55")), row.names = c(NA, 
                                                                                                                          -4L), class = c("tbl_df", "tbl", "data.frame"))
df
#> # A tibble: 4 x 7
#>   ...1  ...2   ...3   ...4   ...5   ...6   ...7  
#>   <chr> <chr>  <chr>  <chr>  <chr>  <chr>  <chr> 
#> 1 <NA>  2015   2015   2016   2016   2017   2017  
#> 2 Name  Height Weight Height Weight Height Weight
#> 3 Alice 12     34     56     78     90     12    
#> 4 Bob   55     55     55     55     55     55

The years are in the first row and the measures are in the second, so I pull each of those out:

(yrs <- df[1,])
#> # A tibble: 1 x 7
#>   ...1  ...2  ...3  ...4  ...5  ...6  ...7 
#>   <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 <NA>  2015  2015  2016  2016  2017  2017
(cols <- df[2,])
#> # A tibble: 1 x 7
#>   ...1  ...2   ...3   ...4   ...5   ...6   ...7  
#>   <chr> <chr>  <chr>  <chr>  <chr>  <chr>  <chr> 
#> 1 Name  Height Weight Height Weight Height Weight

Then I paste those two together to get a vector of decent column names:

clean_names <- stringr::str_remove(paste(cols, yrs, sep = "_"), "_NA")
clean_names
#> [1] "Name"        "Height_2015" "Weight_2015" "Height_2016" "Weight_2016"
#> [6] "Height_2017" "Weight_2017"

Now I can remove those two rows and set proper names:

df %>%
  slice(-1:-2) %>%
  setNames(clean_names)
#> # A tibble: 2 x 7
#>   Name  Height_2015 Weight_2015 Height_2016 Weight_2016 Height_2017
#>   <chr> <chr>       <chr>       <chr>       <chr>       <chr>      
#> 1 Alice 12          34          56          78          90         
#> 2 Bob   55          55          55          55          55         
#> # … with 1 more variable: Weight_2017 <chr>

Lastly, reshape the data into a long shape, separate the key into the measure (height or weight) and year, then spread back into a wide shape.

df %>%
  slice(-1:-2) %>%
  setNames(clean_names) %>%
  gather(key, value, -Name) %>%
  separate(key, into = c("measure", "year")) %>%
  spread(key = measure, value)
#> # A tibble: 6 x 4
#>   Name  year  Height Weight
#>   <chr> <chr> <chr>  <chr> 
#> 1 Alice 2015  12     34    
#> 2 Alice 2016  56     78    
#> 3 Alice 2017  90     12    
#> 4 Bob   2015  55     55    
#> 5 Bob   2016  55     55    
#> 6 Bob   2017  55     55

Upvotes: 2

Arturo Sbr
Arturo Sbr

Reputation: 6343

What you are trying to do is called melting or reshaping your data from wide format to long format.

Suppose your data is stored in dt. Here's a data.table approach (which has nicer syntax than tidyverse):

library(data.table)
dt2 <- data.table(melt(dt, id.vars = "Name", variable.name = "Measurement", value.name = "Value"), year = rep(rep(2015:2017, each = 2), times = 2))

Output:

> dt2
     Name Measurement Value year
 1: Alice      Height    12 2015
 2:   Bob      Height    55 2015
 3: Alice      Weight    34 2016
 4:   Bob      Weight    55 2016
 5: Alice      Height    56 2017
 6:   Bob      Height    55 2017
 7: Alice      Weight    78 2015
 8:   Bob      Weight    55 2015
 9: Alice      Height    90 2016
10:   Bob      Height    55 2016
11: Alice      Weight    12 2017
12:   Bob      Weight    55 2017

You will notice that I have Weight and Height measurements in the same column. I recommend this instead of having a separate column for each variable because it is compatible with group by syntax.

The thing that is curious about your data is that you have two rows as headers. This means you will have to adjust the year = ... argument in my answer depending on your data.

In general, to create your year column, you need:

year = rep(rep(`insert year start`:`insert year end`),
           times = dt$Name %>% unique %>% length))

Upvotes: 1

Brendan A.
Brendan A.

Reputation: 1268

This is a reasonably common problem (people actually use excel workbooks like that) but one that involves several steps to resolve in R. Here's one way to do it where I assume your dataframe is called dat:

library(dplyr)
library(tidyr)
library(magrittr) # for the two-way pipe %<>%

# Start by renaming your columns to include both the year and variable
# The use of '-' to separate the parts is for convenience in the regex below
names(dat)[2:ncol(dat)] <- paste(dat[1, 2:ncol(dat)],
                                 names(dat)[2:ncol(dat)],
                                 sep = "-")
names(dat)[1] <- "Name"
names(dat) <- sub("__\\d+", "", names(dat))

# Drop the now useless first row
dat <- dat[2:nrow(dat), ]

# Transform the data
dat %<>%
  gather(key = var, value = val, -Name) %>%
  mutate(Year = sub("^.*?-", "", var),
         var = sub("-\\d+$", "", var)) %>%
  spread(key = var, value = as.numeric(val))

The piped sequence that transforms the data works like this: First, your intuition was right in that we need to use gather to transform from wide to long. Second, we create our "Year" variable and get rid of those numbers from the temporary "var" variable. Finally, we have to use spread separate the Height and Weight variables. Since the second header row in the original data was text, we'll also transform those values into numeric in that step.

Upvotes: 4

Related Questions