Reputation: 95
I have an Excel sheet formatted like the following:
2015 2015 2016 2016 2017 2017 Name Height Weight Height Weight Height Weight Alice 12 34 56 78 90 12 Bob 55 55 55 55 55 55 ...
My goal is to produce a tidy data frame like:
Name Year Height Weight Alice 2015 12 34 Alice 2016 56 78 Alice 2017 90 12 Bob 2015 55 55 Bob 2016 55 55 Bob 2017 55 55 ...
I can see how one might use read_excel
and then gather
from the tidyverse
to create a data frame if the year row didn't exist, but I don't understand how to do this with two header rows. The main point where I'm stuck is that obviously a column can only have one name, but it seems like I want to at least temporarily have two names for each column. What's the best way of doing this?
Upvotes: 1
Views: 890
Reputation: 16881
This is tricky but common in getting data out of Excel files. I pasted your data into an xlsx file and read in with readxl::read_excel
, but for the sake of reproducibility, I'm also pasting the dput
output here. I set col_names = F
so it would just have dummy column names, giving me each of those two header levels in rows, as shown below:
library(dplyr)
library(tidyr)
# df <- readxl::read_excel("multicols.xlsx", col_names = F)
df <- structure(list(...1 = c(NA, "Name", "Alice", "Bob"), ...2 = c("2015",
"Height", "12", "55"), ...3 = c("2015", "Weight", "34", "55"),
...4 = c("2016", "Height", "56", "55"), ...5 = c("2016",
"Weight", "78", "55"), ...6 = c("2017", "Height", "90", "55"
), ...7 = c("2017", "Weight", "12", "55")), row.names = c(NA,
-4L), class = c("tbl_df", "tbl", "data.frame"))
df
#> # A tibble: 4 x 7
#> ...1 ...2 ...3 ...4 ...5 ...6 ...7
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 <NA> 2015 2015 2016 2016 2017 2017
#> 2 Name Height Weight Height Weight Height Weight
#> 3 Alice 12 34 56 78 90 12
#> 4 Bob 55 55 55 55 55 55
The years are in the first row and the measures are in the second, so I pull each of those out:
(yrs <- df[1,])
#> # A tibble: 1 x 7
#> ...1 ...2 ...3 ...4 ...5 ...6 ...7
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 <NA> 2015 2015 2016 2016 2017 2017
(cols <- df[2,])
#> # A tibble: 1 x 7
#> ...1 ...2 ...3 ...4 ...5 ...6 ...7
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Name Height Weight Height Weight Height Weight
Then I paste those two together to get a vector of decent column names:
clean_names <- stringr::str_remove(paste(cols, yrs, sep = "_"), "_NA")
clean_names
#> [1] "Name" "Height_2015" "Weight_2015" "Height_2016" "Weight_2016"
#> [6] "Height_2017" "Weight_2017"
Now I can remove those two rows and set proper names:
df %>%
slice(-1:-2) %>%
setNames(clean_names)
#> # A tibble: 2 x 7
#> Name Height_2015 Weight_2015 Height_2016 Weight_2016 Height_2017
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Alice 12 34 56 78 90
#> 2 Bob 55 55 55 55 55
#> # … with 1 more variable: Weight_2017 <chr>
Lastly, reshape the data into a long shape, separate the key into the measure (height or weight) and year, then spread back into a wide shape.
df %>%
slice(-1:-2) %>%
setNames(clean_names) %>%
gather(key, value, -Name) %>%
separate(key, into = c("measure", "year")) %>%
spread(key = measure, value)
#> # A tibble: 6 x 4
#> Name year Height Weight
#> <chr> <chr> <chr> <chr>
#> 1 Alice 2015 12 34
#> 2 Alice 2016 56 78
#> 3 Alice 2017 90 12
#> 4 Bob 2015 55 55
#> 5 Bob 2016 55 55
#> 6 Bob 2017 55 55
Upvotes: 2
Reputation: 6343
What you are trying to do is called melting or reshaping your data from wide format to long format.
Suppose your data is stored in dt
. Here's a data.table
approach (which has nicer syntax than tidyverse):
library(data.table)
dt2 <- data.table(melt(dt, id.vars = "Name", variable.name = "Measurement", value.name = "Value"), year = rep(rep(2015:2017, each = 2), times = 2))
Output:
> dt2
Name Measurement Value year
1: Alice Height 12 2015
2: Bob Height 55 2015
3: Alice Weight 34 2016
4: Bob Weight 55 2016
5: Alice Height 56 2017
6: Bob Height 55 2017
7: Alice Weight 78 2015
8: Bob Weight 55 2015
9: Alice Height 90 2016
10: Bob Height 55 2016
11: Alice Weight 12 2017
12: Bob Weight 55 2017
You will notice that I have Weight
and Height
measurements in the same column. I recommend this instead of having a separate column for each variable because it is compatible with group by
syntax.
The thing that is curious about your data is that you have two rows as headers. This means you will have to adjust the year = ...
argument in my answer depending on your data.
In general, to create your year
column, you need:
year = rep(rep(`insert year start`:`insert year end`),
times = dt$Name %>% unique %>% length))
Upvotes: 1
Reputation: 1268
This is a reasonably common problem (people actually use excel workbooks like that) but one that involves several steps to resolve in R. Here's one way to do it where I assume your dataframe is called dat
:
library(dplyr)
library(tidyr)
library(magrittr) # for the two-way pipe %<>%
# Start by renaming your columns to include both the year and variable
# The use of '-' to separate the parts is for convenience in the regex below
names(dat)[2:ncol(dat)] <- paste(dat[1, 2:ncol(dat)],
names(dat)[2:ncol(dat)],
sep = "-")
names(dat)[1] <- "Name"
names(dat) <- sub("__\\d+", "", names(dat))
# Drop the now useless first row
dat <- dat[2:nrow(dat), ]
# Transform the data
dat %<>%
gather(key = var, value = val, -Name) %>%
mutate(Year = sub("^.*?-", "", var),
var = sub("-\\d+$", "", var)) %>%
spread(key = var, value = as.numeric(val))
The piped sequence that transforms the data works like this: First, your intuition was right in that we need to use gather
to transform from wide to long. Second, we create our "Year" variable and get rid of those numbers from the temporary "var" variable. Finally, we have to use spread
separate the Height and Weight variables. Since the second header row in the original data was text, we'll also transform those values into numeric in that step.
Upvotes: 4