Reputation: 2638
I have a dataset I need to pivot_longer() with two kinds of variables in its columns: quantity and proportion. Quantity is entered under the species name, and proportion is entered under P followed by the species name. Each row represents the data for an hour-long sampling period.
Here is a simplified version of the data I'm working with. (EDIT: I added another column to the code, RH_percent, to help refine the regexp.)
#code to recreate input data
sampledata <- read_csv("date,time,RH_percent,Cx_tarsalis,Ps_columb,PCx_tarsalis,PPs_columb
2020-07-20,19:00:00,0.25,3,4,0.03,0.04
2020-07-20,20:00:00,0.5,6,8,0.06,0.08
2020-07-20,21:00:00,0.75,9,12,0.09,0.12")
This is what I want my output to look like:
#code to recreate desired output data
sampleoutput <- read_csv("date,time,RH_percent,species,quantity,P
2020-07-20,19:00:00,0.25,Cx_tarsalis,3,0.03
2020-07-20,19:00:00,0.25,Ps_columb,4,0.04
2020-07-20,20:00:00,0.5,Cx_tarsalis,6,0.06
2020-07-20,20:00:00,0.5,Ps_columb,8,0.08
2020-07-20,21:00:00,0.75,Cx_tarsalis,9,0.09
2020-07-20,21:00:00,0.75,Ps_columb,12,0.12")
I know the code will look something like this, and I know I need to specify a regexp in the names_pattern argument:
sampledata %>% pivot_longer(cols = -c(date,time),
names_to = c(".value","species"),
names_pattern = "")
I have been studying examples online, including the Pivoting vignette and Roger Peng's regexp videos on youtube but have not hit on the correct names_pattern to get the output I need.
A similar problem is solved in: pivot_longer multiple variables of different kinds, but the way the variables are represented in the columns, and thus the pattern in the regexp, are substantially different from what I need. Thank you!
Upvotes: 3
Views: 428
Reputation: 887291
Here is one approach with rename
ing
-quantity
-P
names_sep
as -
and specify the names_to
in the order species
, .value
in pivot_longer
library(dplyr)
library(stringr)
library(tidyr)
sampledata %>%
rename_with(~ str_c(., '-quantity'), 3:4) %>%
rename_with(~ str_c(str_remove(., '^P'), '-P'), 5:6) %>%
pivot_longer(cols = -c(date,time),
names_to = c("species", ".value"),
names_sep = "-")
-output
# A tibble: 6 x 5
date time species quantity P
<date> <time> <chr> <dbl> <dbl>
1 2020-07-20 19:00 Cx_tarsalis 3 0.03
2 2020-07-20 19:00 Ps_columb 4 0.04
3 2020-07-20 20:00 Cx_tarsalis 6 0.06
4 2020-07-20 20:00 Ps_columb 8 0.08
5 2020-07-20 21:00 Cx_tarsalis 9 0.09
6 2020-07-20 21:00 Ps_columb 12 0.12
Or another approach is just to add a prefix where the column names have only two character before the _
i.e. those are the quantity columns 'Q'. Then, in the names_pattern
, capture the first character ((.)
) as the first capture group, then the rest of characters ((.*)
) as second, which will denote for ".value" and "species" specified in names_to
sampledata %>%
rename_with(~ str_c('Q', .), matches('^.._')) %>%
pivot_longer(cols = -c(date, time),
names_to = c(".value", "species"), names_pattern = "^(.)(.*)") %>%
rename(quantity = Q)
# A tibble: 6 x 5
date time species quantity P
<date> <time> <chr> <dbl> <dbl>
1 2020-07-20 19:00 Cx_tarsalis 3 0.03
2 2020-07-20 19:00 Ps_columb 4 0.04
3 2020-07-20 20:00 Cx_tarsalis 6 0.06
4 2020-07-20 20:00 Ps_columb 8 0.08
5 2020-07-20 21:00 Cx_tarsalis 9 0.09
6 2020-07-20 21:00 Ps_columb 12 0.12
With the OP's new dataset, use ignore.case = FALSE
in matches
as by default it is TRUE
sampledata %>%
rename_with(~ str_c('Q', .), matches('^[A-Z][a-z]_[a-z]',
ignore.case = FALSE)) %>%
pivot_longer(cols = -c(date, time, RH_percent),
names_to = c(".value", "species"), names_pattern = "^(.)(.*)") %>%
rename(quantity = Q)
-output
# A tibble: 6 x 6
date time RH_percent species quantity P
<date> <time> <dbl> <chr> <dbl> <dbl>
1 2020-07-20 19:00 0.25 Cx_tarsalis 3 0.03
2 2020-07-20 19:00 0.25 Ps_columb 4 0.04
3 2020-07-20 20:00 0.5 Cx_tarsalis 6 0.06
4 2020-07-20 20:00 0.5 Ps_columb 8 0.08
5 2020-07-20 21:00 0.75 Cx_tarsalis 9 0.09
6 2020-07-20 21:00 0.75 Ps_columb 12 0.12
Upvotes: 3