Susie Derkins
Susie Derkins

Reputation: 2638

Coding a regexp in the names_pattern argument of pivot_longer for multiple variables differentiated by presence or absence of a prefix

I have a dataset I need to pivot_longer() with two kinds of variables in its columns: quantity and proportion. Quantity is entered under the species name, and proportion is entered under P followed by the species name. Each row represents the data for an hour-long sampling period.

Here is a simplified version of the data I'm working with. (EDIT: I added another column to the code, RH_percent, to help refine the regexp.)

sample input data

#code to recreate input data
sampledata <- read_csv("date,time,RH_percent,Cx_tarsalis,Ps_columb,PCx_tarsalis,PPs_columb
2020-07-20,19:00:00,0.25,3,4,0.03,0.04
2020-07-20,20:00:00,0.5,6,8,0.06,0.08
2020-07-20,21:00:00,0.75,9,12,0.09,0.12")

This is what I want my output to look like: sample output data

#code to recreate desired output data
sampleoutput <- read_csv("date,time,RH_percent,species,quantity,P
         2020-07-20,19:00:00,0.25,Cx_tarsalis,3,0.03
         2020-07-20,19:00:00,0.25,Ps_columb,4,0.04
         2020-07-20,20:00:00,0.5,Cx_tarsalis,6,0.06
         2020-07-20,20:00:00,0.5,Ps_columb,8,0.08
         2020-07-20,21:00:00,0.75,Cx_tarsalis,9,0.09
         2020-07-20,21:00:00,0.75,Ps_columb,12,0.12")

I know the code will look something like this, and I know I need to specify a regexp in the names_pattern argument:

sampledata %>% pivot_longer(cols = -c(date,time),
                            names_to = c(".value","species"),
                            names_pattern = "")

I have been studying examples online, including the Pivoting vignette and Roger Peng's regexp videos on youtube but have not hit on the correct names_pattern to get the output I need.

A similar problem is solved in: pivot_longer multiple variables of different kinds, but the way the variables are represented in the columns, and thus the pattern in the regexp, are substantially different from what I need. Thank you!

Upvotes: 3

Views: 428

Answers (1)

akrun
akrun

Reputation: 887291

Here is one approach with renameing

  1. Rename the columns 3 and 4 by adding a suffix -quantity
  2. Rename the columns 5 and 6 by adding suffix -P
  3. Use the names_sep as - and specify the names_to in the order species, .value in pivot_longer
library(dplyr)
library(stringr)
library(tidyr)
sampledata %>%
    rename_with(~ str_c(., '-quantity'), 3:4) %>%
    rename_with(~ str_c(str_remove(., '^P'), '-P'), 5:6) %>%    
   pivot_longer(cols = -c(date,time),
                            names_to = c("species", ".value"),
                            names_sep = "-")

-output

# A tibble: 6 x 5
  date       time   species     quantity     P
  <date>     <time> <chr>          <dbl> <dbl>
1 2020-07-20 19:00  Cx_tarsalis        3  0.03
2 2020-07-20 19:00  Ps_columb          4  0.04
3 2020-07-20 20:00  Cx_tarsalis        6  0.06
4 2020-07-20 20:00  Ps_columb          8  0.08
5 2020-07-20 21:00  Cx_tarsalis        9  0.09
6 2020-07-20 21:00  Ps_columb         12  0.12

Or another approach is just to add a prefix where the column names have only two character before the _ i.e. those are the quantity columns 'Q'. Then, in the names_pattern, capture the first character ((.)) as the first capture group, then the rest of characters ((.*)) as second, which will denote for ".value" and "species" specified in names_to

sampledata %>%
     rename_with(~ str_c('Q', .), matches('^.._')) %>% 
     pivot_longer(cols = -c(date, time), 
        names_to = c(".value", "species"), names_pattern = "^(.)(.*)") %>% 
     rename(quantity = Q)
# A tibble: 6 x 5
  date       time   species     quantity     P
  <date>     <time> <chr>          <dbl> <dbl>
1 2020-07-20 19:00  Cx_tarsalis        3  0.03
2 2020-07-20 19:00  Ps_columb          4  0.04
3 2020-07-20 20:00  Cx_tarsalis        6  0.06
4 2020-07-20 20:00  Ps_columb          8  0.08
5 2020-07-20 21:00  Cx_tarsalis        9  0.09
6 2020-07-20 21:00  Ps_columb         12  0.12

Update

With the OP's new dataset, use ignore.case = FALSE in matches as by default it is TRUE

sampledata %>%
     rename_with(~ str_c('Q', .), matches('^[A-Z][a-z]_[a-z]', 
       ignore.case = FALSE)) %>% 
     pivot_longer(cols = -c(date, time, RH_percent), 
        names_to = c(".value", "species"), names_pattern = "^(.)(.*)") %>% 
     rename(quantity = Q)

-output

# A tibble: 6 x 6
  date       time   RH_percent species     quantity     P
  <date>     <time>      <dbl> <chr>          <dbl> <dbl>
1 2020-07-20 19:00        0.25 Cx_tarsalis        3  0.03
2 2020-07-20 19:00        0.25 Ps_columb          4  0.04
3 2020-07-20 20:00        0.5  Cx_tarsalis        6  0.06
4 2020-07-20 20:00        0.5  Ps_columb          8  0.08
5 2020-07-20 21:00        0.75 Cx_tarsalis        9  0.09
6 2020-07-20 21:00        0.75 Ps_columb         12  0.12

Upvotes: 3

Related Questions