Joakim
Joakim

Reputation: 2152

Dplyr: Add columns containing values from matching rows

Example data:

library(tidyverse)

cars <- tribble(
  ~make,    ~year,    ~hp,  ~disp,
  "Volvo",   2000,    111,    160,
  "Volvo",   2004,    105,    161,
  "Volvo",   2006,     93,    108,
  "Honda",   2000,    110,    258,
  "Honda",   2003,    175,    360,
  "Honda",   2003,    105,    225,
  "Honda",   2006,    245,    360,
  "Ford",    2000,     62,    146,
  "Ford",    2005,     95,    140,
  "Ford",    2007,    123,    167,
  "Ford",    2009,    123,    167
)


For each row, I want to add the columns prev_hp and prev_disp, which should contain the hp and disp values from from the same make, 2-4 years prior. The result should look like this:

tribble(
    ~make,    ~year,    ~hp,  ~disp,  ~prev_hp,  ~prev_disp,
    "Volvo",   2000,    111,    160,        NA,          NA,
    "Volvo",   2004,    105,    161,       111,         160,
    "Volvo",   2006,     93,    108,       105,         161,
    "Honda",   2000,    110,    258,        NA,          NA,
    "Honda",   2003,    175,    360,       110,         258,
    "Honda",   2004,    105,    225,       110,         360,
    "Honda",   2006,    245,    360,      175*,        225*,
    "Ford",    2000,     62,    146,        NA,          NA,
    "Ford",    2005,     95,    140,        NA,          NA,
    "Ford",    2007,    123,    167,        95,         140,
    "Ford",    2009,    123,    167,       123,         167
)

* For this row, two previous rows matches the criteria (Honda in 2003 and in 2004). In this case, the values from either match can be used, or even NA.

I've tried this:

cars %>%
rowwise() %>%
mutate(
    prev_hp = cars[cars$make == make & (cars$year >= year - 4 | cars$year <= year - 2), 'hp'],
    prev_disp = cars[cars$make == make & (cars$year >= year - 4 | cars$year <= year - 2), 'disp']
)

but it fails when there are more or less than exactly one matching previous row, and it's also very inefficient.

Is there a better way to solve this, or at least some way to fix my solution?

(the real dataset is 100k rows)

Upvotes: 1

Views: 44

Answers (1)

Anoushiravan R
Anoushiravan R

Reputation: 21938

I'm not sure if this is the output you are looking for. Let's take a look:

library(dplyr)

cars <- tribble(
  ~make,    ~year,    ~hp,  ~disp,
  "Volvo",   2000,    111,    160,
  "Volvo",   2004,    105,    161,
  "Volvo",   2006,     93,    108,
  "Honda",   2000,    110,    258,
  "Honda",   2003,    175,    360,
  "Honda",   2003,    105,    225,
  "Honda",   2006,    245,    360,
  "Ford",    2000,     62,    146,
  "Ford",    2005,     95,    140,
  "Ford",    2007,    123,    167,
  "Ford",    2009,    123,    167
)

cars %>%
  group_by(make) %>%
  mutate(across(hp:disp, ~ lag(.x), .names = "prev_{.col}"),
         across(prev_hp:prev_disp, ~ ifelse(year - lag(year) > 4, NA, .x)))

# A tibble: 11 x 6
# Groups:   make [3]
   make   year    hp  disp prev_hp prev_disp
   <chr> <dbl> <dbl> <dbl>   <dbl>     <dbl>
 1 Volvo  2000   111   160      NA        NA
 2 Volvo  2004   105   161     111       160
 3 Volvo  2006    93   108     105       161
 4 Honda  2000   110   258      NA        NA
 5 Honda  2003   175   360     110       258
 6 Honda  2003   105   225     175       360
 7 Honda  2006   245   360     105       225
 8 Ford   2000    62   146      NA        NA
 9 Ford   2005    95   140      NA        NA
10 Ford   2007   123   167      95       140
11 Ford   2009   123   167     123       167

Upvotes: 2

Related Questions