Boris

Reputation: 131

How to transform negative values into dummy variables?

I want to create two dummy variables: a) one that captures all negative changes in x1. It should be 1 if x1 decreased from the previous observation and 0 otherwise.

And b) one that captures all decreases of 1 or more, for example from 10.5 to 9.5, from 10 to 9, or from 10 to 6. This is also a dummy: 1 if the change is -1 or lower, 0 otherwise.

Since the data look something like this, the variables should capture negative changes within each personID.

   personid  year   x1
    33       1990    0
    33       1991    3.5
    33       1992    2.75
    33       1993    3.25
    33       1994    6
    34       1990    17
    34       1991    9
    34       1992    16.5
    34       1993    16.75

For replication, use the code below.

set.seed(100)
mydata <- data.frame(
  x1    = sample(c(0:30, 1.5,5.75,9.25,10.25,11.75), 100, replace = TRUE),
  personID  = rep(c(1:10), each = 10)
  )

I tried to generate these variables using ave, but it doesn't help much. I know I am not using it correctly, but I'm not sure where.

mydata$a <- with(mydata, ave(x1, personID, FUN = function(x) c(TRUE, diff(x) !=-1) & x!=-1))

EDIT:

dput(data)
structure(list(personid = c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 20L, 20L, 20L, 20L, 20L, 20L, 
20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 40L, 40L, 
40L, 40L, 40L, 40L, 40L, 40L, 40L, 40L, 40L, 40L, 40L, 40L, 40L, 
40L, 40L, 41L, 41L, 41L, 41L, 41L, 41L, 41L, 41L, 41L, 41L, 41L, 
41L, 41L, 41L, 41L, 41L, 41L, 42L, 42L, 42L, 42L, 42L, 42L, 42L, 
42L, 42L, 42L, 42L, 42L, 42L, 42L, 42L, 42L, 42L, 51L, 51L, 51L, 
51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L, 
51L), x1 = c(37, 34, 30.75, 29, 37, 32.25, 25.75, 32.5, 27, 31, 
28.5, 23.75, 25.75, 28.5, 28.5, 27.75, 25.75, 25.75, 27.25, 31, 
32.5, 35.5, 27.25, 32.25, 30.5, 28.75, 29.5, 29, 29, 27, 28.75, 
28.75, 25.75, 25.75, 22, 22, 29, 30, 20, 22, 12, 11.5, 10, 14.5, 
24, 15.5, 23.5, 14, 24, 10, 9, 34, 16, 9.5, 19, 31, 20, 9.5, 
9.5, 21, 29, 20, 26, 26, 24.5, 5, 16.5, 18.5, 22.5, 31.5, 23.5, 
20, 15.25, 20.75, 32, 23.5, 25, 20, 27, 22.5, 24.5, 28.5, 18, 
17.5, 18.5, 34, 30.5, 32.5, 31, 27, 31, 31, 35.5, 31, 31, 29, 
31.5, 29.25, 31, 31, 28, 29)), .Names = c("personid", "x1"), class = "data.frame", row.names = c(NA, 
-102L))

Upvotes: 1

Views: 688

Answers (2)

aichao

Reputation: 7435

You can also use dplyr:

library(dplyr)

result <- mydata %>% group_by(personID) %>%
                     mutate(a = ifelse((x1-lag(x1)) < 0, 1, 0)) %>%
                     mutate(b = ifelse((x1-lag(x1)) <= -1, 1, 0))

Here, group_by ensures the change is computed within each personID, and mutate creates your dummy variable columns a and b. Instead of using diff, the test subtracts lag(x1) from x1, so the first row of each group is NA because it has no previous value. The results below use your simulated data with set.seed(100), except that I replaced x1 with 10.5 in row 2 to illustrate a case where a is 1 but b is 0:

print(result)
##Source: local data frame [100 x 4]
##Groups: personID [10]

##      x1 personID     a     b
##   <dbl>    <int> <dbl> <dbl>
##1     11        1    NA    NA
##2   10.5        1     1     0
##3     19        1     0     0
##4      2        1     1     1
##5     16        1     0     0
##6     17        1     0     0
##7     29        1     0     0
##8     13        1     1     1
##9     19        1     0     0
##10     6        1     1     1

Alternatively, we can use diff to test the conditions. Because diff returns one fewer element than its input, we need to prepend NA so that each new column has the same length as its group (shown here on the data from your EDIT):

result <- data %>% group_by(personid) %>%
                   mutate(a = c(NA, ifelse(diff(x1) < 0, 1, 0))) %>%
                   mutate(b = c(NA, ifelse(diff(x1) <= -1, 1, 0)))
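If you would rather have 0 than NA in the first row of each person, one possible variation is to compute the change once and treat a missing change as "no decrease". This is only a sketch: the data frame toy is typed in by hand from the sample table in the question, and the names toy, result_toy and change are made up for illustration.

```r
library(dplyr)

# Sample rows copied from the question's table (illustrative only)
toy <- data.frame(
  personid = c(33, 33, 33, 33, 33, 34, 34, 34, 34),
  year     = c(1990:1994, 1990:1993),
  x1       = c(0, 3.5, 2.75, 3.25, 6, 17, 9, 16.5, 16.75)
)

result_toy <- toy %>%
  group_by(personid) %>%
  mutate(
    change = x1 - lag(x1),                          # NA in the first row of each person
    a = as.integer(!is.na(change) & change < 0),    # any decrease
    b = as.integer(!is.na(change) & change <= -1)   # decrease of 1 or more
  ) %>%
  ungroup()

print(result_toy)
```

With these rows, a is 1 for the drop from 3.5 to 2.75 and for the drop from 17 to 9, while b is 1 only for the latter, since the former is a decrease of less than 1.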

Upvotes: 0

Ben Bolker

Reputation: 226172

What you're looking for is a combination of (1) some split-apply-combine approach (tapply in base R, ddply in plyr, group_by + mutate in dplyr, ...) and (2) diff.

Data:

set.seed(100)
mydata <- data.frame(
  x1    = sample(c(0:30, 1.5,5.75,9.25,10.25,11.75), 100, replace = TRUE),
  personID  = rep(c(1:10), each = 10)
)

You'll have to decide what you want to do about the first (or last) value in each individual's sequence, since diff returns one fewer value than its input: should the extra value be NA or 0? Here I'm setting the first value to zero.

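diff_to_dummy <- function(x) {
    # the first value has no predecessor, so it is set to 0;
    # flag decreases of 1 or more (use diff(x) < 0 for any decrease)
    c(0, as.numeric(diff(x) <= -1))
}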
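To see what the choice of first value means in practice, here is a small sketch on a made-up vector (x, drop, first_zero and first_na are illustrative names, not part of your data). It flags decreases of 1 or more, as in your variable b.

```r
# Hypothetical sequence for one person: changes are -1, 0.5 and -3.5
x <- c(10, 9, 9.5, 6)

drop <- diff(x) <= -1                  # one element shorter than x

first_zero <- c(0, as.numeric(drop))   # first observation counted as "no drop"
first_na   <- c(NA, as.numeric(drop))  # first observation marked as unknown

print(first_zero)
print(first_na)
```

Padding with 0 keeps sums and means free of missing values, while padding with NA makes it explicit that the first observation has nothing to be compared with, so later summaries need na.rm = TRUE.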

Now tapply applies the function to x1 for each personID; unlist puts the values back together.

dval <- with(mydata,unlist(tapply(x1,list(personID),diff_to_dummy)))
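Note that unlist(tapply(...)) returns the values ordered by group, which matches the row order only when the data are already sorted by personID. As a possible alternative, ave returns its result in the original row order, so it can be assigned straight back as a column. A minimal sketch on the sample rows from the question (toy is a hand-typed, illustrative data frame; the first value of each person is set to 0 as above):

```r
# Sample rows copied from the question's table (illustrative only)
toy <- data.frame(
  personid = c(33, 33, 33, 33, 33, 34, 34, 34, 34),
  x1       = c(0, 3.5, 2.75, 3.25, 6, 17, 9, 16.5, 16.75)
)

# a: any decrease relative to the previous observation of the same person
toy$a <- with(toy, ave(x1, personid, FUN = function(x) c(0, diff(x) < 0)))

# b: decrease of 1 or more
toy$b <- with(toy, ave(x1, personid, FUN = function(x) c(0, diff(x) <= -1)))

print(toy)
```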

Upvotes: 2
