Reputation: 11
I am trying to learn R, and I have a data frame that contains 68 continuous and categorical variables. I need help with two of them, x and lnx. x contains a large number of 0s and NAs, and lnx is NA in those rows. I want to write code that takes log(x + 1) so that the NAs in lnx become 0 wherever the corresponding x is 0 (if x == 0, I want lnx == 0; if x is NA, I want lnx to stay NA). The data frame looks something like this:
a       b     c     d      e      f     x        lnx
AB1001  1.00  3.00  67.00  13.90  2.63  1776.7   7.48
AB1002  0.00  2.00  72.00  38.70  3.66  0.00     NA
AB1003  1.00  3.00  48.00  4.15   1.42  1917     7.56
AB1004  0.00  1.00  70.00  34.80  3.55  NA       NA
AB1005  1.00  1.00  34.00  3.45   1.24  3165.45  8.06
AB1006  1.00  1.00  14.00  7.30   1.99  NA       NA
AB1007  0.00  3.00  53.00  11.20  2.42  0.00     NA
I tried writing the following code -
data.frame$lnx[is.na(data.frame$lnx)] <- log(data.frame$x +1)
but I get the following warning message and the output is wrong:
number of items to replace is not a multiple of replacement length
Can someone guide me, please?
Thanks.
Upvotes: 1
Views: 2168
Reputation: 43199
Using a dplyr solution:
library(dplyr)
df %>%
  mutate(lnx = case_when(
    x == 0.0 ~ 0,
    is.na(x) ~ NA_real_))
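This yields the following for your example. Note that rows matching neither condition get NA, because case_when() has no fallback branch here: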
# A tibble: 7 x 8
a b c d e f x lnx
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 AB1001 1. 3. 67. 13.9 2.63 1777. NA
2 AB1002 0. 2. 72. 38.7 3.66 0. 0.
3 AB1003 1. 3. 48. 4.15 1.42 1917. NA
4 AB1004 0. 1. 70. 34.8 3.55 NA NA
5 AB1005 1. 1. 34. 3.45 1.24 3165. NA
6 AB1006 1. 1. 14. 7.30 1.99 NA NA
7 AB1007 0. 3. 53. 11.2 2.42 0. 0.
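To keep the existing lnx values for rows where x is neither 0 nor NA, one option is to add a catch-all branch. This is a minimal sketch, not the answer's original code: it assumes the data frame is called df and uses a made-up three-row stand-in with only the two relevant columns.

```r
library(dplyr)

# Hypothetical stand-in for the question's data (only x and lnx)
df <- data.frame(
  x   = c(1776.7, 0, NA),
  lnx = c(7.48, NA, NA)
)

res <- df %>%
  mutate(lnx = case_when(
    x == 0   ~ 0,         # x is zero: set lnx to 0
    is.na(x) ~ NA_real_,  # x is missing: lnx stays missing
    TRUE     ~ lnx        # otherwise keep the existing lnx value
  ))
```

The branches are evaluated in order, so the final TRUE branch only applies to rows that matched none of the earlier conditions.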
Upvotes: 1
Reputation: 12165
In R you can select rows using conditionals and assign values directly. In your example you could do this:
df[which(is.na(df$lnx) & df$x == 0), 'lnx'] <- 0
Here's what this does:
is.na(df$lnx) returns a logical vector the length of df$lnx telling, for each row, whether lnx is NA. df$x == 0 does the same thing, checking for each row whether x == 0. By using the & operator, we combine those vectors into one that contains TRUE only for rows where both conditions are TRUE.
Because x contains NA, that combined vector is NA for some rows, and R does not allow missing values in the row index when assigning into a data frame. Wrapping the condition in which() keeps only the positions that are TRUE.
We then use the bracket notation to select the lnx column of those rows in df and insert the value 0 into those cells using <-.
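The intermediate vectors can be inspected one at a time. The following is a small sketch on a made-up three-row data frame (the variable names lnx_missing, x_is_zero, and both are only for illustration). Since comparing NA with 0 gives NA, the combined vector can contain NA as well, so the sketch uses which() to pick out only the positions that are definitely TRUE.

```r
# Hypothetical stand-in for the relevant columns of the question's data
df <- data.frame(
  x   = c(1776.7, 0, NA),
  lnx = c(7.48, NA, NA)
)

lnx_missing <- is.na(df$lnx)            # FALSE TRUE TRUE
x_is_zero   <- df$x == 0                # FALSE TRUE NA (comparing NA gives NA)
both        <- lnx_missing & x_is_zero  # FALSE TRUE NA

# which() returns the positions that are TRUE and drops NA
df$lnx[which(both)] <- 0
```

After this, only the second row has changed: lnx is 0 there, while the row with a missing x still has a missing lnx.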
The specific warning you're getting appears because log(data.frame$x + 1) and data.frame$lnx[is.na(data.frame$lnx)] are different lengths. log(data.frame$x + 1) produces a vector whose length is the number of rows of your data frame, while the length of data.frame$lnx[is.na(data.frame$lnx)] is the number of rows that have NA in lnx.
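The length mismatch can be seen directly, and the original log(x + 1) idea works once both sides use the same index. This is a sketch on a made-up four-row data frame; idx is just an illustrative name.

```r
# Hypothetical stand-in for the relevant columns of the question's data
df <- data.frame(
  x   = c(1776.7, 0, NA, 0),
  lnx = c(7.48, NA, NA, NA)
)

idx <- is.na(df$lnx)    # rows whose lnx is missing

length(df$lnx[idx])     # 3: only the rows being replaced
length(log(df$x + 1))   # 4: one value per row of the data frame

# Subset x with the same index so both sides line up row by row
df$lnx[idx] <- log(df$x[idx] + 1)
```

Because log(0 + 1) is 0 and log(NA + 1) is NA, this gives lnx == 0 where x is 0 and leaves lnx as NA where x is NA.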
Upvotes: 1