Shank

Reputation: 11

Transformation of missing values by taking log(x+1)

I am trying to learn R and I have a data frame which contains 68 continuous and categorical variables. I need help with two of them: x and lnx. Wherever x is 0 or NA, lnx shows NA. I want to write code that takes log(x+1) so that the NAs in lnx become 0 where the corresponding x is 0 (if x == 0, I want lnx == 0; if x is NA, I want lnx to stay NA). The data frame looks something like this:

  a       b       c       d       e      f         x        lnx
AB1001   1.00    3.00    67.00   13.90   2.63    1776.7     7.48
AB1002   0.00    2.00    72.00   38.70   3.66    0.00       NA
AB1003   1.00    3.00    48.00   4.15    1.42    1917       7.56
AB1004   0.00    1.00    70.00   34.80   3.55    NA         NA
AB1005   1.00    1.00    34.00   3.45    1.24    3165.45    8.06
AB1006   1.00    1.00    14.00   7.30    1.99    NA         NA
AB1007   0.00    3.00    53.00   11.20   2.42    0.00       NA

I tried writing the following code -

data.frame$lnx[is.na(data.frame$lnx)] <-  log(data.frame$x +1)

but I get the following warning message and the output is wrong:

number of items to replace is not a multiple of replacement length

Can someone guide me please?

Thanks.

Upvotes: 1

Views: 2168

Answers (2)

Jan

Reputation: 43199

Using a dplyr solution:

library(dplyr)
df %>%
  mutate(lnx = case_when(
    x == 0.0 ~ 0,          # x is 0: set lnx to 0
    is.na(x) ~ NA_real_,   # x is missing: lnx stays missing
    TRUE     ~ lnx))       # otherwise keep the existing lnx value

This yields for your example:

# A tibble: 7 x 8
  a          b     c     d     e     f     x   lnx
  <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 AB1001    1.    3.   67. 13.9   2.63 1777.  7.48
2 AB1002    0.    2.   72. 38.7   3.66    0.  0.
3 AB1003    1.    3.   48.  4.15  1.42 1917.  7.56
4 AB1004    0.    1.   70. 34.8   3.55   NA  NA
5 AB1005    1.    1.   34.  3.45  1.24 3165.  8.06
6 AB1006    1.    1.   14.  7.30  1.99   NA  NA
7 AB1007    0.    3.   53. 11.2   2.42    0.  0.

Upvotes: 1

divibisan

Reputation: 12165

In R you can select rows using conditionals and assign values directly. In your example you could do this:

df[which(is.na(df$lnx) & df$x == 0), 'lnx'] <- 0

Here's what this does:

is.na(df$lnx) returns a logical vector the length of df$lnx telling, for each row, whether lnx is NA. df$x == 0 does the same thing, checking whether, for each row, x == 0. By using the & operator, we combine those vectors into one that contains TRUE only for rows where both conditions are TRUE. For rows where x is NA the comparison is itself NA, and a data frame does not accept NA row indices in an assignment, so which() is used to keep only the positions that are TRUE.

We then use the bracket notation to select the lnx column of those rows where both conditions are TRUE in df, and then insert the value 0 into those cells using <-.
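To make the intermediate vectors visible, here is a small sketch on a toy copy of the x and lnx columns from the question. The names toy, lnx_missing, x_is_zero, both and rows are made up for illustration and are not part of the original data.

```r
# Toy copy of the two relevant columns (the name 'toy' is illustrative only)
toy <- data.frame(
  x   = c(1776.7, 0, 1917, NA, 3165.45, NA, 0),
  lnx = c(7.48, NA, 7.56, NA, 8.06, NA, NA)
)

# One logical value per row for each condition
lnx_missing <- is.na(toy$lnx)
x_is_zero   <- toy$x == 0          # NA wherever x itself is NA

# TRUE only where both hold; NA where lnx is missing and x is NA
both <- lnx_missing & x_is_zero

# which() returns the positions that are TRUE and drops the NA entries
rows <- which(both)

# Assign 0 to lnx in exactly those rows
toy[rows, "lnx"] <- 0
print(toy)
```

Printing lnx_missing, x_is_zero and both one after another shows how the NA values in x carry through the comparison and why they have to be dropped before the assignment.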

The specific warning you're getting is because log(data.frame$x + 1) and data.frame$lnx[is.na(data.frame$lnx)] are different lengths. log(data.frame$x + 1) produces a vector whose length is the number of rows of your data frame, while the length of data.frame$lnx[is.na(data.frame$lnx)] is the number of rows that have NA in lnx.
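The length mismatch can be checked directly, and one possible way around it is to index the right-hand side with the same logical mask so both sides line up. This is only a sketch on a toy copy of the question's columns; the names toy and missing_lnx are assumptions for illustration. It uses base R's log1p(), which computes log(1 + x).

```r
# Toy copy of the two relevant columns (the name 'toy' is illustrative only)
toy <- data.frame(
  x   = c(1776.7, 0, 1917, NA, 3165.45, NA, 0),
  lnx = c(7.48, NA, 7.56, NA, 8.06, NA, NA)
)

missing_lnx <- is.na(toy$lnx)

# Left-hand side of the original assignment: only the NA slots in lnx
n_slots  <- length(toy$lnx[missing_lnx])
# Right-hand side of the original assignment: one value per row
n_values <- length(log(toy$x + 1))
cat("slots to fill:", n_slots, " values offered:", n_values, "\n")

# Apply the same mask to x so the replacement has one value per slot.
# log1p(0) is 0 and log1p(NA) is NA, so rows with x == NA stay NA.
toy$lnx[missing_lnx] <- log1p(toy$x[missing_lnx])
print(toy)
```

With the mask on both sides, rows where x is 0 get lnx 0, rows where x is NA keep NA, and the existing lnx values are left untouched.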

Upvotes: 1
